With over a third of the world’s population using social media, it’s especially important to detect how bad actors propagate online harassment, including hate speech. Hate speech and toxicity detection systems are used to filter content on a range of online platforms, including Facebook, Twitter, YouTube, and various publications.
According to the researchers, the root of this kind of discriminatory behavior lies in the data creation process, which yields highly biased training datasets. When trained on biased datasets, models acquire and exacerbate those biases, for example flagging text by Black authors as more toxic than text by white authors.
Jigsaw, a subsidiary of Alphabet, claims it has taken pains to remove bias from its models after a study showed they fared poorly on speech by Black users. It’s unclear, however, to what extent the same holds for AI-powered moderation tools from other companies.
Researchers at the Allen Institute recently investigated techniques for addressing lexical (e.g., swear words, slurs, identity mentions) and dialectal imbalances in datasets, to check whether current model debiasing approaches can mitigate biases in toxic language detection. Lexical biases associate toxicity with the presence of certain words, like profanities, while dialectal biases correlate toxicity with “markers” of language varieties like African-American English (AAE).
The researchers examined one debiasing method designed to tackle “predefined biases” (e.g., lexical and dialectal), as well as a process that filters out “easy” training examples — those containing correlations that might mislead a hate speech and toxicity detection model.
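The filtering idea can be sketched roughly as follows. This is a minimal illustration, not the paper’s actual method: the word list, the shallow predictor, and the toy examples are all invented for demonstration, and real filtering approaches use trained classifiers rather than a fixed lexicon.

```python
# Hypothetical sketch of "easy example" filtering for a toxicity dataset.
# A shallow lexical predictor calls text toxic iff it contains a word from
# a small profanity lexicon. Examples this predictor already gets right
# carry the lexical shortcut, so they are dropped; what remains forces a
# model to learn signals beyond the shortcut.

LEXICON = {"idiot", "stupid"}  # stand-in lexicon (assumption, not the paper's)

def shallow_predict(text: str) -> bool:
    """True if any lexicon word appears in the text."""
    lowered = text.lower()
    return any(word in lowered for word in LEXICON)

def filter_easy(dataset):
    """Keep only examples the shallow lexical predictor gets wrong."""
    return [(text, label) for text, label in dataset
            if shallow_predict(text) != label]

# Toy dataset of (text, is_toxic) pairs, invented for illustration.
data = [
    ("you are an idiot", True),                # easy: cue matches label
    ("what a lovely day", False),              # easy: no cue, nontoxic
    ("wow, real smart move", True),            # hard: sarcasm, no cue
    ("that idiot-proof design works", False),  # hard: cue but nontoxic
]
hard = filter_easy(data)  # only the two "hard" examples survive
```

The point of the sketch is the selection criterion: examples that a trivially biased predictor already classifies correctly are exactly the ones that teach a model the lexical shortcut.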
In their experiments, both approaches struggled to mitigate biases in a model trained on a biased dataset for hate speech and toxic language detection. Even though filtering reduced bias in the dataset, models trained on the filtered data still picked up lexical and dialectal biases, and even “debiased” models unfairly flagged text in certain snippets as toxic.
Perhaps more discouragingly, mitigating dialectal bias didn’t appear to change a model’s tendency to label text by Black authors as more toxic than text by white authors.
“Our findings suggest that instead of solely relying on development of automatic debiasing for existing, imperfect datasets, future work focus primarily on the quality of the underlying data for hate speech detection, such as accounting for speaker identity and dialect,” the researchers wrote. “Indeed, such efforts could act as an important step towards making systems less discriminatory, and hence safe and usable.”
Furthermore, the researchers embarked on a proof-of-concept study involving relabeling examples of supposedly toxic text whose translations from African-American English (AAE) to “white-aligned English” were deemed nontoxic. They used OpenAI’s GPT-3 to perform the translations and create a synthetic dataset — a dataset, they say, that resulted in a model less prone to dialectal and racial biases.
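The relabeling step of that proof of concept can be sketched as below. This is an assumption-laden outline: `translate` and `toxicity_score` are stand-ins for the GPT-3 translation step and a trained toxicity classifier, and the 0.5 threshold is invented for illustration.

```python
# Hypothetical sketch of the relabeling proof of concept: an example
# originally marked toxic is flipped to nontoxic when a toxicity scorer
# judges its dialect "translation" nontoxic. Both callables are stand-ins
# (assumptions), injected so the control flow can be shown without an API.

def relabel(examples, translate, toxicity_score, threshold=0.5):
    """Return examples with toxic labels flipped where the translated
    text scores below the toxicity threshold."""
    relabeled = []
    for text, label in examples:
        if label and toxicity_score(translate(text)) < threshold:
            label = False  # translation deemed nontoxic: flip the label
        relabeled.append((text, label))
    return relabeled

# Dummy stand-ins, invented for illustration only.
fake_translate = lambda t: t.upper()
fake_score = lambda t: 0.9 if "BAD" in t else 0.1

out = relabel(
    [("bad thing", True), ("fine thing", True), ("hello", False)],
    fake_translate,
    fake_score,
)
# The "fine thing" example is flipped to nontoxic; the rest are unchanged.
```

Note that only examples currently labeled toxic are re-examined; nontoxic labels pass through untouched, which matches the one-directional relabeling described above.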
“Overall, our findings indicate that debiasing a model already trained on biased toxic language data can be challenging,” said the researchers, who caution against deploying their proof-of-concept approach because of its limitations and ethical implications. Moreover, the researchers note that GPT-3 likely wasn’t exposed to many African American English varieties during training, making it ill-suited for this purpose.