Hate Speech Identification in Code-Mixed Data

Identifying hate speech in code-mixed text is challenging but necessary, given the rise of harmful content on social media platforms. The goal of this project was to detect and classify hate speech in code-mixed social media conversations, where a single message often combines multiple languages. To build an effective model, a dataset of code-mixed text was collected and manually annotated with one of three labels: hate speech, offensive language, or neutral text. The primary challenge lay in the implicit nature of hate speech: users often rely on metaphor or sarcasm to convey harmful messages across languages. The project used transformer-based models such as BERT, XLM-RoBERTa, and MuRIL to capture the semantics of these mixed-language texts. Special attention was given to code-switching patterns and to the context in which the hate speech was expressed. The results showed that transformer-based models significantly outperformed traditional approaches at identifying hate speech in code-mixed data, especially for implicit language and sarcasm.
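One way to compare classifiers on a three-label task like this is a macro-averaged F1 score, which weights hate speech, offensive language, and neutral text equally even when neutral text dominates the data. The sketch below is illustrative only: the label strings, the `macro_f1` helper, and the choice of metric are assumptions, not details reported by the project.

```python
from collections import Counter

# Hypothetical label names mirroring the three annotation categories.
LABELS = ("hate", "offensive", "neutral")


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-label F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true label but was missed

    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Made-up gold labels and predictions, for illustration only.
gold = ["hate", "offensive", "neutral", "neutral"]
pred = ["hate", "neutral", "neutral", "neutral"]
print(round(macro_f1(gold, pred), 3))
```

In this toy example the missed "offensive" message pulls the score down sharply, which is the behavior one would want when minority classes matter most.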

Supriya Chanda
Research Scholar (2018-2024)

Supriya Chanda (pronounced "Supriyo") completed his Ph.D. in the Department of Computer Science and Engineering, IIT (BHU), Varanasi. He carried out his research under the guidance of Dr. Sukomal Pal at the Information Retrieval Lab.