Coarse and Fine-Grained Conversational Hate Speech and Offensive Content Identification in Code-Mixed Languages using Fine-Tuned Multilingual Embedding

Supriya Chanda, Sacchit D Sheth, Sukomal Pal

December, 2022

Abstract

Hateful and offensive tweets and comments are increasing on social media platforms such as Facebook and Twitter, and they affect our social lives. As a result, there is a growing need to identify online posts that violate accepted norms. For resource-rich languages like English, the problem of identifying hateful and offensive posts has been well investigated. However, it remains unexplored for languages with limited resources like Marathi. Code-mixing occurs frequently on social media, so identifying conversational hate and offensive posts and comments in code-mixed languages is also challenging and unexplored. Across three tasks of the HASOC 2022 shared task, we propose approaches for recognizing offensive language on Twitter in Marathi and two code-mixed languages (i.e., Hinglish and German). Some tasks can be framed as binary classification (also known as coarse-grained, where a tweet is labeled as containing hate and offensive content or not). Others can be framed as multi-class classification (also known as fine-grained, where hate and offensive tweets must be further categorized as Standalone Hate or Contextual Hate). We concatenate the parent, comment, and reply texts to create a dataset with additional context. We use multilingual Bidirectional Encoder Representations from Transformers (mBERT), which has been pre-trained to learn contextual representations, to represent the tweets. We carried out several trials using various pre-processing methods and pre-trained models. Finally, the highest-scoring models were used for our competition submissions, in which our team (irlab@iitbhu) ranked second out of 14, seventh out of 11, sixth out of 10, fourth out of 7, and fifth out of 6 for ICHCL task 1, ICHCL task 2, Marathi subtask 3A, subtask 3B, and subtask 3C, respectively.
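
The context-building step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `build_context_text`, the `[SEP]` separator string, and the whitespace normalization are all assumptions made for illustration. It only shows one plausible way to join a parent tweet, a comment, and a reply into a single input string before it is passed to a tokenizer.

```python
from typing import Optional

# Assumed separator: BERT-style models use a [SEP] token between segments,
# but the abstract does not say how the texts were actually joined.
SEP = " [SEP] "


def build_context_text(
    parent: str,
    comment: Optional[str] = None,
    reply: Optional[str] = None,
) -> str:
    """Join a parent tweet with its comment and reply (hypothetical helper).

    Missing or blank levels of the conversation are skipped, so a standalone
    tweet is returned unchanged apart from whitespace normalization.
    """
    parts = [parent, comment, reply]
    # Collapse runs of whitespace and drop empty or missing parts.
    cleaned = [" ".join(p.split()) for p in parts if p and p.strip()]
    return SEP.join(cleaned)
```

Under these assumptions, a reply is classified together with the tweet and comment it responds to, which is what allows a label such as Contextual Hate to depend on the surrounding conversation rather than on the reply alone.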

Type

Preprint

Publication

Forum for Information Retrieval Evaluation

hate hasoc

Supriya Chanda

Research Scholar (2018-2024)