Publications

Hate Content Identification in Code-mixed Social Media Data

Social media has witnessed a tremendous boom over the past few years, which has in turn generated vast quantities of user-generated content. Users share their opinions, expressions, and emotions openly on social networking sites. Sometimes, these unabated expressions of feelings and thoughts transcend the limits of decency and lead to swearing, bullying, and character assassination. Harsh and derogatory words are directed toward individuals or groups. Often these acts are termed hate speech. As more and more people gain easy access to the Internet, social media gets flooded with such expressions, comments, and opinions in many diverse languages. It becomes increasingly difficult to monitor such hate speech, especially the posts of multilingual people, which are frequently code-mixed, i.e., made up of multiple languages. In order to mitigate the spread of offensive content on social media platforms, the initial step is to detect and identify such textual content. Manually identifying hate speech or bullying in code-mixed languages is a time-consuming and arduous task, if not an impossible one. Automating this task involves substantial complications and challenges due to the variety and volume of the data. In a multilingual code-mixed environment with multiple languages and scripts, the task becomes even more difficult. This chapter begins with different initiatives and approaches related to the identification of offensive content and hate speech in the English language, and then we study their applications on code-mixed data. We evaluate the limitations of the current state-of-the-art techniques, which are proficient at handling text written in a single language and script. Subsequently, we delve into the challenges associated with processing multilingual content. We also list different sources of code-mixed datasets and the challenges faced while processing multilingual content.
We attempt to delineate the timeline of the development of models for hate speech recognition on code-mixed data, covering their advantages, data pre-processing techniques, and limitations. Starting with machine learning (ML) and deep learning (DL) approaches, we show how additional textual representation techniques were progressively incorporated. We then move on to multilingual pre-trained models (for example, mBERT, XLMR, MuRIL) for text classification. To corroborate our discussion, we consider a specific dataset (ICHCL 2021, 2022) as a case study for analyzing and identifying hate content in code-mixed online datasets that are conversational in nature. In addition to stand-alone sentences, where a single sentence carries its own context, we also discuss cases where previous conversations and the chronology of messages are taken into account. In the end, we discuss the shortcomings of the existing techniques and provide directions for future work.
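To make the starting point of that timeline concrete, the sketch below shows a classical ML baseline for hate content identification: a multinomial Naive Bayes classifier over a bag-of-words representation. The toy Hinglish-style examples and the HOF/NOT labels are invented for illustration only; this is a minimal stdlib-only sketch, not the chapter's actual experimental setup.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase whitespace tokenization; real code-mixed pipelines also
    # need script normalization and transliteration handling.
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label in self.label_counts:
            # Log prior + smoothed log likelihood of each token.
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Hypothetical toy Hinglish-style training data (invented for illustration).
train_texts = [
    "tu bekar insaan hai get lost",
    "yeh log sab idiots hain total nonsense",
    "aaj ka match bahut accha tha loved it",
    "very nice photo yaar keep it up",
]
train_labels = ["HOF", "HOF", "NOT", "NOT"]

clf = NaiveBayes().fit(train_texts, train_labels)
print(clf.predict("kya bekar nonsense hai"))  # leans toward "HOF"
print(clf.predict("bahut accha yaar nice"))   # leans toward "NOT"
```

Such bag-of-words baselines ignore word order and conversational context entirely, which is precisely the gap the later DL and multilingual pre-trained models (mBERT, XLMR, MuRIL) discussed above aim to close.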

The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media

Stopwords are scattered throughout documents; their presence in sentences has the least semantic significance, and such terms make up a sizable portion of a collection without adding semantic value. Thus, stopwords should be eliminated from a document for an improved language description. In this paper, we explore and evaluate the effect of stopword removal on the performance of information retrieval over code-mixed social media data in Indian language pairs such as Bengali–English. A considerable amount of research has been performed in the areas of sentiment analysis, language identification, and language generation for code-mixed languages. However, no such work has been done on the removal of stopwords from code-mixed documents, which is the motivation behind this work. We devote our attention to comparing the impact of corpus-based stopword removal against non-corpus-based stopword removal on information retrieval for code-mixed data, and to the question of how to find the best stopword list for each constituent language of a code-mixed language. We observed that corpus-based stopword removal improved Mean Average Precision (MAP) by 16% compared to non-corpus-based stopword removal. For both languages, threshold values were tuned together based on the TF-IDF score, which yielded the optimal stopword lists.
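The corpus-based idea can be sketched as follows: terms that occur in a large fraction of documents have low IDF and are treated as stopwords. The romanized Bengali–English toy corpus and the 0.7 document-frequency threshold are invented for illustration; the paper tunes its thresholds on TF-IDF scores over the actual corpus.

```python
from collections import Counter

def corpus_stopwords(docs, df_threshold):
    """Derive a corpus-based stopword list: any term appearing in more
    than df_threshold of the documents (i.e., low IDF) is a stopword."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each term once per doc
    return {w for w, c in df.items() if c / n > df_threshold}

def remove_stopwords(doc, stopwords):
    return " ".join(t for t in doc.lower().split() if t not in stopwords)

# Hypothetical romanized Bengali-English code-mixed corpus (invented).
corpus = [
    "ami the movie ta khub bhalo legeche",
    "the weather ta aj khub kharap",
    "ami aj office e the whole day chilam",
    "khub bhalo khela hoyeche aj the stadium e",
]
sw = corpus_stopwords(corpus, df_threshold=0.7)
print(sorted(sw))                       # high-DF terms from both languages
print(remove_stopwords(corpus[0], sw))  # document with stopwords stripped
```

Note that a single document-frequency cut-off applies to terms of both constituent languages at once, which is why the paper tunes the thresholds for the two languages together rather than reusing a precompiled monolingual stopword list.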

Coarse and Fine-Grained Conversational Hate Speech and Offensive Content Identification in Code-Mixed Languages using Fine-Tuned Multilingual Embedding

We are seeing an increase in hateful and offensive tweets and comments on social media platforms like Facebook and Twitter, impacting our social lives. Because of this, there is an increasing need to identify online postings that violate accepted norms. For resource-rich languages like English, the challenge of identifying hateful and offensive posts has been well investigated. However, it remains unexplored for languages with limited resources, like Marathi. Code-mixing frequently occurs in the social media sphere; therefore, identifying conversational hate and offensive posts and comments in code-mixed languages is also challenging and unexplored. Across three objectives of the HASOC 2022 shared task, we proposed approaches for recognizing offensive language on Twitter in Marathi and two code-mixed languages (i.e., Hinglish and German). Some tasks can be expressed as binary classification (also known as coarse-grained, which entails categorizing tweets as hateful and offensive or not), while others can be expressed as multi-class classification (also known as fine-grained, where hateful and offensive tweets are further categorized as Standalone Hate or Contextual Hate). We concatenate the parent, comment, and reply texts to create a dataset with additional context. We use multilingual Bidirectional Encoder Representations from Transformers (mBERT), which has been pre-trained, to acquire contextual representations of tweets. We carried out several trials using various pre-processing methods and pre-trained models. Finally, the highest-scoring models were used for our submissions in the competition, which ranked our team (irlab@iitbhu) second out of 14, seventh out of 11, sixth out of 10, fourth out of 7, and fifth out of 6 for ICHCL task 1, ICHCL task 2, and Marathi subtasks 3A, 3B, and 3C, respectively.
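The parent-comment-reply concatenation step described above can be sketched as a small helper that flattens a conversation thread, oldest turn first, into a single input sequence. The "[SEP]" marker mirrors BERT's separator token, but splicing it in as plain text is only an illustration: in a real pipeline the mBERT tokenizer would insert the special tokens itself. The thread texts are invented for illustration.

```python
def build_context_input(parent, comment=None, reply=None, sep="[SEP]"):
    """Concatenate a conversation thread (oldest first) into one sequence
    so the classifier can see the context a reply is responding to.
    Missing turns (e.g., a top-level comment with no reply) are skipped."""
    parts = [p.strip() for p in (parent, comment, reply) if p]
    return f" {sep} ".join(parts)

# Hypothetical conversation thread (invented for illustration).
parent = "Did you see the match yesterday?"
comment = "Haan yaar, referee was totally biased"
reply = "Bilkul, usko hata dena chahiye"

text = build_context_input(parent, comment, reply)
print(text)  # parent, comment, and reply joined by the separator
```

This matters for the Contextual Hate label in particular: a reply that looks innocuous on its own can only be judged hateful once the parent and comment it endorses are visible in the same input sequence.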