Publications

Hate Content Identification in Code-mixed Social Media Data

Social media has witnessed a tremendous boom over the past few years, which has in turn generated vast quantities of user-generated content. Users share their opinions, expressions, and emotions openly on social networking sites. Sometimes, these unabated expressions of feelings and thoughts transcend the limits of decency and lead to swearing, bullying, and character assassination. Harsh and derogatory words are directed toward individuals or groups. Often these acts are termed hate speech. As more and more people gain easy access to the Internet, social media gets flooded with such expressions, comments, and opinions in many diverse languages. It becomes increasingly difficult to monitor such hate speech, especially the posts of multilingual people, which are frequently code-mixed, i.e., made up of multiple languages. In order to mitigate the spread of offensive content on social media platforms, the initial step is to detect and identify such textual content. Manually identifying hate speech or bullying in code-mixed languages is a time-consuming and arduous task, if not an impossible one. Automating this task involves substantial complications and challenges due to the variety and volume of the data. In a multilingual code-mixed environment with multiple languages and scripts, the task becomes even more difficult. This chapter begins with different initiatives and approaches related to the identification of offensive content and hate speech in the English language, and then we study their applications on code-mixed data. We evaluate the limitations of the current state-of-the-art techniques, which are proficient at handling text written in a single language and script. Subsequently, we delve into the challenges associated with processing multilingual content. We also list different sources of code-mixed datasets and the challenges faced while processing multilingual content.
We attempt to delineate the timeline of the development of models for hate speech recognition on code-mixed data, covering their advantages, data pre-processing techniques, and limitations. Starting with machine learning (ML) and deep learning (DL) approaches, we show how additional textual representation techniques were progressively incorporated. We then move on to multilingual pre-trained models (for example, mBERT, XLMR, MuRIL) for text classification. To corroborate our discussion, we consider a specific dataset (ICHCL 2021, 2022) as a case study for analyzing and identifying hate content in code-mixed online datasets that are conversational in nature. In addition to stand-alone sentences, where a single sentence carries its own context, we also discuss cases where previous conversations and the chronology of messages are taken into account. In the end, we discuss the shortcomings of the existing techniques and provide directions for future work.
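To make the starting point of that timeline concrete, the sketch below shows a classical ML baseline for hate content identification: a multinomial Naive Bayes classifier over a bag-of-words representation. The toy Hinglish-style examples and the HOF/NOT labels are invented for illustration only; this is a minimal stdlib-only sketch, not the chapter's actual experimental setup.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase whitespace tokenization; real code-mixed pipelines also
    # need script normalization and transliteration handling.
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label in self.label_counts:
            # Log prior + smoothed log likelihood of each token.
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Hypothetical toy Hinglish-style training data (invented for illustration).
train_texts = [
    "tu bekar insaan hai get lost",
    "yeh log sab idiots hain total nonsense",
    "aaj ka match bahut accha tha loved it",
    "very nice photo yaar keep it up",
]
train_labels = ["HOF", "HOF", "NOT", "NOT"]

clf = NaiveBayes().fit(train_texts, train_labels)
print(clf.predict("kya bekar nonsense hai"))  # leans toward "HOF"
print(clf.predict("bahut accha yaar nice"))   # leans toward "NOT"
```

Such bag-of-words baselines ignore word order and conversational context entirely, which is precisely the gap the later DL and multilingual pre-trained models (mBERT, XLMR, MuRIL) discussed above aim to close.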

The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media

Stopwords are scattered throughout documents; their presence in sentences has the least semantic significance, and such terms make up a sizable portion of a collection without adding semantic value. Thus, stopwords should be eliminated from a document for an improved language description. In this paper, we explore and evaluate the effect of stopword removal on the performance of information retrieval over code-mixed social media data in Indian language pairs such as Bengali–English. A considerable amount of research has been performed in the areas of sentiment analysis, language identification, and language generation for code-mixed languages. However, no such work has been done on the removal of stopwords from code-mixed documents, which is the motivation behind this work. We devote our attention to comparing the impact of corpus-based stopword removal against non-corpus-based stopword removal on information retrieval for code-mixed data, and to the question of how to find the best stopword list for each constituent language of a code-mixed language. We observed that corpus-based stopword removal improved Mean Average Precision (MAP) by 16% compared to non-corpus-based stopword removal. For both languages, threshold values were tuned together based on the TF-IDF score, which yielded the optimal stopword lists.
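The corpus-based idea can be sketched as follows: terms that occur in a large fraction of documents have low IDF and are treated as stopwords. The romanized Bengali–English toy corpus and the 0.7 document-frequency threshold are invented for illustration; the paper tunes its thresholds on TF-IDF scores over the actual corpus.

```python
from collections import Counter

def corpus_stopwords(docs, df_threshold):
    """Derive a corpus-based stopword list: any term appearing in more
    than df_threshold of the documents (i.e., low IDF) is a stopword."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each term once per doc
    return {w for w, c in df.items() if c / n > df_threshold}

def remove_stopwords(doc, stopwords):
    return " ".join(t for t in doc.lower().split() if t not in stopwords)

# Hypothetical romanized Bengali-English code-mixed corpus (invented).
corpus = [
    "ami the movie ta khub bhalo legeche",
    "the weather ta aj khub kharap",
    "ami aj office e the whole day chilam",
    "khub bhalo khela hoyeche aj the stadium e",
]
sw = corpus_stopwords(corpus, df_threshold=0.7)
print(sorted(sw))                       # high-DF terms from both languages
print(remove_stopwords(corpus[0], sw))  # document with stopwords stripped
```

Note that a single document-frequency cut-off applies to terms of both constituent languages at once, which is why the paper tunes the thresholds for the two languages together rather than reusing a precompiled monolingual stopword list.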

Coarse and Fine-Grained Conversational Hate Speech and Offensive Content Identification in Code-Mixed Languages using Fine-Tuned Multilingual Embedding

We are seeing an increase in hateful and offensive tweets and comments on social media platforms like Facebook and Twitter, impacting our social lives. Because of this, there is an increasing need to identify online postings that violate accepted norms. For resource-rich languages like English, the challenge of identifying hateful and offensive posts has been well investigated. However, it remains unexplored for languages with limited resources, like Marathi. Code-mixing frequently occurs in the social media sphere; therefore, identifying conversational hate and offensive posts and comments in code-mixed languages is also challenging and unexplored. Across three objectives of the HASOC 2022 shared task, we proposed approaches for recognizing offensive language on Twitter in Marathi and two code-mixed languages (i.e., Hinglish and German). Some tasks can be expressed as binary classification (also known as coarse-grained, which entails categorizing tweets as hateful and offensive or not), while others can be expressed as multi-class classification (also known as fine-grained, where hateful and offensive tweets are further categorized as Standalone Hate or Contextual Hate). We concatenate the parent, comment, and reply texts to create a dataset with additional context. We use multilingual Bidirectional Encoder Representations from Transformers (mBERT), which has been pre-trained, to acquire contextual representations of tweets. We carried out several trials using various pre-processing methods and pre-trained models. Finally, the highest-scoring models were used for our submissions in the competition, which ranked our team (irlab@iitbhu) second out of 14, seventh out of 11, sixth out of 10, fourth out of 7, and fifth out of 6 for ICHCL task 1, ICHCL task 2, and Marathi subtasks 3A, 3B, and 3C, respectively.
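The parent-comment-reply concatenation step described above can be sketched as a small helper that flattens a conversation thread, oldest turn first, into a single input sequence. The "[SEP]" marker mirrors BERT's separator token, but splicing it in as plain text is only an illustration: in a real pipeline the mBERT tokenizer would insert the special tokens itself. The thread texts are invented for illustration.

```python
def build_context_input(parent, comment=None, reply=None, sep="[SEP]"):
    """Concatenate a conversation thread (oldest first) into one sequence
    so the classifier can see the context a reply is responding to.
    Missing turns (e.g., a top-level comment with no reply) are skipped."""
    parts = [p.strip() for p in (parent, comment, reply) if p]
    return f" {sep} ".join(parts)

# Hypothetical conversation thread (invented for illustration).
parent = "Did you see the match yesterday?"
comment = "Haan yaar, referee was totally biased"
reply = "Bilkul, usko hata dena chahiye"

text = build_context_input(parent, comment, reply)
print(text)  # parent, comment, and reply joined by the separator
```

This matters for the Contextual Hate label in particular: a reply that looks innocuous on its own can only be judged hateful once the parent and comment it endorses are visible in the same input sequence.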