Code-Mixed Sentiment Analysis

Feb 22, 2025

Sentiment analysis of code-mixed text is essential for understanding user opinions, particularly in regions where people frequently switch between languages within a conversation. Traditional sentiment analysis models struggle with such data because of the complexity introduced by language mixing and the lack of large labeled datasets. This project aimed to develop a robust model for classifying sentiment (positive, negative, or neutral) in code-mixed social media text. A dataset was collected from social media platforms such as Facebook and Twitter, and preprocessing included tokenization, language identification, and normalization of the code-mixed text. A hybrid model combined traditional machine learning algorithms such as Support Vector Machines (SVM) with deep learning techniques such as Bidirectional Long Short-Term Memory (BiLSTM) networks. The project also used multilingual embeddings to capture the semantics of words across languages. One of the key challenges was handling context-sensitive elements such as sarcasm and slang. Fine-tuning transformer models such as BERT and multilingual BERT (mBERT) significantly improved sentiment classification accuracy on code-mixed data.
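The preprocessing steps named above (tokenization, language identification, normalization) can be illustrated with a minimal sketch. This is not the project's actual pipeline: the function names, the script-based tagging rule, and the sample sentence are all assumptions made for illustration. In particular, tagging by Unicode script cannot distinguish English from romanized Tamil, Kannada, or Malayalam, so Latin-script tokens are only labeled `latin` here; a real system would need a trained word-level language identifier for those.

```python
import re
import string

# Unicode block ranges for the three Dravidian scripts named on this page.
SCRIPT_RANGES = {
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "kn": (0x0C80, 0x0CFF),  # Kannada
    "ml": (0x0D00, 0x0D7F),  # Malayalam
}


def normalize(token: str) -> str:
    """Lowercase and squeeze runs of 3+ identical characters down to 2."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token.lower())


def tag_language(token: str) -> str:
    """Hypothetical script-based tagger: returns a script tag, not a true language ID."""
    for ch in token:
        code_point = ord(ch)
        for tag, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                return tag
    if any(ch in string.ascii_letters for ch in token):
        return "latin"  # English OR a romanized Dravidian word; script alone cannot tell
    return "other"      # digits, emoji, leftover symbols


def preprocess(text: str) -> list[tuple[str, str]]:
    """Whitespace-tokenize, strip edge punctuation, normalize, and tag each token."""
    pairs = []
    for raw in text.split():
        token = normalize(raw.strip(string.punctuation))
        if token:
            pairs.append((token, tag_language(token)))
    return pairs


# Made-up example sentence mixing romanized Tamil, Tamil script, and elongated slang.
print(preprocess("Intha படம் sooooper!!!"))
# [('intha', 'latin'), ('படம்', 'ta'), ('sooper', 'latin')]
```

Whitespace splitting is used instead of a `\w`-based regex because Python's `\w` does not match Indic combining marks and would split Tamil, Kannada, or Malayalam words in the middle.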

Social media, NLP, Sentiment Analysis, Code-Mixing, cmsa, Tamil-English, Kannada-English, Malayalam-English

Publications

Sentiment analysis of code-mixed Dravidian languages leveraging pretrained model and word-level language tag

The exponential growth of social media data in the era of Web 2.0 has necessitated advanced techniques for sentiment analysis. While sentiment analysis of monolingual datasets has received significant attention, sentiment analysis of code-mixed datasets still needs more study. Code-mixed data often contain a mixture of monolingual content (possibly in transliterated form), single-script multilingual content, and multi-script multilingual content. This paper explores the issue from three important angles. What is the best strategy for handling the data for sentiment detection: should the classifier be trained on the whole dataset or only on the purely code-mixed subset? How important is language identification (LID) for the task? If LID is to be done, how and when should it be used to yield the best performance? We explore these questions using three datasets of Tamil–English, Kannada–English, and Malayalam–English social media comments from YouTube. Our solution incorporates mBERT and an optional LID module. We report our results using precision, recall, F1 score, and accuracy. The solutions provide considerable performance gains and some interesting insights for sentiment analysis of code-mixed data.

Supriya Chanda, Anshika Mishra, Sukomal Pal
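The evaluation metrics named in the abstract above (precision, recall, F-score, accuracy) can be computed for a three-class sentiment task as in the sketch below. The page does not say how the paper averages per-class scores, so macro averaging and the F1 variant of the F-score are assumptions here, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def macro_scores(gold: list[str], pred: list[str]) -> dict[str, float]:
    """Macro-averaged precision, recall, F1, plus accuracy (averaging is an assumption)."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p wrongly
            fn[g] += 1  # missed the true label g

    def safe_div(num: float, den: float) -> float:
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        precision = safe_div(tp[label], tp[label] + fp[label])
        recall = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(safe_div(2 * precision * recall, precision + recall))

    n = len(labels)
    return {
        "precision": safe_div(sum(precisions), n),
        "recall": safe_div(sum(recalls), n),
        "f1": safe_div(sum(f1s), n),
        "accuracy": safe_div(sum(tp.values()), len(gold)),
    }


# Toy labels for illustration only; these are not results from the paper.
gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
print(macro_scores(gold, pred))
```

On imbalanced sentiment data, macro averaging weights every class equally, whereas a support-weighted average would track the majority class more closely; which one is reported changes the headline number.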

Crossing Borders: Multilingual Hate Speech Detection

With the relentless growth of technology usage, particularly among younger generations, the alarming prevalence of hate speech on the …

Supriya Chanda, Abhishek Dhaka, Sukomal Pal

Sarcasm Detection in Tamil and Malayalam Dravidian Code-Mixed Text

Supriya Chanda, Anshika Mishra, Sukomal Pal

Coarse and Fine-Grained Conversational Hate Speech and Offensive Content Identification in Code-Mixed Languages using Fine-Tuned Multilingual Embedding

We are seeing an increase in hateful and offensive tweets and comments on social media platforms like Facebook and Twitter, impacting …

Supriya Chanda, Sacchit D Sheth, Sukomal Pal

Is Meta Embedding better than pre-trained word embedding to perform Sentiment Analysis for Dravidian Languages in Code-Mixed Text?

This paper describes the IRlab@IITBHU system for the Dravidian-CodeMix - FIRE 2021 Sentiment Analysis for Dravidian Languages pairs …

Supriya Chanda, Rajat Pratap Singh, Sukomal Pal

Sentiment Analysis and Homophobia detection of Code-Mixed Dravidian Languages leveraging pre-trained model and word-level language tag

Social media platforms have seen a significant rise in user engagement in recent years. More and more people are expressing their views …

Supriya Chanda, Anshika Mishra, Sukomal Pal

Fine-tuning Pre-Trained Transformer based model for Hate Speech and Offensive Content Identification in English Indo-Aryan and Code-Mixed (English-Hindi) languages

Hate Speech and Offensive Content Identification is one of the most challenging problems in the natural language processing field, being …

Supriya Chanda, S Ujjwal, Shayak Das, Sukomal Pal

IRlab@ IITV at SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media Using SVM

This paper describes the IRlab@IIT-BHU system for OffensEval 2020. We use an SVM with TF-IDF features to identify and categorize …

Anita Saroj, Supriya Chanda, Sukomal Pal

IRLab@ IITBHU@ Dravidian-CodeMix-FIRE2020: Sentiment Analysis for Dravidian Languages in Code-Mixed Text

This paper describes the IRlab@IITBHU system for the Dravidian-CodeMix - FIRE 2020 Sentiment Analysis for Dravidian Languages pairs …

Supriya Chanda, Sukomal Pal
