Word-level Language Identification in Code-Mixed Data

Word-level language identification assigns a language tag to each word in a sentence, a prerequisite for processing code-mixed social media text. The task is difficult because languages are mixed within a single sentence, scripts vary, and words are often spelled phonetically. The objective of this project was to develop models that accurately identify the language of each word in such mixed-language environments.

To achieve this, datasets were collected from social media platforms where users frequently switch between languages, and each word was manually annotated with its language tag. The models ranged from traditional machine learning approaches such as Conditional Random Fields (CRF) to deep learning models such as Recurrent Neural Networks (RNNs) and transformer-based models such as XLM-RoBERTa. The major challenges included handling ambiguous tokens, mixed-script words, and non-language tokens such as hashtags and emoticons.

Fine-tuning multilingual embeddings yielded significant improvements in word-level language identification, providing a strong foundation for downstream tasks such as translation, sentiment analysis, and hate speech detection.
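
Below is a minimal sketch of how word-level language identification can be framed as token classification with XLM-RoBERTa, using the Hugging Face transformers library. The tag set ("hi", "en", "univ"), the model checkpoint, and the example sentence are illustrative assumptions, not the project's exact pipeline; the point is how subword predictions are mapped back to whole words.

```python
# Minimal sketch: word-level language ID as token classification with XLM-RoBERTa.
# The label set ("hi", "en", "univ") and the example sentence are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["hi", "en", "univ"]  # Hindi, English, language-universal (hashtags, emoticons)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS)
)  # classification head is randomly initialised; fine-tune on annotated data first

words = ["yaar", "this", "movie", "was", "bahut", "achhi", "#bollywood"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits        # shape: (1, num_subword_tokens, num_labels)

pred_ids = logits.argmax(dim=-1)[0].tolist()
word_ids = enc.word_ids(batch_index=0)  # maps each subword token to its source word

# Take the prediction of the first subword of each word as the word-level tag.
seen = set()
for pos, wid in enumerate(word_ids):
    if wid is None or wid in seen:
        continue
    seen.add(wid)
    print(f"{words[wid]:>12} -> {LABELS[pred_ids[pos]]}")
```

In practice the classification head would be fine-tuned on the word-annotated corpus before its predictions are meaningful; the snippet only shows how word-level tags are recovered from subword tokens.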

Supriya Chanda
Research Scholar (2018-2024)

Supriya Chanda (pronounced Supriyo) completed his Ph.D. in the Department of Computer Science and Engineering, IIT (BHU), Varanasi. He carried out his research under the guidance of Dr. Sukomal Pal at the Information Retrieval Lab.