Advancing Language Identification in Code-Mixed Tulu Texts: Harnessing Deep Learning Techniques

Supriya Chanda, Anshika Mishra, Sukomal Pal

December 2023

Abstract

This study addresses word-level language identification in code-mixed Tulu-English texts, a task that is crucial for handling the linguistic diversity observed on social media platforms. The CoLITunglish shared task served as a platform for multiple teams to work on this challenge, with the aim of improving our understanding of, and capabilities in handling, code-mixed language data. Our approach uses Multilingual BERT (mBERT) for word embeddings and a Bi-LSTM model for sequence representation. Our system achieved a Precision of 0.74, indicating that most of its predicted language labels were correct. Its Recall of 0.571, however, shows that it missed a substantial share of the true labels and leaves room for improvement, especially in multilingual contexts. The resulting F1 score, which balances Precision and Recall, was 0.602, indicating reasonable overall performance. Ultimately, our work contributes to advancing language understanding in multilingual digital communication.
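As a rough illustration of how the three reported scores relate, the sketch below computes macro-averaged Precision, Recall, and F1 for word-level labels. The abstract does not state which averaging the shared task used, so macro averaging is an assumption here; the function name `macro_prf`, the label names, and the toy data are hypothetical and are not taken from the paper. Under macro averaging, F1 is the mean of the per-label F1 scores, so it need not equal the harmonic mean of the averaged Precision and Recall.

```python
from collections import Counter


def macro_prf(gold, pred):
    """Macro-averaged precision, recall, and F1 over all labels seen.

    Assumption: macro averaging (each label weighted equally). The
    abstract does not say which averaging the shared task used.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # true label g was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical toy data: one gold and one predicted label per word.
gold = ["Tulu", "Tulu", "English", "English", "Mixed", "Tulu"]
pred = ["Tulu", "English", "English", "English", "Tulu", "Tulu"]

precision, recall, f1 = macro_prf(gold, pred)
harmonic = 2 * precision * recall / (precision + recall)
print(f"P={precision:.3f} R={recall:.3f} macro-F1={f1:.3f} "
      f"harmonic-mean-of-P-and-R={harmonic:.3f}")
```

On this toy input the macro F1 comes out slightly below the harmonic mean of the macro Precision and Recall, because the label that is never predicted correctly contributes an F1 of zero to the average.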

Type

Preprint

Publication

Forum for Information Retrieval Evaluation

cmsa

Supriya Chanda

Research Scholar (2018-2024)