cmir

The Effect of Stopword Removal on Information Retrieval for Code-Mixed Data Obtained Via Social Media
Stopwords are scattered throughout documents, yet they contribute the least to the meaning of a sentence and account for a large share of the terms in a collection without carrying semantic value. They are therefore usually removed from documents to obtain a better representation of the text. In this paper, we explore and evaluate the effect of stopword removal on information retrieval performance for code-mixed social media data in Indian languages such as Bengali–English. A considerable amount of research has been carried out on sentiment analysis, language identification, and language generation for code-mixed languages. However, no such work has addressed the removal of stopwords from code-mixed documents, which is the motivation behind this work. We compare the impact of corpus-based and non-corpus-based stopword removal on information retrieval for code-mixed data, and we ask how to find the best stopword list for each constituent language of a code-mixed language. We observed that corpus-based stopword removal generally yielded significantly higher Mean Average Precision (MAP) values than non-corpus-based stopword removal, an improvement of 16%. The optimal stopword lists were obtained by jointly tuning a separate TF-IDF score threshold for each of the two languages.
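As an illustration of what corpus-based stopword selection with a TF-IDF threshold can look like, the following is a minimal sketch and not the procedure used in the paper. The scoring rule (mean TF-IDF over the documents containing a term), the function names `mean_tfidf` and `stopword_list`, the toy documents, and the threshold values are all assumptions made for the example. It also assumes that tokens have already been separated by language, so that the function can be run once per constituent language with its own threshold.

```python
import math
from collections import Counter


def mean_tfidf(docs):
    """Return each term's mean TF-IDF over the documents that contain it.

    docs is a list of token lists. This scoring rule is an assumption
    for illustration; the abstract does not specify the exact formula.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            totals[term] += tf * idf

    return {term: totals[term] / doc_freq[term] for term in doc_freq}


def stopword_list(docs, threshold):
    """Treat terms whose mean TF-IDF is at or below threshold as stopwords."""
    scores = mean_tfidf(docs)
    return {term for term, score in scores.items() if score <= threshold}


# Toy, romanized code-mixed documents (hypothetical data).
docs = [
    ["ami", "the", "movie", "dekhlam"],
    ["the", "gaan", "ta", "bhalo"],
    ["the", "match", "ta", "darun"],
]
strict = stopword_list(docs, threshold=0.05)
loose = stopword_list(docs, threshold=0.15)
```

Raising the threshold admits more terms into the stopword list, so in practice one would sweep the threshold for each language and keep the combination that gives the best retrieval score (for example MAP) on held-out queries.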