Developing Corpora using Wikipedia and Word2vec for Word Sense Disambiguation

Farza Nurifan, Riyanarto Sarno, Cahyaningtyas Sekar Wahyuni


Word Sense Disambiguation (WSD) is one of the most difficult problems in artificial intelligence, often described as AI-hard or AI-complete. Many tasks can benefit from word sense disambiguation approaches, including sentiment analysis, machine translation, search engine relevance, coherence, anaphora resolution, and inference. In this paper, we address the WSD problem using two small corpora, which we develop with Word2vec and Wikipedia. After developing the corpora, we measure the similarity between a sentence and each corpus using cosine similarity to determine the meaning of the ambiguous word. Lastly, to improve accuracy, we use Lesk algorithms and Wu Palmer similarity to handle cases in which no word from the sentence appears in the corpora (we call this step semantic similarity). Our results show an accuracy rate of 86.94%, with semantic similarity improving the accuracy rate by 12.96% in determining the meaning of ambiguous words.


Word Sense Disambiguation; Word2vec; Wikipedia; Lesk; Wu Palmer
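
The cosine-similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy corpora, the sense labels, and the function names `cosine_similarity` and `disambiguate` are hypothetical, and plain word counts stand in for whatever representation the paper actually uses. When the sentence shares no word with any corpus, the sketch returns `None`, which marks the case the abstract says is handled by Lesk algorithms and Wu Palmer similarity.

```python
import math
from collections import Counter

# Hypothetical toy corpora: one small bag of related words per sense of "bank".
# The paper builds its corpora from Wikipedia and Word2vec; these lists are made up.
SENSE_CORPORA = {
    "bank (financial institution)": ["money", "loan", "deposit", "account", "credit"],
    "bank (river side)": ["river", "water", "shore", "stream", "fishing"],
}


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(sentence: str, corpora=SENSE_CORPORA):
    """Return the sense whose corpus is most similar to the sentence.

    Returns None when no word of the sentence occurs in any corpus,
    i.e. the case where a fallback method would be needed.
    """
    tokens = Counter(sentence.lower().split())
    scores = {
        sense: cosine_similarity(tokens, Counter(words))
        for sense, words in corpora.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None
    return best


if __name__ == "__main__":
    print(disambiguate("he opened an account to deposit money at the bank"))
    print(disambiguate("she sat on the bank of the river fishing"))
    print(disambiguate("the bank was closed"))
```

Whitespace tokenization keeps the sketch dependency-free; a real pipeline would normally add proper tokenization and stop-word handling before scoring.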






This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
