Urdu Parts of Speech Tagging
Pakistan has the fifth-largest population in the world, and Urdu is its national language, so a very large number of people use the language. Developing natural language processing systems that can understand Urdu is therefore crucial.
Natural language processing (NLP) is a challenging research area with many interesting problems. Core NLP tasks include parts of speech tagging and developing word embeddings, whereas applications include translation, intent classification, and sentiment analysis. Both the foundational tasks and the applications built on them therefore need further development.
Our project’s primary focus is to collect text data, tag parts of speech, and then train a system to recognize parts of speech in Urdu text.
The goals of the project are:
Collection of 100,000 tokens for the training of the system
Annotation/tagging of tokens for reference
Development of a deep learning-based approach for Urdu parts of speech tagging
Mobile application for tagging parts of speech
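The collection and annotation goals above can be illustrated with a minimal sketch. It assumes a tab-separated "token, tag" file layout with blank lines between sentences and uses Universal POS tag names; both are assumptions made for illustration, since the project's actual annotation format and tag set are not stated here. The sample sentences and the names `SAMPLE`, `read_sentences`, and `corpus_stats` are likewise hypothetical.

```python
from collections import Counter

# Hypothetical annotated sample: one "token<TAB>tag" pair per line,
# blank line between sentences. The Universal POS tag names used here
# are an assumption; the project's own tag set may differ.
SAMPLE = """\
یہ\tPRON
کتاب\tNOUN
ہے\tAUX

وہ\tPRON
سکول\tNOUN
گیا\tVERB
"""

TARGET_TOKENS = 100_000  # collection goal stated in the project goals


def read_sentences(text):
    """Parse tab-separated token/tag lines into a list of sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def corpus_stats(sentences):
    """Return the total token count and a per-tag frequency counter."""
    tags = Counter(tag for sent in sentences for _, tag in sent)
    return sum(tags.values()), tags


sentences = read_sentences(SAMPLE)
total, tags = corpus_stats(sentences)
print(f"{len(sentences)} sentences, {total} tokens "
      f"({total / TARGET_TOKENS:.4%} of the collection target)")
print(tags.most_common())
```

A per-tag count like this is one simple way to track progress toward the token target and to spot tags that are under-represented before training.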
Keywords: Natural Language Processing (NLP), Parts of speech, Text Classification, Rule-based approach, XLM Roberta Base Model, Active Learning, Deep Learning, Annotation
Tools: Hugging Face, Jupyter Lab, Visual Studio Code, TensorFlow, XLM Roberta Base Model, Keras, scikit-learn, Python
Department: Department of Computer Science
Project Team Members
Name | Email
Kiran Zafar | kiranzafar2019@namal.edu.pk
Hurmat Ilyas | hurmat2019@namal.edu.pk
Muhammad Bilal | mbilal2019@namal.edu.pk
Project Poster