Urdu Parts of Speech Tagging

Pakistan has the fifth-largest population in the world, so the number of speakers of its regional languages is large. Urdu is the national language of Pakistan, which makes developing a natural language processing (NLP) system capable of understanding Urdu crucial. NLP is a challenging research area with interesting problems. Core NLP tasks include part-of-speech (POS) tagging and developing word embeddings, whereas applications include translation, intent classification, and sentiment analysis. There is therefore a need to develop both the basic processes and the applications built on them. Our project's primary focus is to collect text data, tag parts of speech, and then train a system to recognize parts of speech in Urdu text.

The goals of the project are:
- Collection of 100,000 tokens for training the system
- Annotation/tagging of the tokens for reference
- Development of a deep learning-based approach for Urdu part-of-speech tagging
- A mobile application for POS tagging
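As a rough illustration of the collection and annotation goals, the sketch below reads a small tagged sample, counts tokens toward the 100,000-token target, and builds a tag-to-id mapping of the kind a deep learning tagger typically needs. The tab-separated "token, tag" file layout, the Universal POS tag names, the sample sentences, and the function names `read_tagged` and `corpus_stats` are all assumptions made for this example; the project's actual tagset and data format are not stated here.

```python
from collections import Counter

TARGET_TOKENS = 100_000  # collection goal stated in the project description

# Hypothetical annotated sample: one "token<TAB>tag" pair per line,
# with a blank line between sentences. Tags are assumed Universal POS labels.
SAMPLE = "یہ\tPRON\nکتاب\tNOUN\nہے\tAUX\n\nوہ\tPRON\nگیا\tVERB\n"


def read_tagged(text):
    """Parse tab-separated token/tag lines into a list of sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def corpus_stats(sentences):
    """Return the total token count and a per-tag frequency counter."""
    tags = Counter(tag for sentence in sentences for _, tag in sentence)
    return sum(tags.values()), tags


sentences = read_tagged(SAMPLE)
total, tag_counts = corpus_stats(sentences)

# Integer ids for each tag, in sorted order so the mapping is reproducible.
label2id = {tag: i for i, tag in enumerate(sorted(tag_counts))}

print(f"{total} of {TARGET_TOKENS} tokens annotated")
print(dict(tag_counts))
print(label2id)
```

Keeping the tag-to-id mapping sorted means the same ids are produced on every run, which helps when a trained model's label indices must be matched back to tag names later.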

Keywords: Natural Language Processing (NLP), Parts of Speech, Text Classification, Rule-based Approach, XLM-RoBERTa Base Model, Active Learning, Deep Learning, Annotation
Tools: Hugging Face, Jupyter Lab, Visual Studio Code, TensorFlow, XLM-RoBERTa Base Model, Keras, scikit-learn, Python
Department: Department of Computer Science

Project Team Members

Name	Email	CV
Kiran Zafar	kiranzafar2019@namal.edu.pk
Hurmat Ilyas	hurmat2019@namal.edu.pk
Muhammad Bilal	mbilal2019@namal.edu.pk
