African_UD: Universal Dependencies Treebank for African Languages

Increasing the representation of African languages in NLP by creating quality datasets for eleven African languages


Although Africa is home to approximately one-third of the world's languages, African languages continue to lag behind in the rapid advances in language technology and applications driven by large language models (LLMs). Due to a lack of quality data, African languages are classified as “low-resource” and “data-scarce” for AI and machine learning applications. Without robust linguistic resources, the development of technologies that could benefit speakers of African languages, such as speech recognition, machine translation, grammar checking, and text mining, is severely limited.

The goal of the project is to increase the representation of African languages in AI research by creating a quality dataset with theoretically sound and consistent syntactic human annotations for eleven typologically diverse African languages: Kinyarwanda, Chichewa, Xhosa, Hausa, Naija Pidgin, Yoruba, Zulu, Luganda, Igbo, Wolof and Efik.

The African_UD project embraces a responsible, scholarly and community-focused approach. We are partnering with the Universal Dependencies (UD) group, an international scholarly project for cross-linguistic annotation and open publication of annotated textual corpora called “treebanks.” The UD Treebanks have greatly facilitated the development of multilingual natural language processing (NLP) tools and resources, but they currently contain few African languages.
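UD treebanks are distributed in the CoNLL-U format, where each token occupies one tab-separated line of ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). As a minimal sketch, the snippet below parses one illustrative English sentence in this format; the sentence and its annotations are invented for demonstration and are not drawn from any African_UD treebank.

```python
# Minimal sketch of reading one sentence in CoNLL-U, the UD treebank format.
# The example sentence and its annotations are illustrative only.
CONLLU = """\
# text = Amina reads books
1\tAmina\tAmina\tPROPN\t_\t_\t2\tnsubj\t_\t_
2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_
3\tbooks\tbook\tNOUN\t_\t_\t2\tobj\t_\t_
"""

def parse_conllu(block):
    """Yield (id, form, upos, head, deprel) tuples from one CoNLL-U sentence."""
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comment lines and blank sentence separators
        cols = line.split("\t")
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        yield int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]

tokens = list(parse_conllu(CONLLU))
for tid, form, upos, head, deprel in tokens:
    print(tid, form, upos, head, deprel)
```

A HEAD of 0 marks the root of the dependency tree; every other token points to the ID of its syntactic head, so a whole sentence can be checked for well-formedness by verifying that exactly one token is the root and all head indices are valid.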

The African_UD project is a collaboration with Masakhane, a grassroots organization of African technologists who have been creating datasets and models for African languages since 2019.

Talk at Princeton Language + Intelligence Symposium 2024

Team

Project Director

Grants

2024–2025

Princeton Language + Intelligence (PLI) Seed Grant