New Languages for NLP: Building Linguistic Diversity in the Digital Humanities

Diversifying NLP by teaching humanists to create data and models for new languages

AI/ML
Curriculum and Pedagogy
Digital Humanities
Linguistics
Multilingual
Natural Language Processing

Natural Language Processing (NLP) has revolutionized our ability to analyze texts at scale. However, the major NLP resources support only a fraction of the world's more than 7,000 languages. This means that text mining, topic modeling and other computational methods are unavailable for the vast majority of languages, especially those that are historical, minority, or endangered. The concentration of data and tools in a few dominant languages hinders research and perpetuates existing structural inequalities on both local and global scales.

“New Languages for NLP: Building Linguistic Diversity in the Digital Humanities” was an educational initiative, funded by a National Endowment for the Humanities Institute for Advanced Topics in the Digital Humanities grant, that enabled scholars to create high-quality linguistic data and train models for under-resourced, domain-specific and historical languages.

Between June 2021 and May 2022, eighteen scholars from around the world joined the workshop and worked on eleven languages: Ottoman Turkish, Tigrinya, Kanbun, Efik, 19th-century Russian, Classical Arabic, Old Chinese, Yoruba, Quechua, Yiddish and Kannada. They learned to use cutting-edge NLP tools to advance their humanities research projects by creating, employing and interrogating text-analysis tools and methods, while increasing much-needed linguistic diversity in the field of NLP.

Hosted by the CDH, this Institute was a collaboration with the University of Pennsylvania, the Library of Congress Labs, and DARIAH, the European Digital Research Infrastructure for the Arts and Humanities.

Any views, findings, conclusions, or recommendations expressed on this page do not necessarily represent those of the National Endowment for the Humanities.

Related projects

Computational Approaches to Nigerian Literature

Experiments in NLP for texts in Yoruba and Efik


Related events

Computational Approaches to Nigerian Literature: Analyzing Texts in Yoruba and Efik at DH2024

Aug 9 2024 2:00PM–3:30PM
Happy Buzaaba
Natalia Ermolaev
Utitofon Inyang
Temitayo Olatoye
DH2024

Team

Project Director

Andrew Janco

Instructor

David Lassner
Toma Tasovac

University Administrative Fellow

Grants

2020–2024

NEH Institutes for Advanced Topics in the Digital Humanities