New Languages for NLP: Building Linguistic Diversity in the Digital Humanities

Diversifying NLP by teaching humanists to create data and models for new languages

Natural Language Processing (NLP) has revolutionized our ability to analyze texts at scale. However, the major NLP resources only support a fraction of the world's more than 7,500 languages. This means that text mining, topic modeling and other computational methods are unavailable for the vast majority of languages, especially those that are historical, minority, or endangered. The proliferation of data and tools in several dominant languages will hinder research and perpetuate existing structural inequalities on both local and global scales.

“New Languages for NLP: Building Linguistic Diversity in the Digital Humanities” was an educational initiative, funded by a National Endowment for the Humanities Institute for Advanced Topics in the Digital Humanities grant, to enable scholars to create high-quality linguistic data and train models for under-resourced, domain-specific and historical languages.

Between June 2021 and May 2022, eighteen scholars from around the world joined the workshop and worked on eleven languages: Ottoman Turkish, Tigrinya, Kanbun, Efik, 19th-century Russian, Classical Arabic, Old Chinese, Yoruba, Quechua, Yiddish and Kannada. They learned how to use cutting-edge NLP tools to advance their humanities research projects by creating, employing and interrogating text-analysis tools and methods, while increasing much-needed linguistic diversity in the field of NLP.

Hosted by the Center for Digital Humanities (CDH) at Princeton University, this Institute was a collaboration with the University of Pennsylvania, the Library of Congress Labs, and DARIAH, the European Digital Research Infrastructure for the Arts and Humanities.

Any views, findings, conclusions, or recommendations expressed on this page do not necessarily represent those of the National Endowment for the Humanities.