New Languages for NLP: Building Linguistic Diversity in the Digital Humanities
Diversifying NLP by teaching humanists to create data and models for new languages
Natural Language Processing (NLP) has revolutionized our ability to analyze texts at scale. However, the major NLP resources only support a fraction of the world's more than 7,500 languages. This means that text mining, topic modeling and other computational methods are unavailable for the vast majority of languages, especially those that are historical, minority, or endangered. The proliferation of data and tools in several dominant languages will hinder research and perpetuate the existing structural inequalities on both local and global scales.
“New Languages for NLP: Building Linguistic Diversity in the Digital Humanities” was an educational initiative, funded by a National Endowment for the Humanities Institute for Advanced Topics in the Digital Humanities grant, to enable scholars to create high-quality linguistic data and train models for under-resourced, domain-specific and historical languages.
Between June 2021 and May 2022, eighteen scholars from around the world joined the workshop and worked on eleven languages: Ottoman Turkish, Tigrinya, Kanbun, Efik, 19th-century Russian, Classical Arabic, Old Chinese, Yoruba, Quechua, Yiddish and Kannada. They learned how to use cutting-edge NLP tools to advance their humanities research projects by creating, employing and interrogating text-analysis tools and methods, while increasing much-needed linguistic diversity in the field of NLP.
Hosted by the CDH, this Institute was a collaboration with the University of Pennsylvania, the Library of Congress Labs, and DARIAH, the European Digital Research Infrastructure for the Arts and Humanities.
Any views, findings, conclusions, or recommendations expressed on this page do not necessarily represent those of the National Endowment for the Humanities.
Related projects
Computational Approaches to Nigerian Literature
Experiments in NLP for texts in Yoruba and Efik
Related events
Computational Approaches to Nigerian Literature: Analyzing Texts in Yoruba and Efik at DH2024
Related posts
“New Languages for NLP” Scholars Will Bring Global Perspectives to Text Analysis
26 March 2021
Announcing ten language teams selected to participate in The New Languages for NLP: Building Linguistic Diversity in the Digital Humanities series of workshops, held at CDH and funded by the NEH.
Event Recap: New Languages for NLP Workshop I
22 July 2021
The series aims to expand natural language processing (NLP) resources to low-resource and historical languages.
May 11–12: New Languages for NLP Conference
1 May 2022
Participants from the New Languages for NLP Institute will share results, challenges and lessons learned while training NLP models for under-resourced languages.
Recording Available: NLP Conference Keynote, Ines Montani
27 June 2022
Ines Montani, co-founder and CEO of Explosion AI, spoke at the New Languages for NLP: Building Linguistic Diversity in the Digital Humanities Conference in May.
Announcing Issue 3 of Startwords: “Parrots”
1 August 2022
Startwords Issue 3, “Parrots,” features three leading digital humanities researchers discussing the implications of “Stochastic Parrots” for humanities research employing NLP methods.
Team
Project Director
Instructor
University Administrative Fellow
Grants
2020–2024
NEH Institutes for Advanced Topics in the Digital Humanities