New Languages for NLP: Building Linguistic Diversity in the Digital Humanities

Diversifying NLP by teaching humanists to create data and models for new languages

AI/ML
Curriculum and Pedagogy
Digital Humanities
Linguistics
Multilingual
Natural Language Processing

Natural Language Processing (NLP) has revolutionized our ability to analyze texts at scale. However, the major NLP resources support only a fraction of the world's more than 7,500 languages. This means that text mining, topic modeling and other computational methods are unavailable for the vast majority of languages, especially those that are historical, minority, or endangered. The proliferation of data and tools in only a few dominant languages will hinder research and perpetuate existing structural inequalities on both local and global scales.

“New Languages for NLP: Building Linguistic Diversity in the Digital Humanities” was an educational initiative, funded by a National Endowment for the Humanities Institutes for Advanced Topics in the Digital Humanities grant, that enabled scholars to create high-quality linguistic data and train models for under-resourced, domain-specific and historical languages.

Between June 2021 and May 2022, eighteen scholars from around the world joined the workshop and worked on eleven languages: Ottoman Turkish, Tigrinya, Kanbun, Efik, 19th-century Russian, Classical Arabic, Old Chinese, Yoruba, Quechua, Yiddish and Kannada. They learned how to use cutting-edge NLP tools to advance their humanities research projects by creating, employing and interrogating text-analysis tools and methods, while increasing much-needed linguistic diversity in the field of NLP.

Hosted by the CDH, this Institute was a collaboration with the University of Pennsylvania, the Library of Congress Labs, and DARIAH, the European Digital Research Infrastructure for the Arts and Humanities.

Any views, findings, conclusions, or recommendations expressed on this page do not necessarily represent those of the National Endowment for the Humanities.

Related projects

Computational Approaches to Nigerian Literature

Experiments in NLP for texts in Yoruba and Efik

Related events

Computational Approaches to Nigerian Literature: Analyzing Texts in Yoruba and Efik at DH2024

Aug 9 2:00PM–3:30PM
Happy Buzaaba
Natalia Ermolaev
Utitofon Inyang
Temitayo Olatoye
DH2024

Related posts

“New Languages for NLP” Scholars Will Bring Global Perspectives to Text Analysis

26 March 2021

Announcing the ten language teams selected to participate in the New Languages for NLP: Building Linguistic Diversity in the Digital Humanities series of workshops, held at the CDH and funded by the NEH.

Event Recap: New Languages for NLP Workshop I

22 July 2021

The series aims to expand natural language processing (NLP) resources to low-resource and historical languages.

May 11–12: New Languages for NLP Conference

1 May 2022

Participants from the New Languages for NLP Institute will share results, challenges and lessons learned while training NLP models for under-resourced languages.

Recording Available: NLP Conference Keynote, Ines Montani

27 June 2022

Ines Montani, co-founder and CEO of Explosion AI, spoke at the New Languages for NLP: Building Linguistic Diversity in the Digital Humanities Conference in May.

Announcing Issue 3 of Startwords: “Parrots”

1 August 2022

Startwords Issue 3, “Parrots,” features three leading digital humanities researchers discussing the implications of “Stochastic Parrots” for humanities research employing NLP methods.

Team

Project Director

Andrew Janco

Instructor

David Lassner
Toma Tasovac

University Administrative Fellow

Grants

2020–2024

NEH Institutes for Advanced Topics in the Digital Humanities