Thanks to a recent grant from the National Endowment for the Humanities, the Center for Digital Humanities will play a critical role in diversifying the digital humanities’ linguistic landscape worldwide.
In late July, the NEH Office of Digital Humanities announced that the New Languages for NLP: Building Linguistic Diversity in the Digital Humanities project would receive support from its Institutes for Advanced Topics in the Digital Humanities program.
The project, led by CDH Associate Director Natalia Ermolaev and Andrew Janco (Digital Scholarship Librarian, Haverford College), is a collaboration with two partners: the Library of Congress LC Labs and the Digital Research Infrastructure for the Arts and Humanities (DARIAH-EU), a European Research Infrastructure Consortium. Staff from Princeton University Library’s Research Data Service will also be involved.
"This NEH grant gives the CDH a chance to expand our core work—bringing new technologies to humanities researchers—to a broader scholarly community," Ermolaev said. "We are excited to build a community of researchers who will probe difficult questions in text analysis from various linguistic vantage points, while learning critical approaches to digital tools and best practices in project and data management. And importantly, this project furthers the CDH’s core commitment to diversifying DH research so that it is more inclusive and equitable."
A critical problem animates the New Languages for NLP project: the major resources in Natural Language Processing (NLP), the means by which computers analyze human language, support only a small subset of the world’s languages. As a result, tools for computational text analysis are unavailable to scholars of thousands of languages.
"Humanities scholars do not care only about patterns and frequencies," explained DARIAH-EU Director Toma Tasovac. "We also care about what is unique, unusual, and strange. But discovering either—mediocrity or weirdness—in textual corpora is much more difficult if you don't have access to NLP tools for the particular language variety you're working on. Which, from the outset, puts some scholars—and some languages—at a great disadvantage."
These under-resourced languages, or "new languages," include both minority or endangered languages, such as Mauritian Creole and Plains Cree, and domain-specific languages that currently lack NLP tools, such as early modern Portuguese.
The New Languages for NLP workshop series, to be hosted by the CDH beginning in June 2021, will address this gap in resources by bringing together researchers working in new languages and a talented group of NLP experts.
"The instructors for the Institute come from a really diverse range of research backgrounds," Janco noted. "We have linguists. We have computer scientists. We have historians and literary scholars. As a group, we not only study languages other than English, but we tend to study mixed-language and multilingual corpora."
The call for participants will be released in the coming weeks; no technical experience is required to apply.
Before the workshops begin, participants must identify a machine-readable corpus of approximately twenty thousand words or tokens in their language—that is, a collection of texts suitable for NLP. A particularly useful source of this material is the Library of Congress Digital Collections, which contains texts in various world languages.
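As a rough illustration (not part of the project's official guidance), a corpus's approximate size can be estimated with a simple whitespace token count. This is a crude measure, since many scripts do not mark word boundaries with spaces, which is precisely why language-specific tokenizers matter:

```python
# Rough corpus-size check: counts whitespace-separated tokens across
# a set of plain-text files. Whitespace splitting is only an
# approximation; scripts without spaces need language-specific tools.
from pathlib import Path


def approximate_token_count(paths):
    """Sum whitespace-separated tokens over the given text files."""
    total = 0
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        total += len(text.split())
    return total


# Quick sanity check with an in-memory string instead of files:
sample = "a small machine readable collection of texts"
print(len(sample.split()))  # → 7
```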
"As a team who supports computational uses of our digital collections, LC Labs is excited to field questions from scholars who seek to use our digital collections in this way," said Eileen Jakeway, innovation specialist at LC Labs. Jakeway added that her team also sees the New Languages for NLP project as "an opportunity to highlight texts from our expansive collection that people may not already be aware of."
At the workshops, participants will learn to annotate their corpus and to use that annotated data to train models with the popular NLP library spaCy, creating NLP pipelines for their languages. The result? New tools that will enrich their scholarship.
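To give a flavor of what such annotation involves, spaCy-style training data pairs raw text with labelled spans given as character offsets. The snippet below is a simplified sketch using plain Python (no spaCy dependency); the example sentence and the PERSON and PLACE labels are hypothetical, not drawn from the project's materials:

```python
# Simplified sketch of character-offset annotation of the kind used to
# train spaCy pipelines: each example pairs raw text with labelled
# spans expressed as (start, end, label) character offsets.
training_example = (
    "Maria traveled to Lisbon in 1755.",
    {"entities": [(0, 5, "PERSON"), (18, 24, "PLACE")]},
)


def annotated_spans(example):
    """Return the text slice covered by each annotated span."""
    text, annotations = example
    return [text[start:end] for start, end, _label in annotations["entities"]]


print(annotated_spans(training_example))  # → ['Maria', 'Lisbon']
```

Checking that each offset pair actually covers the intended slice, as above, is a common first step before feeding annotations to a training loop.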
Moreover, Janco explained, participants can "expect to learn practical Python skills and methods for the computational analysis of text," and to "gain experience with project management, data management, and community engagement." Keynotes by distinguished guests and small-group discussion will enhance the experience for all participants.
At the final workshop, scheduled for May 2022, participants will share their research at a public conference.
The impact of the New Languages for NLP project does not end with the final workshop. As part of the program, participants will publish their data, annotations, and models on an open-source platform, facilitating future scholarship in their language. Instructor materials will also be revised and published on DARIAH-CAMPUS so that others can benefit from them.
The New Languages for NLP project is just the latest CDH-supported initiative advancing DH research in diverse languages.
This summer, the CDH wrapped up development on the Princeton Ethiopian Miracles of Mary Project (PEMM), which centers on miracle stories written in Gəˁəz, or Classical Ethiopic. This academic year, the CDH is partnering with the Princeton Geniza Project, a database of documents written in four languages: Judaeo-Arabic, Hebrew, Aramaic, and Arabic.
Both PEMM and the Geniza Project have reinforced the need to diversify DH tools and resources even beyond NLP.
"Working with materials in languages other than English has made clear the possibilities and limitations of working with today’s DH tools," said Ermolaev. "We look forward to partnering with major institutions such as LOC and DARIAH to inspire and disseminate new scholarship in new languages."
Added Tasovac: "The humanities, both analogue and digital, can only thrive in diversity."
Author’s note: This is the first in a series of posts on multilingual DH at the CDH. Stay tuned for posts on CDH-affiliated working groups in Slavic, East Asian, and South Asian DH, and for more on our current partnership with the Princeton Geniza Project.
Any views, findings, conclusions, or recommendations expressed in this blog post do not necessarily represent those of the National Endowment for the Humanities.