Princeton’s Digital Humanists Introduce “Unsolved Data Problems” to Data and Computer Scientists

Natalia Ermolaev, Suzanne Roth

April 2, 2019

What computational challenges can a historian of the medieval Middle East, a scholar of Victorian poetry, and an experimental musician pose to a room of computer and data scientists?

On March 13, three members of Princeton’s humanities faculty - Marina Rustow (Near Eastern Studies and History), Meredith Martin (English) and Dan Trueman (Music) - presented their landmark digital humanities projects in a panel discussion called “Unsolved Data Problems,” part of the Center for Digital Humanities’ (CDH) yearlong Year of Data initiative. Co-sponsored by the Department of Computer Science and the Center for Statistics and Machine Learning (CSML), the event was moderated by Jennifer Rexford, chair of Computer Science. Rustow, Martin and Trueman spoke to an audience of approximately fifty data and computer scientists, and shared how new possibilities in humanities research - such as large-scale digitization, curating and sharing vast amounts of metadata, optical character recognition (OCR) and experimental audio engineering - have also created a host of new challenges that require creative technical and scholarly solutions.

The discussion started with Rustow, who directs Princeton’s Geniza Lab, a repository of over 4,000 images and transcriptions of ancient Hebrew and Arabic texts from the Cairo Geniza. She asked whether image recognition algorithms could be developed to help identify matching text fragments. Rustow also brought up the problem of wrangling messy metadata, an issue encountered by digital humanists and data scientists alike. Likening data cleaning to gardening, where one spends significant amounts of time “weeding” out bad data, she asked about developing computer-assisted techniques for processing, analyzing and curating large sets of multilingual and non-standardized metadata while retaining the nuance of rich cultural resources.

Meredith Martin, Director of the Princeton Prosody Archive (PPA) and CDH Faculty Director, delved deeper into the problems of working with large collections of machine-encoded text. A major goal of the PPA, which contains thousands of pages from books about poetics, prosody, rhetoric, grammar, speech, and literary history published between 1570 and 1923, is to allow scholars to do full-text search and analysis of these important works. Martin and her team have bumped up against the limits of OCR technology, which can’t always decipher typographically unique items, such as phonetic and musical notations, that are key to tracking various theories of prosody using a computational approach. Martin asked the audience if they could imagine developing a “smarter” OCR that could identify and encode non-standard characters at scale, giving the scholarly community a better understanding of the creative evolution of English language and poetry.

Finally, Dan Trueman, composer and computer programmer, discussed bitKlavier, a “prepared digital piano” software program that uses algorithmic intervention to explore the musical interaction between human and machine. Playing a piece he composed for bitKlavier, Trueman demonstrated how the instrument is able to adapt to changes in tempo and adjust to match the performer. While the instrument is currently adaptive and responsive, Trueman wondered whether the information encoded in the data that bitKlavier produces in the MIDI (Musical Instrument Digital Interface) format could be leveraged to create a predictive tempo, thereby radically altering the performance experience.

Dan Trueman demonstrates bitKlavier

“Unsolved Data Problems” introduced members of the computer and data science communities to specific challenges in transforming humanities material into data and code. Session attendees raised a number of insightful questions, proposed possible alternatives, and highlighted aspects of the source materials that hold research potential. This panel was part of growing efforts at Princeton to foster connections between digital humanities research and data and computer sciences.

Written with contributions from Rebecca Koeser, Rebecca Munson and Wafa Isfahani.