Humanists and Technologists Join Forces to Advance Historical Text Recognition and Research

11 August 2025

Graduate students participate in the ATR/HTR Training Workshop (Photo: Carrie Ruddick)

Forging the future of text recognition for research on historical manuscripts, the Source Codes of the Past (SCOOP) conference connected an international network of experts at the Institute for Advanced Study (IAS) in Princeton in June 2025.

Collaborative in every regard, the conference was organized by Professor of History Helmut Reimitz, Center for Digital Humanities (CDH) Postdoctoral Research Associate Christine Roughan, and History Ph.D. candidate Lucia Waldschuetz, along with Professor of Medieval Studies at the IAS Suzanne Akbari and CDH Executive Director Natalia Ermolaev. The launch of this network was a joint venture of the IAS, the Princeton Humanities Initiative (PHI), the CDH, the Manuscript, Rare Book, and Archival Studies Initiative (MARBAS), and the Institute for Medieval Research at the Austrian Academy of Sciences. The Center for Collaborative History, the Department of Classics, the Seeger Center for Hellenic Studies, the Program in Medieval Studies, and the Committee for the Study of Late Antiquity also joined the organizers in sponsoring the workshop.

“Fostering collaboration is a major goal of the Princeton Humanities Initiative, and SCOOP brings together teams that are working across institutions, disciplines, and countries to advance our ability to learn about the past and inform our future.” 
— John Paul Christy, Executive Director, Princeton Humanities Initiative

Christine Roughan opens the inaugural SCOOP conference at the Institute for Advanced Study (Photo: Kirstin Ohrt)

Humanities and social science scholars, software engineers, and machine learning researchers, some wearing several hats, pooled their expertise and informed one another’s understanding of automatic text recognition (ATR) and handwritten text recognition (HTR) technologies. This intensive think tank centered on the challenge of adapting existing technologies for diverse scripts, textual traditions, and manuscript structures, especially for understudied languages and materials. Because these projects have long developed in separate streams, each with its own investment of time and resources, convening project leaders to share successes and challenges promises substantial gains in efficiency.

“AI and machine learning tools for text recognition are transformative—not only for deciphering individual manuscript traditions, but for enabling large-scale, comparative research that brings diverse cultural histories into meaningful conversation with one another,” said Ermolaev. “As these technologies become more sophisticated, it is essential that humanists are at the table, helping shape how these tools are designed and deployed. Scholars of historical languages and cultures bring deep knowledge that is critical to developing more accurate, ethical, and inclusive AI systems. The long-standing collaboration between humanists and technologists in digital humanities is more urgent than ever, as we work together to ensure that the cultural data of the past informs the technological futures we’re building today.”

Presenter Tobias Hodel (University of Bern) underscored the importance of leaning into the ATR/HTR community of stakeholders and experts. With the technical hurdle cleared, he said, the critical question “what’s next?” requires a collaborative answer. Achim Rabus (University of Freiburg) agreed that discussion between parties is imperative, as is sufficient training. He noted the debilitating gap between those with technological expertise and those with humanistic expertise, and stressed that training for both groups is needed to make the tools as usable as possible. Evaluating anomalies in a program’s output, for example, requires collaborative examination when faced with the recurring problem: “We don’t know if it’s a bug or a feature.” Among definitive features, Rabus pointed to strides and further opportunities in smart transcription, which automatically interprets and expands abbreviations in the original text.

Benjamin Kiessling presents “Large Multilingual ATR Models and Humanities Practice - Conflicts and Pathways” (Photo: Kirstin Ohrt)

Benjamin Kiessling (Paris Sciences et Lettres University) campaigned for new text reader models to resolve the lingering problem that bespoke models cater to niche research questions. What’s needed, he said, is a way to align output with research questions and allow models to become more interchangeable or generalized. With this goal in mind, Kiessling has developed PARTY, or Page-wise Recognition of Text. 

The workshop illustrated that when stakeholders work together across functions and areas of expertise, the benefit to scholarship can be enormous. Using technology to make manuscripts accessible to scholars in languages unfamiliar to them allows for connections that previously went unmade. This democratization of knowledge, said Rabus, is game-changing.

Launching a Graduate Student Text Recognition Technology Boot Camp

The conference included a comprehensive three-day ATR/HTR Training Workshop designed to train graduate students and scholars with various experience levels and backgrounds in text recognition technology. Led by instructors Helmut Reimitz, Christine Roughan, Anna Michalcová, Martin Roček, and Jan Odstrčilík, the workshop provided a structured progression from technical fundamentals to practical application. 

“We took care to structure the workshop so that it would offer training relevant to researchers working in any historical written tradition, because the underlying methods of ATR are not limited by language or discipline,” shared Roughan. “We were pleased to be joined by participants from history to NES [Near Eastern Studies], from art & archaeology to music as a result.”

The first day covered introductory concepts, including how ATR/HTR works technically, the available ATR tools, and the basics of key platforms such as Transkribus and eScriptorium. The second day delved into the practicalities of using such platforms, covering topics such as layout and model training, data formats, and methodological considerations for both Latin and non-Latin scripts. The last day turned to using text recognition tools and their outputs in research: HTR model evaluation, data sharing through platforms such as Zenodo and GitHub, and techniques for developing custom models from existing published models. Supervised hands-on practice sessions ran throughout to reinforce the learning objectives.

“Knowing how to use the tools is just step one,” said Roughan. “The scholarly community continues to publish a wealth of data and models – knowing how to interact with and build upon that foundation empowers people to get the most out of research using text recognition methodologies.”


Training Workshop organizers from left to right: Christine Roughan, Helmut Reimitz, Jan Odstrčilík, Martin Roček, and Anna Michalcová. (Photo: Carrie Ruddick)

SCOOP 2.0

By all accounts, the conference exceeded its goals. “It was an amazing and extremely encouraging start for the network,” said Reimitz. “Everyone agreed that a platform for exchanging ideas between AI experts, computer scientists, and humanities scholars is urgently needed in order to take the application of HTR in the humanities to the next level.”

Paving the way to that next level, a second SCOOP workshop is already in the making. Hosted by the Austrian Academy of Sciences, SCOOP will reconvene in Vienna in summer 2026.

In the meantime, members of the SCOOP network are working to establish a digital communication platform to advance conversations on the implementation of text recognition tools in diverse projects involving various languages, scripts, layouts, and visualizations in original manuscripts and documents. The forum will also facilitate shared experimentation and modeling. “As an important focus, we agreed in Princeton on the question of interoperability issues and experiences with large established engines and smaller research groups working on under-resourced scripts and languages,” said Reimitz.

The SCOOP partners, the Princeton Humanities Initiative, the Center for Digital Humanities, the Institute for Advanced Study, and the Austrian Academy of Sciences, are committed to carrying forward the momentum of this inaugural SCOOP conference. “Fostering collaboration is a major goal of the Initiative,” said Christy, “and SCOOP brings together teams that are working across institutions, disciplines, and countries to advance our ability to learn about the past and inform our future.”

To join the SCOOP network or learn more: scoop@oeaw.ac.at, croughan@princeton.edu, anna.michalcova@oeaw.ac.at.