SCOOP: Source Codes of the Past: Launching an international ATR/HTR Network for Manuscript Analysis

This workshop is by invitation only. For more information, please get in touch with croughan@princeton.edu, hreimitz@princeton.edu, or lucia.waldschuetz@princeton.edu.

Organized by:

Institute for Advanced Study (IAS), Princeton
Center for Digital Humanities (CDH), Princeton University
Manuscripts, Rare Books, and Archival Studies (MARBAS), Princeton University
Humanities Initiative, Princeton University
Digital Lab, Institute for Medieval Research (IMAFO), Austrian Academy of Sciences

While recent years have seen many significant developments in ATR/HTR (automatic text recognition, handwritten text recognition), there remains work to be done in adapting these technologies to different scripts, textual traditions, and manuscript structures – and especially so for low-resource languages and materials. Additionally, the integration of ATR with dataset curation, text re-use analysis, editorial workflow automation, and other methodologies is transforming the way researchers engage with manuscript sources. This workshop will bring together humanities/social science scholars, software engineers, and machine learning researchers so that technological and humanistic expertise might inform one another. The workshop will also lay the groundwork and define the agenda for a second SCOOP exchange meeting in Vienna in the summer of 2026, to be hosted by the Austrian Academy of Sciences and the University of Vienna.

Program Schedule

Thursday, June 12th
9:15-10:30	Session 1a: HTR Technology Development
	Moderator: Martin Roček (Charles University, Prague, Institute for Medieval Research, Austrian Academy of Sciences, Vienna)
	Tobias Hodel (University of Bern) “Building General Models: Approaches, State-of-the-Art, and Challenges”
	Achim Rabus (University of Freiburg) “Pragmatic HTR: Smart Models, Synthetic Data, and Navigating the Performance-Usability Landscape”
	Benjamin Kiessling (Paris Sciences et Lettres University) “Large Multilingual ATR Models and Humanities Practice - Conflicts and Pathways”
10:45-12:00	Session 1b: HTR Technology Development
	Moderator: Martin Roček (Charles University, Prague, Institute for Medieval Research, Austrian Academy of Sciences, Vienna)
	Matthew Miller (University of Maryland) “Approaches to Open Source, Large-scale Arabic-script Text Recognition”
	Andrew Janco (Princeton University) & Ann Farnsworth-Alvear (University of Pennsylvania) “Auto-Cataloging Research Materials with 'Small' Vision-Language Models”
	John Pavlopoulos (Athens University of Economics and Business, and Archimedes, Athena Research Center, Greece) “Learning to Adapt: Addressing Character Frequency Distribution Shifts in HTR”
1:00-2:30	Session 2: Document or Handwriting Classification
	Moderator: Tobias Hodel (University of Bern)
	Aaron Hershkowitz (Institute for Advanced Study) & Nicholas Howe (Smith College) “Classifying Squeezes: Experiments in HTR for Greek Epigraphy”
	Sebastian Sobecki (University of Toronto) “Communities of Practice: Scripts, Scribes, and the Production of Literature in London, 1377-1471”
	Serena Ammirati (Università Roma Tre) & Paolo Merialdo (Università Roma Tre) “Explanatio manifesta: towards high-level explanations of medieval handwriting identification systems”
	Isabelle Marthot-Santaniello (University of Basel) & Giuseppe De Gregorio (University of Basel) “Comparing Alphas: Detection and Recognition of Ancient Greek Characters on Papyri and their Applications in Digital Paleography”
2:45-4:30	Session 3: HTR Methodological Challenges
	Moderator: Christine Roughan (Princeton University)
	Alexandra Gillespie (University of Toronto) “What is a Book in the Age of Machine Learning?”
	Bernhard Bauer (University of Graz) “HTR and Early Medieval Multilingual Glosses: Establishing the GlossIT Corpus”
	Anna Michalcová (Charles University Prague, Czech Language Institute at the Czech Academy of Sciences, Institute for Medieval Research, Austrian Academy of Sciences, Vienna) “Orthographic Variability as HTR Challenge: Insights from Medieval Czech Manuscripts”
	Maria Konstantinidou (Democritus University of Thrace) “First Pass at the Unsung: HTR for Byzantine Music Notation”
	Jan Odstrčilík (Institute for Medieval Research, Austrian Academy of Sciences, Vienna) “Different Transcription Conventions for Various Languages in ATR: The Case of Latin-Czech Medieval Sermons”
4:45-6:15	Session 4: Language Challenges
	Moderator: Anna Michalcová (Charles University Prague, Czech Language Institute at the Czech Academy of Sciences, Institute for Medieval Research, Austrian Academy of Sciences, Vienna)
	George Kiraz (Beth Mardutho: The Syriac Institute, and Institute for Advanced Study) “Challenges in Building Syriac OCR: HTR Models for Syriac”
	Jajwalya Karajgikar (University of Pennsylvania) “An Overview of HTR for South Asian Manuscripts”
	Ajay Rao (University of Toronto) & Sloane Geddes (University of Toronto) “Opportunities and Obstacles: Deploying Escriptorium in the HTR of Early Modern Sanskrit Manuscripts”
	Osama Eshera (University of Maryland) “From Script to Structure: Open Problems in the Automatic Analysis of Islamic Manuscripts”

Friday, June 13th
9:00-10:45	Session 5: Datasets and Institutions
	Moderator: Jan Odstrčilík (Institute for Medieval Research, Austrian Academy of Sciences, Vienna)
	Alix Chagué (Inria, Paris, and Université de Montréal) “HTR-United schema for dataset descriptions”
	Jessie Dummer (University of Pennsylvania) “Collections as Data at Penn Libraries and Beyond”
	Thibault Clérice (ALMAnaCH, Inria, Paris) “From a post-doc about old French to a 200k lines dataset in 10 different languages: building CATMuS Medieval”
	Seth Kulick (Linguistic Data Consortium, University of Pennsylvania) “Linguistic Data Consortium Pilot Project: Motivations and Design”
	Christine Roughan (Princeton University) “Integrating ATR Software with University HPC Infrastructure”
11:00-12:30	Session 6: Leveraging Outputs: Text Reuse, NLP, and more
	Moderator: Tim Geelhaar (Goethe University Frankfurt)
	William Mattingly (Yale University) “Semantic Searching with Vector Databases and their Applications in Quote Identification”
	David Smith (Khoury College of Computer Science, Northeastern University) “Textual Criticism as Language Modeling: From Transcription to Collation and Back Again”
	Martin Roček (Charles University, Prague, Institute for Medieval Research, Austrian Academy of Sciences, Vienna) “Enhancing Sentence Similarity Search with S-BERT: A Semantic Approach”
	Seth Kulick (Linguistic Data Consortium, University of Pennsylvania) “Orthographic variation and post-OCR correction for Yiddish”
2:00-3:30	Plenary Discussion: Future of HTR
4:00-5:30	Organizational session; planning for the 2026 meeting
