Evaluating Augmented Training Data for Complex Document Layouts: the Case of Arabic Scientific Manuscripts at DH2024

Arlington, VA

Speakers

  • Christine Roughan

CDH/MARBAS Postdoctoral Research Associate Christine Roughan will present her short paper as part of the panel on “Navigating the Intersection of Data, Design, and Discovery.”

Evaluating Augmented Training Data for Complex Document Layouts: the Case of Arabic Scientific Manuscripts

Advances in handwritten text recognition (HTR), combined with the availability of platforms like Transkribus and eScriptorium, have supported a multitude of projects that automatically extract information from images of historical documents. When recognizing information on a manuscript folio, many projects focus on the main text: it is often the object of study in the first place and, furthermore, extracting it tends to be a less complicated task for the computer. However, even if a project is only interested in the main text, other elements on the folio can still cause complications. Segmentation models used for layout analysis may erroneously identify (sections of) marginalia, tables, or even graphics as main text, introducing errors when these regions are passed on to HTR pipelines. Further, when paratextual material is itself the object of study, segmentation models capable of handling complex layouts are necessary to access this material at scale.

This study evaluates approaches to complex segmentation problems for the case of Arabic scientific and mathematical manuscripts. Such materials provide examples of complex layouts with multiple types of content, including but not limited to main text, marginalia, tables, diagrams, and illustrations. Improved access to this kind of data supports a variety of scholarly inquiries, from examination of mathematical diagrams to extraction of contemporary scholarship recorded in marginalia.

Image: Folio 5b from “The Book of Fixed Stars” by Al-Sūfī (10th century) in MS Princeton, Garrett no. 2259Yq