Evaluating Augmented Training Data for Complex Document Layouts: the Case of Arabic Scientific Manuscripts at DH2024

DH2024

Aug 07 4:00 – 5:30 pm
Arlington, VA

CDH/MARBAS Postdoctoral Research Associate Christine Roughan will present her short paper as part of the panel on “Navigating the Intersection of Data, Design, and Discovery.”


Evaluating Augmented Training Data for Complex Document Layouts: the Case of Arabic Scientific Manuscripts

Advances in handwritten text recognition (HTR), combined with the availability of platforms like Transkribus and eScriptorium, have supported a multitude of projects that automatically extract information from images of historical documents. When recognizing information on a manuscript folio, many projects focus on the main text – it is often the object of study in the first place and, furthermore, extracting it tends to be a less complicated task for the computer. However, even if a project is only interested in the main text, other elements on the folio can still cause complications. Segmentation models used for layout analysis may erroneously identify marginalia, tables, or even graphics (or sections of them) as main text, introducing errors when these regions are passed on to HTR pipelines. Further, when paratextual material is itself the object of study, segmentation models capable of handling complex layouts are necessary to access this material at scale.

This study evaluates approaches to complex segmentation problems for the case of Arabic scientific and mathematical manuscripts. Such materials provide examples of complex layouts with multiple types of content, including but not limited to main text, marginalia, tables, diagrams, and illustrations. Improved access to this kind of data supports a variety of scholarly inquiries, from examination of mathematical diagrams to extraction of contemporary scholarship recorded in marginalia.

Image: Folio 5b from “The Book of Fixed Stars” by Al-Sūfī (10th century) in MS Princeton, Garrett no. 2259Yq