Segmenting Paratextual Material in Arabic Scientific Manuscripts
Computational methods for classifying and analyzing visual aspects of the manuscript folio
Much work in the digital humanities has centered on text. As a result, digital approaches to works that contain paratextual content frequently focus on the text (often only the main text) and overlook non-textual material. This holds even for approaches to premodern scientific and mathematical works, which often include significant paratextual material in the form of tables and diagrams.
Advances in deep learning, however, are making it possible to work with visual material far beyond the capabilities of earlier computer vision techniques. This project trains and evaluates deep learning models to study paratextual material in Arabic manuscripts of mathematical and scientific texts. It draws on Princeton University Library’s expansive collection of Arabic manuscripts: of that collection, over 800 are on mathematical or scientific topics, and over 100 have digital reproductions. Because the models need a sufficiently large training dataset, the project also evaluates the efficacy of various data augmentation methods.
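To give a concrete sense of what data augmentation can involve for annotated page layouts, the following is a minimal sketch, not the project's actual method. It assumes a page is a grid of pixel values and that each annotated region (for example a table or diagram) is an axis-aligned box. The names `Box`, `crop_with_boxes`, and `random_crop` are hypothetical and chosen only for illustration. The point it shows is that when a page image is cropped to create a new training example, the region annotations have to be shifted and clipped along with it.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical region annotation; x1 and y1 are exclusive."""
    label: str
    x0: int
    y0: int
    x1: int
    y1: int


def crop_with_boxes(page, boxes, left, top, width, height):
    """Crop a page (a list of pixel rows) and adjust its region boxes.

    Boxes are shifted into the crop's coordinate system and clipped to
    its edges; boxes that fall entirely outside the crop are dropped.
    """
    cropped = [row[left:left + width] for row in page[top:top + height]]
    kept = []
    for b in boxes:
        x0 = max(b.x0 - left, 0)
        y0 = max(b.y0 - top, 0)
        x1 = min(b.x1 - left, width)
        y1 = min(b.y1 - top, height)
        if x0 < x1 and y0 < y1:  # keep only boxes with visible area
            kept.append(Box(b.label, x0, y0, x1, y1))
    return cropped, kept


def random_crop(page, boxes, width, height, rng=None):
    """Pick a random crop window that fits inside the page."""
    rng = rng or random.Random()
    page_height, page_width = len(page), len(page[0])
    left = rng.randint(0, page_width - width)
    top = rng.randint(0, page_height - height)
    return crop_with_boxes(page, boxes, left, top, width, height)


if __name__ == "__main__":
    # A toy 8x6 "page" with two annotated regions (illustrative values only).
    page = [[r * 8 + c for c in range(8)] for r in range(6)]
    boxes = [Box("diagram", 1, 1, 4, 3), Box("table", 5, 4, 8, 6)]
    crop, new_boxes = random_crop(page, boxes, 4, 4, random.Random(0))
    print(len(crop), len(crop[0]), new_boxes)
```

Cropping is used here because it leaves the writing direction untouched. Whether a given transformation is appropriate for manuscript pages is exactly the kind of question an evaluation of augmentation methods would need to settle.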
Related events
Evaluating Augmented Training Data for Complex Document Layouts: the Case of Arabic Scientific Manuscripts at DH2024
Related research groups
Text Technologies for Manuscript Cultures
Using emerging technologies to transform research, teaching and understanding of pre-modern evidence
Team
Researcher
Grants
2023–
Staff Project