Segmenting Paratextual Material in Arabic Scientific Manuscripts

Computational methods for classifying and analyzing visual aspects of the manuscript folio

AI/ML
Automated Text Recognition
Computer Science
Image Analysis
Near Eastern Studies

Much of the work in the digital humanities has centered on text. As a result, digital approaches to works that contain paratextual content frequently focus on the text (often just the main text) and overlook non-textual material. This has been the case even in approaches to premodern scientific and mathematical works, although these often include significant paratextual material in the form of tables and diagrams.

Advances in deep learning, however, are making it possible to work with visual material far beyond the capabilities of earlier computer vision techniques. This project trains and evaluates deep learning models to study paratextual material in Arabic manuscripts of mathematical and scientific texts. It draws on Princeton University Library’s expansive collection of Arabic manuscripts, of which over 800 are on mathematical or scientific topics and over 100 have digital reproductions. Because the models need a sufficiently large training dataset, the project also evaluates the efficacy of various data augmentation methods.
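
As a rough illustration of what data augmentation can mean for page images, the sketch below perturbs a grayscale folio image with a small shift, a brightness change, and pixel noise. It is not the project's method: the function name `augment_page`, its parameters, and the choice of transformations are all assumptions made for this example.

```python
import numpy as np


def augment_page(image, rng, max_shift=4, noise_std=0.02):
    """Return a randomly perturbed copy of a grayscale page image.

    Hypothetical helper for illustration only. `image` is assumed to be a
    2D array with values in [0, 1], where 1.0 is the blank page background.
    """
    src = image.astype(np.float64)
    h, w = src.shape

    # Small random translation; uncovered borders are filled with background.
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    out = np.ones_like(src)
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    out[dst_rows, dst_cols] = src[src_rows, src_cols]

    # Mild brightness change, loosely imitating uneven scanning conditions.
    out = out * rng.uniform(0.9, 1.1)

    # Additive Gaussian pixel noise.
    out = out + rng.normal(0.0, noise_std, size=out.shape)

    # Mirroring is deliberately left out: this sketch assumes a flipped page
    # would not resemble a real right-to-left manuscript folio.
    return np.clip(out, 0.0, 1.0)


# Usage: produce several variants of one synthetic "page".
rng = np.random.default_rng(0)
page = np.ones((32, 24))
page[10:14, 5:20] = 0.0  # a dark band standing in for a line of text
variants = [augment_page(page, rng) for _ in range(3)]
```

Each call returns a new array of the same shape, so any region labels attached to the page (for example, the bounding box of a table or diagram) would need the same shift applied to stay aligned.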

Related research groups

Text Technologies for Manuscript Cultures

Using emerging technologies to transform research, teaching and understanding of pre-modern evidence


Team

Researcher

Grants

2023–

Staff Project