The Pages of Early Soviet Performance (PESP) uses machine learning to generate multiple datasets of early-Soviet illustrated periodicals related to the performing arts. By using computer vision techniques and training a YOLO (You Only Look Once) real-time object detection model, we are producing textual and image data that will facilitate new avenues of research about Soviet culture during the first decades after the October Revolution (1917-1932).
Our starting point is Princeton University Library's Digital PUL (DPUL), where ten titles of Soviet performance journals - totaling 526 issues and approximately 26,000 pages - have been digitized and can be freely viewed online. Journals are a diverse and complex genre: taken together, this collection contains hundreds of thousands of articles, poems, editorial commentaries, and advertisements, as well as images, illustrations, and graphic art. Today, researchers can browse the journals and view and download high-quality page images on DPUL.
Our project asks: what if we could access this collection as data? What patterns -- of words, phrases, or images -- can we discover across the whole collection? Which words or names are most frequent, and how does their frequency change over time? What types of images or subjects appear or disappear at points in a journal’s publication history? How did the role of advertisements evolve over the course of the 1920s? Which plays or concerts were the most frequently performed during this period?
During the term of this CDH grant, our team has created annotation data for several hundred page images using makesense.ai and a custom annotation tool called "Mayakovsky." We used transfer learning to train a YOLOv5 computer vision model to recognize three basic content categories across our corpus of 24,175 page images: text, image, and mixed text. To provide the highest-quality textual dataset, we are comparing the OCR output of Tesseract, Google Vision, and ABBYY. We also aim to provide a dataset of images from the journals, as well as the highly idiosyncratic but important category of “mixed text,” which includes elements where text and graphic design were combined (e.g., advertisements), along with a variety of paratextual elements such as covers, decorative borders and other ornaments, mastheads, and circulation statements. Our plan is to publish our datasets and models, and to thoroughly document our process.
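To make the OCR comparison mentioned above more concrete, here is a minimal sketch of one common way to score competing OCR outputs: character error rate (CER), the edit distance between an engine's output and a hand-corrected transcription, divided by the transcription's length. This is an illustration only, not a description of the project's actual evaluation method. The function names, the engine labels (`engine_a`, `engine_b`, `engine_c`), and the sample strings are all hypothetical, and no real OCR results are shown.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # drop ch_a
                    current[j - 1] + 1,      # drop ch_b
                    previous[j - 1] + cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def rank_engines(reference: str, outputs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Sort engines from lowest (best) to highest (worst) character error rate."""
    scores = {name: character_error_rate(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical data: a hand-corrected line and three invented OCR outputs.
    reference = "театр и музыка"
    outputs = {
        "engine_a": "театр и музыка",  # exact match
        "engine_b": "театр и музыха",  # one substituted character
        "engine_c": "теат и музыа",    # two dropped characters
    }
    for name, cer in rank_engines(reference, outputs):
        print(f"{name}: CER = {cer:.3f}")
```

A lower CER means fewer character-level corrections are needed. In practice such a score would be averaged over many transcribed samples, and the text would likely be normalized first (for example, for pre-reform orthography or hyphenation at line breaks).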
The PESP team is interdisciplinary, multi-institutional and international. It is spearheaded by Princeton’s Kat Reischl (Slavic), Thomas Keenan (PUL), and Natalia Ermolaev (CDH), with assistance from Alexander Jacobson (graduate student, Slavic). Our technical lead is Andrew Janco, Digital Scholarship Librarian at Haverford College. We partner with scholars from the Digital Humanities Research Center at St. Petersburg State University of Information Technology, Mechanics and Optics (ITMO): Antonina Puchkovskaya, Vladislav Tretyak, Anastasia Mamonova and Alexander Kudryashov.
For more on this project, see Princeton’s Slavic DH Working Group website.
CDH Grant History
- 2020–2022 Dataset Curation