PPA Hosts Four Undergraduate Summer Interns

2 September 2021

A team of four undergraduates helped with metadata work essential for the Princeton Prosody Archive's new features, while undertaking independent research projects that advanced the PPA in a unique way.


This summer, the Princeton Prosody Archive team was hard at work preparing the 3.7 release, which incorporates over a thousand new works from Eighteenth Century Collections Online, as well as support for HathiTrust journal articles and book excerpts. A team of four English majors helped with metadata work essential for these new features, while also undertaking independent research projects that advanced the PPA in a unique way. Read about their experiences below.

Selena Hostetler ’23


What I worked on: I worked with T. V. F. Brogan’s bibliography English Versification, 1570-1980: A Reference Guide With a Global Appendix to locate works in the HathiTrust Digital Library so that they can be added to the PPA once the copyright on works published in that year expires [read more about the importance of Brogan’s work to the PPA]. I made a spreadsheet for each year from 1925 to 1934 with details about how to find each book or excerpt Brogan listed.

What I learned: I was unfamiliar with HathiTrust before working on this project, but I gained valuable experience with many of its search features. I became proficient at using the search filters, especially the date and author filters. I learned when to search in the full-text section and when to search the catalog, and towards the end of my research, I began to use the advanced search features as well. My work helped me practice my research skills and learn how to track down sources from a bibliography, even obscure ones. It also reminded me that though bibliographies are often thorough, they are never perfect, and their details can always be questioned and challenged until the source is found. 

Gavin T.A. Keasler ’22


What I worked on: I was assigned to work with the early modern era of poetry. Scanning through bibliographies in search of works that would be valuable to the PPA, I discovered the intriguing practice of poets commenting on, criticizing, and praising one another and one another’s works through poetry, a kind of “metapoetry,” as PPA project manager Mary Naydan put it. Aside from gaining broad exposure to the academic and theoretical sides of poetry by scanning bibliographies and candidate texts for potential inclusion in the PPA, I was also responsible for correcting and filling in missing information in a spreadsheet of materials from Gale’s Eighteenth Century Collections Online (ECCO) that were being integrated into the archive’s website.

What I learned: While doing this work and scanning bibliographies and other texts for potential inclusion, I learned how to better search databases such as HathiTrust, Princeton University Library, and WorldCat for sought-after materials, whether they were entire books, multi-volume works, or individual issues from long-running journals. I also learned how to better navigate within texts, using the tools provided in the database readers to find specific chapters, issues, and articles. Furthermore, I learned how to scan these works for key discussions and elements that would help me judge whether a text might belong in the Princeton Prosody Archive.

Andrew Matos ’23


What I worked on: I recorded the changes between editions of individual works held in the PPA. I covered a total of 511 editions of 44 works written by 18 different authors. These editions cover the most reprinted works and the most reprinted authors in our database. I noted every substantial change in each edition in a spreadsheet and summarized the most important changes each text underwent over its reprint history in bullet-point lists. The differences between editions of each text are documented in a sheet of their own, organized by author. In a separate spreadsheet, I also compared the number of editions of each work that we have in the PPA with the number of editions WorldCat holds.

What I learned: I became most interested in the historic role editors had over authors’ works. Most of the edits between editions were simple things, such as typographical corrections or short annotations, but there were several cases where the text as a whole was changed in a more fundamental way between editions. Several of the largest changes appeared when the editor worked for an educational institution, in which case the text needed to be transformed so that it could feasibly be taught in a school year. I was interested in how these editors sanitized or otherwise took the chance to “improve” the texts and to what extent the writer retained authorship over the work.

Sydney Peng ’22

What I worked on: I looked into the digitization practices behind the Princeton Prosody Archive’s databases and how to evaluate their accuracy and usability for search or full-text analysis. Both HathiTrust and ECCO store scanned images of the source texts and the plain text produced by optical character recognition (OCR). ECCO also provides confidence percentages, but these percentages do not directly correspond to accuracy. The OCR particularly struggles with older texts, lower-resolution images, and certain typographical features found in the PPA, such as metrical markings or diacritics. To improve the plain-text output, one possibility is to redo the OCR with a model trained specifically for the PPA, which may process these prosodic markings or unique features better. Current courses of action could involve determining the size of the PPA in order to draw sufficiently large samples for accuracy testing; grouping texts by searchability (e.g., “completely different alphabets/scripts” would not need to have 100% accurate OCR, while “poor quality image” or “1700s text” would be higher priority for improvements); and running preliminary tests on texts with prosodic markings with a personal OCR model to see if it can be taught to recognize them.
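To make “accuracy testing” concrete: one widely used measure of OCR quality is the character error rate (CER), the edit distance between the OCR output and a hand-corrected transcription divided by the length of that transcription. The sketch below is purely illustrative and is not the PPA team’s method. The function names are hypothetical, and it assumes a trusted reference transcription exists for each sampled page.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference transcription.
    Assumes `reference` is a hand-corrected transcription of the page."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # A long s misread as "f" is a typical error in eighteenth-century print.
    print(character_error_rate("blessed", "bleffed"))
```

Averaging this score over a random sample of pages from each group of texts (for example, “1700s text” versus later printings) would give a rough basis for deciding which groups most need improved OCR.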

What I learned: It was helpful to research the processes and databases behind large-scale digitization in order to understand their limitations and the potential uses of these vast corpuses for computational humanities. Training OCR or testing accuracy could help with the PPA’s search functionality or with text analyses, and hopefully these notes can be useful moving forward.

Editor’s note: This post is part of a series on undergraduate engagement at the Center for Digital Humanities. Check back to learn more about what our undergraduates have been working on!