Milestones at the Machine Learning + Humanities Working Group
This year, the Center for Digital Humanities hosted the Machine Learning + Humanities (ML + H) Working Group, founded by graduate student Dan Friedman (Computer Science). The Working Group connects students and researchers across disciplines and interests, with the goal of exploring research questions and case studies in machine learning applied to the humanities.
As the academic year wraps up, here is a recap of the main topics we discussed, with links to helpful resources.
How has the Machine Learning + Humanities Working Group contributed to the academic community at Princeton?
Discussions were dynamic each time the ML + H Working Group met at the CDH. Everyone had an opportunity to share reflections on, and curiosity about, new concepts in the humanities, as well as about the technology behind machine learning methods.
Our discussions concentrated on three main questions. How can machine learning benefit research in the humanities? What new challenges can humanities resources pose for machine learning? And what insights can humanistic methods contribute to machine learning at social and cultural scale?
The overarching themes centered on the applications and challenges of computational methods in the humanities and, conversely, on humanistic observations that shape technical decisions and implementations in machine learning projects. By working with graduate students, postdoctoral scholars, and faculty on the one hand, and engaging with academic questions and needs on the other, we have seen a wide range of methods and scales in humanistic projects. Through panels, lectures, and social gatherings, we gathered perspectives on many fronts of the humanities, ranging from language models to cultural heritage collections. Two areas emerged where machine learning has become especially relevant: large language models and cultural heritage.
Large Language Models (October 26–27, November 17)
The starting point for our engagement with large language models was "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" by Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell (2021). The Working Group co-organized a roundtable on October 26 with leading experts, each of whom submitted a position paper. In addition to co-authors McMillan-Major (University of Washington) and Mitchell (now at Hugging Face), we hosted Gimena del Rio Riande (University of Buenos Aires), Lauren F. Klein (Emory University), and Ted Underwood (University of Illinois). The discussion was moderated by Toma Tasovac (DARIAH-EU).
The roundtable inspired further discussions on October 27 and November 17. The Stochastic Parrots paper has generated conversations in computer and data science, in particular around data structuring and models in Natural Language Processing (NLP), which is not surprising given that the paper originated in state-of-the-art academic and industry research. The group drew parallels between preparing large language models for machine learning analysis and scenarios in the humanities: data collection, as well as NLP methods, models, and documentation in the fields of literature, history, and visual culture. Large Language Models (LLMs) have become common in machine learning research and applications; however, their use raises many ethical questions, about data provenance in particular. Our working group discussions brought together researchers from machine learning, the humanities, and the social sciences to consider those ethical questions, along with the possibilities and challenges of applying LLMs to humanistic research.
Since literary analysis increasingly benefits from NLP methods, scholars employing NLP in their research need to take into account the significance of the paper's arguments and the limitations it reveals, starting from linguistics, an established field for automating research techniques. Some questions remain unexplored, or provisionally answered at best, as scholars concentrate on the ethical implications of assembling and calibrating models while also establishing standards in machine learning; these open questions signal unforeseen implications and potential dangers in NLP projects. In this context, implementation means turning computational theory into humanistic practice, which ideally includes both formal and informal pathways of data gathering, processing, and mining. Once tasks are defined for researchers, project managers, collaborators, and supervisors, implementation can involve reassessing the planning and decision-making process in order to choose the most effective practices and ethical safeguards. Broadly, the Working Group agreed that using machine learning often involves negotiation between researchers and stakeholders with common interests, and with the public at large, researchers included, who are affected by research findings and data collection practices.
More resources on language models and computational methods:
- What is Machine Learning in the humanities?
- Machine Learning for the Humanities: A very short introduction and a not-so-short reflection (2020)
- Models trained without text are discussed by Yossi Adi et al. in their blog post, “Textless NLP” (September 9, 2021).
- Data in the Humanities
- Explore the Journal of Cultural Analytics, which has recently celebrated its fifth anniversary as an open-access journal published by the Department of Languages, Literatures, and Cultures at McGill University.
- Miriam Posner wrote an often-quoted blog post, “Humanities Data: A Necessary Contradiction” (June 25, 2015).
- New technologies pose challenges that Dennis Tenen discussed in “Blunt Instrumentalism: On Tools and Methods” (2016).
- Machine Learning Tools
- There is a tool for analyzing and visualizing word embeddings: Embedding Projector.
- For NLP techniques, see David Bamman’s BookNLP, Melanie Walsh’s BERT for Humanists, and spaCy.
- Machine Learning Projects
- Always Already Computational and Collections as Data document the investigators’ work and rationale for what they call “data-driven scholarship” and “data oriented services.”
- New Languages for NLP, hosted by the CDH in partnership with DARIAH-EU and funded by the National Endowment for the Humanities (2021–2022)
- Standards and Applications
- Standards recently defined and implemented for the sciences include FAIR, an acronym for the Findability, Accessibility, Interoperability, and Reuse of digital assets. Find out more in “The FAIR Guiding Principles for scientific data management and stewardship” (2016).
- Steps to mitigate the harms of machine learning applications in facial analysis technology have been proposed in the Safe Face Pledge.
- Academic Conferences
- There has been an increasing number of conferences and publications on the computational study of literature and culture; for example, Computational Humanities 2021 and SIGHUM 2021.
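One of the tools listed above, the Embedding Projector, visualizes word embeddings by comparing the directions of word vectors. A minimal pure-Python sketch of the underlying measure, cosine similarity, using toy three-dimensional vectors invented for illustration (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (values invented for this example)
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# Related words point in similar directions (similarity near 1);
# unrelated words score much lower.
related = cosine_similarity(vectors["king"], vectors["queen"])
unrelated = cosine_similarity(vectors["king"], vectors["apple"])
```

Tools like the Embedding Projector plot such vectors in two or three dimensions, so that nearest neighbors by this measure appear as visual clusters.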
Cultural Heritage (November 17, February 22–23)
The Working Group discussed highlights and challenges of digital access on November 17 and February 22–23. The public can enjoy texts, images, and information provided by galleries, libraries, archives, and museums, and cultural institutions have pushed for change across the GLAM sector (Galleries, Libraries, Archives, and Museums, with the occasional addition of an R for Records). Once collections are digitized and made publicly accessible, however, problems remain: navigation, inconsistent data and metadata, and, consequently, the design of user interfaces and user experiences that allow viable analysis of those digital resources.
Our February conversation was informed by a talk by Benjamin Lee (University of Washington), who discussed Novel Machine Learning Methods for Computing Cultural Heritage: An Interdisciplinary Approach (recording below). As an Innovator in Residence at the Library of Congress, Lee developed the Newspaper Navigator, a project that transforms how we can search, connect, and examine millions of digitized historic newspaper pages. Lee showed how the project lets scholars in the humanities and social sciences, as well as the public, explore and analyze the visual content of cultural heritage collections through interactive Artificial Intelligence (AI). Drawing from Chronicling America, an open-access, open-source newspaper database produced by the United States National Digital Newspaper Program, the Newspaper Navigator covers millions of digitized historic newspaper pages, over which Lee has built open faceted-search systems designed for petabyte-scale web archives. Lee’s method opens new ways to search the collections of Chronicling America, as well as a range of digitized and born-digital collections at the United States Holocaust Memorial Museum and multilingual projects with Ladino documents.
At the Working Group, we asked and discussed the following questions: What new research questions does a tool like the Newspaper Navigator make possible? And what research questions might we want to ask of large digital collections that available technology cannot yet answer?
The cases we discussed in our meetings demonstrate the need for better strategies for using machine learning methods in the humanities. Machine learning has the potential to create new ways of studying large digital collections that document and shape the fields of literature, history, and art. Conversely, humanistic perspectives affect how computational tools for large collections of sources are received by the wider academic community. Researchers in the humanities need to work closely with their teams to ensure that the computational process and its documentation are sound and useful, for instance when applying object-detection models to read historical newspapers with computer-vision techniques. Finally, implementation is an evolving process and needs to respond to change and to new research questions.
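Chronicling America, the collection behind the Newspaper Navigator, also exposes a public search API. A minimal sketch, using only the Python standard library, of building a full-text query URL against that endpoint (the endpoint and the `andtext`/`format`/`page` parameters follow the Library of Congress API documentation; actually fetching the URL is left out):

```python
from urllib.parse import urlencode

# Public search endpoint documented by the Library of Congress.
BASE = "https://chroniclingamerica.loc.gov/search/pages/results/"

def build_search_url(text, page=1):
    """Build a full-text newspaper-page search URL that returns JSON."""
    params = {"andtext": text, "format": "json", "page": page}
    return BASE + "?" + urlencode(params)

url = build_search_url("women suffrage")
# Fetching `url` (e.g., with urllib.request) returns JSON containing
# `totalItems` and an `items` list of matching newspaper pages.
```

Programmatic access like this is what makes collection-scale projects such as the Newspaper Navigator feasible in the first place.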
Find more resources on cultural heritage and computational methods:
- What is Computer Vision?
- Computer vision has been discussed by IBM, among others, with a current definition of its goals, scope, and the technologies involved.
- Image recognition and processing use machine learning technologies, as explained in this video on convolutional networks.
- The Newspaper Navigator
- For a demo of the Newspaper Navigator and its features, watch this video.
- Want a quick introduction to the Newspaper Navigator? Check the Library of Congress demo (2:45–7:55) and presentation.
- Find technical details on GitHub, Newspaper Navigator.
- Cultural Heritage Projects
- Image analogies and pattern recognition have inspired the 4535 Time Magazine Covers project and WikiView, which can also serve pedagogical purposes.
- DH centers and grants have advanced web tools and applications, for example PixPlot, a tool at the Yale DHLab, and the Distant Viewing Lab at the University of Richmond.
- There is work in progress on a tool for extracting text from maps, image annotation, and entity linking at the Turing Institute’s Machines Reading Maps.
- See a blog post on synthetic maps by Chris Fleet, “Maps with a sense of the past: what are synthetic maps, and why do we love them?” (2021).
- Melanie Walsh has written an Introduction to Cultural Analytics & Python.
- Academic Conferences
- Applications of computational methods to cultural heritage have inspired conferences and symposia, such as the 2021 Computational Humanities Research Conference; Computer Vision in DH was the theme for the DH2018 conference.
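The convolutional networks mentioned in the computer-vision resources above are built from one core operation: sliding a small kernel across an image and summing elementwise products at each position. A minimal pure-Python sketch, with a hypothetical vertical-edge kernel and a toy 4×4 "image" (both invented for illustration):

```python
def convolve2d(image, kernel):
    """Slide a kernel over a 2D image (valid padding, stride 1),
    summing elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge detector: responds where bright pixels sit left of dark ones.
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
# A 4x4 image: bright left half (9), dark right half (0).
image = [[9, 9, 0, 0]] * 4
feature_map = convolve2d(image, kernel)  # [[27, 27], [27, 27]]
```

A convolutional network learns many such kernels from data and stacks them in layers; the hand-written kernel here just makes the sliding-window arithmetic concrete.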
Carousel photo by Charles Deluvio on Unsplash