Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning

Reading Group

May 15, 11:30 am – 12:30 pm

To register for this event, visit the event listing on the PUL website.

Presented by Ben Lee, a 2020 Innovator-in-Residence at the Library of Congress

Ben writes: The 16 million digitized historic newspaper pages within Chronicling America, a joint initiative by the Library of Congress and the NEH, represent an incredibly rich resource for a wide range of users. Historians, journalists, genealogists, students, and members of the American public explore the collection regularly via keyword search. But how do we navigate the abundant visual content? Newspaper Navigator is a project that I am currently carrying out as an Innovator-in-Residence at the Library of Congress, in collaboration with Library of Congress Labs, the National Digital Newspaper Program, and my PhD advisor, Professor Daniel Weld, at the University of Washington. Newspaper Navigator consists of two parts. The first is to extract headlines, images, illustrations, maps, comics, and editorial cartoons from millions of newspaper pages by training an image recognition model on thousands of crowdsourced annotations collected by the Library of Congress’s Beyond Words initiative. The second is to reimagine how we can navigate this wealth of visual content through an exploratory search interface that enables users to define queries for concepts of their own choosing (which I refer to as “open faceted search”).

In this talk, I will share my current progress with Newspaper Navigator, including running the visual content recognition pipeline at scale. I will also discuss how this project, including the resulting datasets and search interface, can contribute to research in both computer science and the digital humanities.

Read more about the Newspaper Navigator project.

This event is part of the 2019-20 Collections as Data Discussion Series.