An update from Blue Mountain Springs
15 September 2015
This year, Blue Mountain is going to flood the library.
By one measure, Blue Mountain contains nearly 60,000 page images from nearly 2,400 issues of 31 magazines. By making these page images freely available on the World Wide Web, with an online catalog that can be searched and browsed and a full-text index to all the words on those pages, we have fulfilled one of the promises of digitization: vastly expanded access to rare and fragile materials. Most digital library projects stop there, just as most libraries limit themselves to providing discovery and access: their role ceases when the patron checks out the book or accepts the box in the reading room.
By another measure, Blue Mountain contains detailed bibliographic metadata about 21,117 textual constituents, 6,944 graphical constituents, 1,162 musical items, and 11,751 advertisements — a trove of information about publication and authorship: who published what where, when, and with whom. Right now, this database takes the form of a set of XML-encoded files that can be freely downloaded from GitHub. Like any real-world metadata, Blue Mountain’s isn’t pristine, and this summer our project began a systematic clean-up and assessment of 7 titles in order to gauge what it will take to make Blue Mountain’s metadata suitable for detailed bibliographic research. We’ll report on that in a future post.
By yet another measure, Blue Mountain contains over 12 million words in a dozen languages, making it a rich lode for computational linguists and digital philologists, who seek sets of thematically linked data on which to conduct basic research in natural-language processing. This text base is structured but unrefined: it is organized by constituent, but the text is uncorrected OCR. We’ll write about the quality of Blue Mountain’s texts in a future post, too.
By any measure, Blue Mountain is a large and heterogeneous resource, one that has coalesced at a time when intellectual prospectors of all stripes are becoming familiar with computational methods of research and analysis. As they do, they are demanding new ways to work with digitized texts besides turning virtual pages. As a research project whose mission is to serve the needs of these scholars and students, Blue Mountain must develop far more sophisticated ways to expose its resources to an entirely new audience: the scholars and scientists engaged in the activities loosely defined as “digital humanities.”
What exactly constitutes digital humanities is itself a topic of vigorous debate. We take digital humanities to refer to a set of practices that use computers and machine-readable representations of cultural heritage materials to investigate questions relating to space, time, and human community. Blue Mountain’s domain — the Western Avant-Garde in literature, art, music, and architecture — is full of these sorts of questions. What was the Avant-Garde? When did it begin? Where? In what ways was it international? In what ways did it cross disciplinary boundaries?
To answer these questions, one must be able to place events and people in space and time and trace the relationships among them. To do so, information scientists and digital humanities researchers need to bypass reader-oriented interfaces and access full-text data directly and programmatically for use with their own analytical tools, with web-based tools like Voyant, Palladio, and Raw, or with locally installed tools and toolkits like Gephi, Mallet, or NLTK.
This year, the Blue Mountain Project is collaborating with the Center for Digital Humanities at Princeton to develop an application programming interface (API) to Blue Mountain’s metadata, page images, and machine-readable full-text transcriptions. We’re calling this initiative “Blue Mountain Springs,” because it will make Blue Mountain an abundant source of clean data that can be drawn out of the resource and piped or poured into a scholar’s waiting tools.
Services like Blue Mountain Springs erode the distinctions between digital libraries, data sets, and data bases — that’s what flowing resources do.