From Oct. 2-4, members of the CDH attended the DARIAH beyond Europe conference at the Library of Congress in Washington, D.C. to compare notes with DARIAH-EU members on DH methods. This post is a very brief summary of what we heard.
From DARIAH-EU director Toma Tasovac we learned that DARIAH has received funding from the European Commission through the Horizon 2020 initiative for an ambitious new project to build the "marketplace" - a unified platform for the exchange of DH tools, data, and services. This will be DARIAH's contribution to the European Open Science Cloud.
Digital Newspapers and Text Analysis
Mikko Tolonen of the University of Helsinki gave an overview of the current state of digital newspaper work in the EU.
Digitization projects are broadly similar across Europe - most national libraries are engaged in the lengthy process of digitizing document microforms via outsourced labor, often performed in India (the Swedish national library's work on re-scanning primary documents is a notable exception). Institutions are thus facing a similar set of problems, including widespread quality issues with OCR, recognizing and indexing multilingual content, and performing semantic enrichment to alleviate weaknesses in the data.
Although institutions once shouldered the burden of these problems alone, the private sector has taken notice and in some cases released tools for scholars - a recent example is Gale's Digital Scholar Lab, which provides a variety of services for using Gale content in DH research. We also heard about the success of the Transkribus project, an effort to leverage machine learning for handwritten text recognition, layout analysis, and other tasks.
The NewsEye project aims to pool expertise among member institutions on improving OCR quality and performing a wide variety of enrichment tasks, including named entity recognition, sentiment analysis, binary classification, and more. By adopting a novel algorithm that focuses on locale-agnostic properties of natural language, NewsEye has already produced models that maintain a high degree of precision across a variety of languages. This makes it possible to enrich and analyze multilingual content that other state-of-the-art, high-precision natural language processing models cannot handle.
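To make the precision claim concrete: precision measures the fraction of predicted entity spans that match the gold-standard annotations, and it can be computed per language to check whether a model degrades outside its home locale. A minimal sketch with invented entity spans - not NewsEye's actual models or corpora:

```python
# Hypothetical per-language precision check for named entity recognition.
# The spans below are made-up (start_offset, end_offset, label) tuples,
# not NewsEye data; the metric itself is the standard definition.

def precision(predicted, gold):
    """Fraction of predicted entity spans that appear in the gold set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)

gold_by_lang = {
    "fi": {(0, 8, "LOC"), (15, 22, "PER")},
    "de": {(3, 10, "ORG"), (20, 31, "LOC"), (40, 47, "PER")},
}
pred_by_lang = {
    "fi": {(0, 8, "LOC"), (15, 22, "PER")},                     # both correct
    "de": {(3, 10, "ORG"), (20, 31, "LOC"), (50, 55, "PER")},   # one spurious span
}

for lang in gold_by_lang:
    p = precision(pred_by_lang[lang], gold_by_lang[lang])
    print(f"{lang}: precision = {p:.2f}")
```

A model that is truly locale-agnostic would show roughly flat numbers across the languages in such a table, rather than a sharp drop outside the training language.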
Fien Danniau of the University of Ghent discussed the dichotomy between research and public history in the EU: while researchers are often concerned with DH methods as a means to produce academic output, public history offers a chance for institutions to break out of the ivory tower and focus on the process of creating knowledge for the wider community. Although public humanities is beginning to be understood as a collaborative meeting place for these two approaches, many unsolved questions remain - particularly in terms of scale.
One example is the well-known Europeana, a portal that aggregates a staggering 58,000,000+ digital items from European libraries, archives, and museums. Although Europeana publishes usage statistics, it's unclear if the numbers really answer questions about who is using the portal and why: on such a massive scale, is it even possible to make statements about a "public" that is being served? Moreover, as Danniau pointed out, one can easily discover items held at the Library of Congress within Europeana, which complicates even the pretense that it serves a pan-European "public".
Recalling the principle that "collections as data designed for everyone serve no one," it's no surprise that current public humanities work in the EU - as in the United States - often centers around local cultural heritage institutions. Many projects take this a step further by involving a geospatial dimension - for example, the ambitious De Krook library project in Ghent to create a new type of collaborative public space.
Kees Teszelszky of the National Library of the Netherlands gave us a short history of web archiving in the EU, including the NEDLIB harvester, a now-defunct project to create a pan-European web archiving infrastructure.
Building on the lessons from NEDLIB, many national libraries in the EU have struck out separately in efforts to archive the web, with mixed results. In many countries, lack of a legal deposit law can severely restrict web archiving efforts - a problem for which institutions in the US have developed their own solutions. We learned that the task of web archiving is similarly decentralized in Switzerland, with librarians and archivists individually managing the archiving of content in their local communities. At the National Library of the Netherlands, web archiving efforts only began in earnest in 2007, despite the .nl domain being the first country code TLD ever registered.
Olga Holownia of the International Internet Preservation Consortium (IIPC) offered the consortium's successes over the past 15 years as a counterpoint. As of 2018, the consortium includes 56 members - among them national and regional libraries, independent archives, and NGOs. In particular, we heard that web archives in Croatia, Iceland, and Portugal actively invite researchers and in some cases will even fund research using archived content!
Finally, the IIPC collaboratively publishes an excellent tools list, and its members developed the pioneering Memento protocol and the WARC format for storing and sharing web archives. A web archive training curriculum is also being developed and will be released under an open CC license.
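The WARC format mentioned above is conceptually simple: each record is a header block (a version line plus named fields), a blank line, a payload of exactly Content-Length bytes, and a blank-line separator. Production code typically uses a dedicated library such as warcio, but a stdlib-only sketch shows the structure; the record below is hand-built for illustration, not from a real archive:

```python
# Minimal sketch of reading WARC records using only the standard library.
# Illustrates the format: header block, blank line, Content-Length bytes
# of payload, then a blank-line record separator.
import io

def read_warc_records(stream):
    """Yield (headers, payload) pairs from a binary WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.startswith(b"WARC/"):
            continue  # skip separator lines between records
        headers = {}
        while True:
            line = stream.readline().rstrip(b"\r\n")
            if not line:
                break  # blank line ends the header block
            name, _, value = line.partition(b":")
            headers[name.decode().strip()] = value.decode().strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

# A tiny hand-built record for demonstration purposes.
record = (b"WARC/1.0\r\n"
          b"WARC-Type: resource\r\n"
          b"WARC-Target-URI: http://example.org/\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"Hello, WARC!\n"
          b"\r\n\r\n")

for headers, payload in read_warc_records(io.BytesIO(record)):
    print(headers["WARC-Target-URI"], payload)
```

Because every record carries its own headers and length, archives can be concatenated, streamed, and indexed without a central manifest - one reason the format has been so widely adopted.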