Editor’s note: At the Center for Digital Humanities, we are committed to diversifying DH by working in many languages. This summer, we wrapped up work on the Princeton Ethiopian Miracles of Mary Project, which features miracle stories written in Gəˁəz, or Classical Ethiopic. Starting in spring 2021, the CDH will host a series of workshops, supported by the National Endowment for the Humanities, with the aim of creating Natural Language Processing tools for under-resourced languages.
This academic year, the CDH is partnering with the Princeton Geniza Project (PGP), a database of thousands of medieval documents written in four languages: Judaeo-Arabic, Hebrew, Arabic, and Aramaic. As CDH Lead Developer Rebecca Sutton Koeser explains, the diversity of languages poses technical obstacles for the team.
“The text analysis and search tools we typically use have limited support for Hebrew and Arabic, and were not implemented with ancient languages in mind.”
But what happens if numbers speak different languages?
Below, PGP Co-PI Marina Rustow, Khedouri A. Zilkha Professor of Jewish Civilization in the Near East, considers the challenge of dating the geniza documents, which were found in a synagogue in Cairo. As Rustow writes, the documents refer to multiple calendar systems. Moreover, many of the documents do not include sufficient information for scholars to assign them specific dates. Challenges ensue: How can the team design a database that allows users to search by date if researchers don't know what the documents’ dates are?
Rustow’s reflections are part of a larger discussion within the PGP team about how to acknowledge complexities in dating while providing helpful and reliable information to users. She begins by identifying five intertwined aspects of dating: 1) calendar systems; 2) explicit uncertainty; 3) implicit uncertainty; 4) epistemological uncertainty; and 5) relational data.
The document has been lightly edited for length and clarity for readers new to PGP. For more on dating in DH, see Stanford’s Topotime project.
Dates and temporality are complicated. Historians intuitively handle them in complex ways, but articulating that complexity can be challenging.
This document is an attempt to do so as a step toward representing dates computationally, but with a level of complexity that mirrors the texts we have and the scholarship on them.
1) CALENDAR SYSTEMS, or: How do we render dates?
This is the most technical aspect of dating, and also the simplest: it's the least dimensional, the most precise, and, for the calculated calendars, highly compatible with a computational approach (algorithms and whatnot). The empirical calendars (e.g., the Sunnī and Imāmī Islamic calendars) require observation of the crescent moon to determine the start of months, so they're more complicated to render computationally.
It's also simple because everything in this section assumes that the text is giving us a complete date. It's just giving it to us in a system that we need to convert.
There are multiple calendars used in geniza documents.
- anno mundi (AM), the calendar we now think of as “the” Jewish calendar, though in the Middle Ages, there were others (see below). The months are soli-lunar (some say just plain lunar): the months change with the new moon, but the years alternate in a regular pattern of 12 or 13 months according to the agricultural (solar) calendar, so that the months don’t slip backwards, as happens in the Islamic calendar.
- hijrī (AH = anno hegirae, but even the hard-core orientalists don’t say it that way anymore), the Islamic calendar. The months are lunar. The lunar year is about 11 days shorter than the solar year, so the months slip backwards through the seasons.
- Coptic calendar. Used mainly for months, and they’re solar; used mostly when discussing agriculture and taxation.
- kharājī calendar, the fiscal calendar, which combines Coptic or Islamic months with a variant on hijrī years, usually AH -1 or AH -2 (due to delays in tax collection). We still don't know the kharājī dates for each year, because this dating tends to pop up only in technical administrative and fiscal documents, and those are a new field of exploration, which is a polite way of saying that only like six people know how to read them or care to do so.
- “Seleucid” calendar, in fact a much more common way of representing years in geniza documents than AM; the basic formula is (CE = Sel. – 311), so 1545 Sel. = 1234 CE (but it's actually a bit more complicated than this).
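The Seleucid rule of thumb above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the second function rests on an assumption not stated in the text, namely that the "bit more complicated" part comes from the Seleucid year beginning in the autumn, so that one Seleucid year overlaps two CE years.

```python
from typing import Tuple


def seleucid_to_ce_year(sel_year: int) -> int:
    """Rule of thumb from the text: CE = Sel. - 311."""
    return sel_year - 311


def seleucid_to_ce_span(sel_year: int) -> Tuple[int, int]:
    """ASSUMPTION (not from the text): the Seleucid year starts in the
    autumn, so it straddles two CE years. Returns (earlier, later)."""
    return (sel_year - 312, sel_year - 311)


print(seleucid_to_ce_year(1545))  # 1234, the example given in the text
print(seleucid_to_ce_span(1545))  # (1233, 1234)
```

Returning a two-year span rather than a single year is one way to avoid overstating precision when the document gives a Seleucid year but no month.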
NEXT: Regardless of which calendar a text uses, we need to cite both the original date and its conversion to a common era (CE) date, since talking in CE dates is a convention among historians. There are two kinds of CE dates: Julian and Gregorian.
The changeover was decreed in October 1582. Anything before then should be converted to Julian, and anything after, to Gregorian.
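A minimal sketch of that rule, assuming that a date has already been reduced to a Julian Day Number (JDN) by some calendar converter. The function name is hypothetical; the cutoff used here is the first day of the Gregorian calendar, 15 October 1582 (JDN 2299161), and the arithmetic is the standard integer algorithm for turning a JDN back into a calendar date.

```python
from typing import Tuple

# First day of the Gregorian calendar: 15 October 1582.
GREGORIAN_START_JDN = 2299161


def jdn_to_ce(jdn: int) -> Tuple[int, int, int, str]:
    """Return (year, month, day, calendar) for a Julian Day Number.

    Days before the changeover are rendered in the Julian calendar,
    days from the changeover onward in the Gregorian calendar.
    """
    gregorian = jdn >= GREGORIAN_START_JDN
    f = jdn + 1401
    if gregorian:
        # Correction for the century leap-year rule.
        f += (((4 * jdn + 274277) // 146097) * 3) // 4 - 38
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = ((h // 153 + 2) % 12) + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day, "Gregorian" if gregorian else "Julian"


print(jdn_to_ce(2299160))  # (1582, 10, 4, 'Julian')
print(jdn_to_ce(2299161))  # (1582, 10, 15, 'Gregorian')
```

The two printed days are consecutive: the last Julian date, 4 October 1582, is followed directly by 15 October 1582 in the Gregorian calendar.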
There’s one caveat to this generally happy state of affairs. Sometimes the date mentioned on a document is not when it was written, but a date in the near past or future to which the writer is referring. (It’s only very, very rarely the distant past.) An example is a tax demand note that asks for taxes for the year 425. We know that taxes were paid in advance (whereas rent was paid in arrears), so we can date the document to ca. 423–25 with reasonable certainty (and also some fuzziness).
2) EXPLICIT UNCERTAINTY, or: Documents mock us with their gaps, holes and stains.
Explicit uncertainty is when the text actually gives us a date, but not the whole date.
- A text that gives us the year, but not the month or day.
- A lacuna in the manuscript just where the date is, so you can see the hundreds (the century) but not the decade or digit (no year!).
- A lacuna in one of the digits that can be partly filled if we have the day of the week and month, or enough people in the document to narrow down the options.
- A text that gives us the day and the month, but not the year (maddeningly common, more common than any of the above). Those cases fall under another category because for our purposes, the text is effectively undated, and we have to infer the date in other ways (see the next three entries).
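The lacuna cases above could be handled by turning a partly legible year into the range of years it allows. What follows is a sketch under stated assumptions: writing the missing digits as "?" is a notation invented for this example, the function name is hypothetical, and the resulting range is in whatever calendar the document itself uses, so it would still need converting.

```python
import re
from typing import Tuple


def year_range_from_lacuna(pattern: str) -> Tuple[int, int]:
    """Expand a year with illegible digits into (earliest, latest).

    Each "?" (an assumed notation) stands for one digit lost to a
    lacuna; the leading digit must be legible.
    """
    if not re.fullmatch(r"[0-9][0-9?]*", pattern):
        raise ValueError(f"not a year pattern: {pattern!r}")
    earliest = int(pattern.replace("?", "0"))
    latest = int(pattern.replace("?", "9"))
    return earliest, latest


print(year_range_from_lacuna("4??"))   # (400, 499): only the hundreds survive
print(year_range_from_lacuna("142?"))  # (1420, 1429): last digit lost
print(year_range_from_lacuna("1137"))  # (1137, 1137): nothing lost
```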
3) IMPLICIT UNCERTAINTY, or: The document is undated, but don’t worry, we have a workaround.
A workaround in this case means we have analytical criteria that can yield a date, such as handwriting.
Frequently the best we can do is a range of dates, such as a century. This is computationally significant, because we need ways of searching for documents dated not to a year but to a range of years (fuzzy dates).
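One way to search fuzzy dates is to store every date as an earliest and a latest year, and to treat a search as an overlap test between two ranges. The sketch below assumes that model; the class, the function names, and the sample records are all hypothetical placeholders, not actual PGP data or schema.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DatedDocument:
    label: str
    earliest: int  # first possible year
    latest: int    # last possible year (same as earliest for a hard date)


def overlaps(doc: DatedDocument, start: int, end: int) -> bool:
    """True if the document's possible years intersect [start, end]."""
    return doc.earliest <= end and start <= doc.latest


def search_by_range(docs: Iterable[DatedDocument],
                    start: int, end: int) -> List[DatedDocument]:
    return [d for d in docs if overlaps(d, start, end)]


# Placeholder records for illustration only.
CORPUS = [
    DatedDocument("letter A (placeholder)", 1030, 1030),
    DatedDocument("letter B (placeholder)", 900, 999),
    DatedDocument("letter C (placeholder)", 1100, 1137),
]

for doc in search_by_range(CORPUS, 950, 1050):
    print(doc.label)
```

A search for 950 to 1050 returns both the document with the hard date 1030 and the one dated only to the tenth century, since some part of each range falls inside the query.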
We can sometimes get more precise, as when people appear in the documents (whether mentioned or as their authors) whose active date-range we know (their floruit). See under → relational data.
Sometimes we don't have to deploy workarounds because others have done it before us.
For inferred or fuzzy dates, previous specialists have usually used “ca.” (circa) before or after a hard number, e.g., “ca. 1030,” “ca. 900–950,” or a century or quarter or half-century (e.g., “second half of the 10th century”); or they mention the known dates associated with a person in the document (e.g., “dated documents: 1100–1137”).
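Labels like these could be normalized into machine-readable ranges along the following lines. This is a rough sketch, not a complete parser: the names are hypothetical, it handles only the three patterns quoted above, and it makes the simplifying assumption that "the 10th century" means 900 to 999. It also leaves "ca." as a flag rather than guessing how many years of padding a given scholar intended.

```python
import re
from typing import NamedTuple, Optional


class FuzzyDate(NamedTuple):
    start: int
    end: int
    approximate: bool


ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3}
CENTURY = re.compile(
    r"(?:(first|second|third|fourth) (half|quarter) of the )?"
    r"(\d{1,2})(?:st|nd|rd|th) century"
)


def parse_label(label: str) -> Optional[FuzzyDate]:
    text = label.strip().lower().replace("–", "-")
    approx = text.startswith("ca.")
    if approx:
        text = text[3:].strip()

    # "1030", "900-950", or an abbreviated end year such as "423-25".
    m = re.fullmatch(r"(\d{3,4})(?:-(\d{1,4}))?", text)
    if m:
        first, last = m.group(1), m.group(2) or m.group(1)
        if len(last) < len(first):
            last = first[: len(first) - len(last)] + last
        return FuzzyDate(int(first), int(last), approx)

    # "10th century", "second half of the 10th century", and so on.
    m = CENTURY.fullmatch(text)
    if m:
        base = (int(m.group(3)) - 1) * 100  # ASSUMPTION: 10th = 900-999
        if m.group(1) is None:
            return FuzzyDate(base, base + 99, True)
        span = 50 if m.group(2) == "half" else 25
        index = ORDINALS[m.group(1)]
        if index * span >= 100:  # e.g. "third half" makes no sense
            return None
        return FuzzyDate(base + index * span,
                         base + (index + 1) * span - 1, True)
    return None


print(parse_label("ca. 1030"))
print(parse_label("ca. 900–950"))
print(parse_label("second half of the 10th century"))
```

Anything the function does not recognize comes back as None, so unparsed labels can be flagged for a human rather than silently misread.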
But do we trust these previous scholars? See under → epistemological uncertainty.
4) EPISTEMOLOGICAL UNCERTAINTY, or: Can our teachers be trusted?
It’s helpful for site-users to know where a date is coming from.
If it’s a hard date in the document, they know they can trust it insofar as they trust our decipherment and transcription of it.
If there’s no hard date, they have to decide whether to accept our paleographic dating (dating from handwriting), or our inferential dating (based on floruit or other analytical criteria).
Our basic philosophy around here seems to be this: in cases of legitimate uncertainty or disagreement, let’s give site-users the information and let them decide what to believe. In this case, then, let’s leave a trail of epistemological bread-crumbs to help them.
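The bread-crumb idea might be modeled by attaching a basis and a source to every date, so that the site can always say where a date came from. The sketch below is one possible shape for that record; the class names, the three categories, and the example values are assumptions drawn from the distinctions in this section, not a description of the project's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class Basis(Enum):
    EXPLICIT = "date written in the document"
    PALEOGRAPHIC = "dated from handwriting"
    INFERRED = "inferred from floruit or other criteria"


@dataclass(frozen=True)
class DateAssertion:
    start: int         # earliest CE year
    end: int           # latest CE year
    basis: Basis       # how the date was arrived at
    asserted_by: str   # who proposed it (hypothetical free-text field)

    def describe(self) -> str:
        if self.start == self.end:
            span = str(self.start)
        else:
            span = f"{self.start}-{self.end}"
        return f"{span} CE ({self.basis.value}; per {self.asserted_by})"


claim = DateAssertion(1100, 1137, Basis.INFERRED, "PGP team")
print(claim.describe())
# 1100-1137 CE (inferred from floruit or other criteria; per PGP team)
```

Because each assertion is a separate record, a document could carry several competing dates from different scholars, leaving the choice to the reader.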
5) RELATIONAL DATA, or: OMG we are going to be able to put dates on so many more documents!
Dates can be linked to people, as when we have an undated letter in the hand of someone we know wrote letters between 1100 and 1137.
Dates can (more rarely) be linked to places: Cairo was founded in 972, so if a document says it was written there, it postdates 972.
If these cross-referential dates can be made explicit for our users, we will be way ahead of the game in terms of filling in the dates for our corpus (and thus also being able to do things with the data).
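As a sketch of how such cross-references might be combined, each linked person or place can be treated as a constraint with an optional earliest and latest year, and the document's date is whatever range satisfies all of them. The function name and the representation are hypothetical; the years are the two examples from this section.

```python
from typing import Iterable, Optional, Tuple

# (earliest, latest); None means "no limit on this side".
Bound = Tuple[Optional[int], Optional[int]]


def intersect(constraints: Iterable[Bound]) -> Optional[Bound]:
    """Combine date constraints; return None if they contradict."""
    lo: Optional[int] = None
    hi: Optional[int] = None
    for start, end in constraints:
        if start is not None and (lo is None or start > lo):
            lo = start
        if end is not None and (hi is None or end < hi):
            hi = end
    if lo is not None and hi is not None and lo > hi:
        return None
    return lo, hi


writer_floruit = (1100, 1137)   # known letters by this scribe
written_in_cairo = (972, None)  # founding year as given in the text

print(intersect([writer_floruit, written_in_cairo]))  # (1100, 1137)
print(intersect([written_in_cairo]))                  # (972, None)
```

A None result signals that the linked data disagree, which is itself useful: it points to a misidentified person, place, or reading that someone should check.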
UPDATE: Team members continue to discuss these issues as they work on the project.
“We’re still determining how we’re going to implement dates in the new relational database for this project, but our current thinking is that we’ll need to store dates as written, with all their lacunae, alongside a normalized machine-readable date or date range where one can be calculated,” Koeser explains.
“People, places, and documents can all potentially have dates, and it’s exciting to think about ways of automatically inferring related dates and surfacing that information and the logic behind it to people using the data.”
[Carousel image: Cambridge Digital Library, University of Cambridge]