Creating an Argument from Data

1 May 2023

Collections as Data

“I wonder why this novel is so focused on class?”
“What does this data tell me about contemporary notions of family?”
“What patterns exist in the rhetoric of documents from the USDA?”

Any project, whether it be an article, monograph, experiment, or computational work, starts from a curiosity. Turning this curiosity into a well-formed research question is crucial to a successful project. There are established methods for this process in more traditional scholarship, but in a DH or computational project it can be tempting to sacrifice your question by molding it to the capabilities of a tool, rather than finding a tool or tools that can help you investigate your question. A strong research question is pivotal to staying on track with your original curiosity.

To help our students, like those in HUM 307: Literature as Data (read more about the course here), create a research question from data, I rely first on the traditional methods of writing a thesis statement. As a composition instructor at the University of Oregon, I, like my colleagues, used John Gage’s book The Shape of Reason and his concept of the enthymeme, or reasoned thesis, to teach argumentative writing. The enthymeme goes further than a typical thesis statement by shaping the thesis like a syllogism: the first half states your claim, and the second half states your reason for believing that claim to be true. The form is rigid, but it asks students to think about the underlying assumptions of their ideas and to make those ideas more precise. Once a reasoned thesis is in place, it becomes clearer to the author what the essay needs to address.

Reasoned Thesis = Research question phrased as a statement + Hypothesis

Claim about a subject (the main thing you’re investigating) + because/therefore + a reason that shares that subject

  • EXAMPLE: The spotted owl should be protected because it is an indicator species.
    • Assertion: The spotted owl should be protected.
    • Reason: The spotted owl is an indicator species.
    • Assumption: Anything that is an indicator species should be protected.

Using this example, a reader would expect to see a definition of “indicator species,” an explanation of what “protected” means in this context, and a conclusion that affirms the thesis.
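For readers who like to see structure made explicit, the three parts of an enthymeme can even be modeled as a small data structure. Here is a minimal sketch in Python (the Enthymeme class and its thesis method are my own illustration, not anything from Gage’s book):

from dataclasses import dataclass

@dataclass
class Enthymeme:
    # A reasoned thesis: a claim, the reason for it, and the assumption linking them
    assertion: str
    reason: str
    assumption: str

    def thesis(self):
        # Join the claim and its reason in the 'claim because reason' shape,
        # naively lowercasing the first letter of the reason
        return f"{self.assertion} because {self.reason[0].lower()}{self.reason[1:]}"

owl = Enthymeme(
    assertion="The spotted owl should be protected",
    reason="It is an indicator species",
    assumption="Anything that is an indicator species should be protected",
)
print(owl.thesis())
# The spotted owl should be protected because it is an indicator species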

The same method, with a few tweaks, works well for creating a data-driven project with both students and researchers. When using this method to create a thesis from data, the research question is bound not only by the enthymeme form but also by the bounds of the data itself.

For instance, one of my favorite projects is “The Largest Vocabulary in Hip-Hop” by Matt Daniels, which takes the first 35,000 words of each rapper’s oeuvre and charts the size of each rapper’s vocabulary against Shakespeare and Melville. Curiosity has led to the research question “Which rapper has the largest vocabulary?” From his data, Daniels is not looking to find the best rapper ever, or the fastest, or the most popular. Instead, he is looking for something measurable and knowable from the dataset he has. To answer this question, Daniels has determined that “vocabulary” means the percentage of unique words each rapper uses in their first 35,000 words: the higher the percentage of unique words, the larger the vocabulary. But how does a computer determine unique words, that is, how many distinct strings appear in a 35,000-word corpus? Only now do we decide which tool can best determine which strings are unique.

Python can do this type of calculation using a type/token ratio, which compares the number of unique words to the total number of words in a particular collection, in this case the first 35,000 words of a rapper’s published lyrics. We would first need to translate our research question into a lexicon Python can understand:

  • Number of Unique Words = Number of Types
  • Number of All Words = Number of Tokens
  • Vocabulary Size = Number of Types
  • “Lexical Diversity” = Types / Tokens = Type/Token Ratio

Using this translation we can then execute this calculation in Python:

# A string representing the oeuvre of a rapper
aesop_rock = """All the lyrics of Aesop Rock here"""

# Lowercase and 'tokenize', or split into words
aesop_words = aesop_rock.lower().split()

# Count the number of tokens (all words)
aesop_num_tokens = len(aesop_words)

# Count the number of *unique* tokens (types)
aesop_num_types = len(set(aesop_words))

# Calculate the type/token ratio: types divided by tokens
aesop_ttr = aesop_num_types / aesop_num_tokens

Python code by Ryan Heuser
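One detail the snippet above leaves out is Daniels’ cap: each oeuvre is limited to its first 35,000 words so that every artist (and Shakespeare and Melville) is measured on a same-sized sample. A minimal sketch of that step, reusing the toy string from above (the SAMPLE_SIZE name and the slicing approach are my own illustration):

# Same toy corpus as above
aesop_rock = """All the lyrics of Aesop Rock here"""
aesop_words = aesop_rock.lower().split()

# Cap the oeuvre at its first 35,000 words so every artist is
# measured on a same-sized sample
SAMPLE_SIZE = 35_000
aesop_sample = aesop_words[:SAMPLE_SIZE]

# Recompute the type/token ratio on the capped sample
aesop_sample_ttr = len(set(aesop_sample)) / len(aesop_sample)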

Though Daniels doesn’t go so far as to make an interpretive analysis of his graph, finding through computational methods that Aesop Rock has the largest vocabulary invites the interpretation that he also has the most complex lyrics, and that his vocabulary exceeds that of Shakespeare and Melville, as the graph shows. In this way, the computation becomes evidence for the larger argument and not the argument itself.

  • EXAMPLE: Aesop Rock has the largest vocabulary; therefore, he has the most complex lyrics in this corpus.
    • Assertion: Aesop Rock has the most complex lyrics in this corpus.
    • Reason: Aesop Rock has the largest vocabulary among rappers.
    • Assumption: Anyone who has the largest vocabulary among rappers has the most complex lyrics.
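Because “largest vocabulary” is a comparative claim, the calculation has to run across the whole corpus before it can serve as evidence. A minimal sketch of that comparison, assuming a hypothetical lyrics_by_artist dictionary with placeholder strings standing in for real lyrics:

# Hypothetical corpus: artist name -> full lyrics as one string
lyrics_by_artist = {
    "Aesop Rock": "All the lyrics of Aesop Rock here",
    "Another Artist": "All the lyrics of another artist here",
}

def type_token_ratio(text, sample_size=35_000):
    # Lowercase, tokenize, and cap at the first sample_size words
    words = text.lower().split()[:sample_size]
    return len(set(words)) / len(words)

# Rank artists from largest to smallest vocabulary
for artist in sorted(lyrics_by_artist, key=lambda a: type_token_ratio(lyrics_by_artist[a]), reverse=True):
    print(artist, round(type_token_ratio(lyrics_by_artist[artist]), 3))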

Keeping the original curiosity central to a data-driven project ensures that the tool is working in service of the question, instead of the other way around. The enthymeme format also reveals the underlying assumptions implied by the question, which helps expose potential biases early in the project. Though this method may take a bit longer at the outset, creating a strong thesis will make for a more robust and focused data-driven exploration.