Private Signals, Opaque Models, and an AI-Surveillance World
6 May 2024
Reflections on the first LLM forum and a growing discomfort with content privacy in an AI-hungry world of monetized surveillance
In the 2023–24 academic year, the Center for Digital Humanities held a new speaker series focused on how Artificial Intelligence (AI), particularly Large Language Models (LLMs) like ChatGPT, is changing our world as researchers, teachers, and individuals. The Princeton LLM Forum, co-organized with the Department of Computer Science and supported by the Humanities Council, brought together leading scholars and researchers to discuss the implications of LLMs for our understanding of language, society, culture, and theory of mind. Consistent with AI's broad applications and implications, the LLM Forum speakers represented a wide range of disciplines and backgrounds, from literature to politics to data science. Over the course of the series, the CDH hosted four guests, each paired with a Princeton faculty respondent. In October 2023, we kicked off the forum with Meredith Whittaker (President of Signal) and respondent Arvind Narayanan (Computer Science; Center for Information Technology Policy) on the topic of Society. As the academic year comes to a close, the CDH’s Lead Research Software Engineer, Rebecca Sutton Koeser, reflects on that first discussion and shares some of her impressions (and concerns) below.
Sending signals
I use the Signal encrypted messaging app every day.
I started using it a few years ago because my father, a retired network engineer, wanted to use a messaging system with encryption. We briefly used WhatsApp, until he learned that it was owned by Facebook. End-to-end encryption isn’t worth much when someone you don’t trust has access to one (or both!) of the endpoints. I mostly use Signal to stay in touch with my parents and share pictures and stories about my children’s latest adventures; I’m slowly starting to gather other friends and colleagues as contacts. I don’t generally use it for sensitive information (except perhaps very rarely exchanging financial information with family), and I’ve never felt that any of my communication required encryption. Honestly, at first, I was just using it to honor my father.
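Since I build software for a living, a toy example helps me make that point concrete. The sketch below is not Signal’s actual protocol, just a minimal illustration using the PyNaCl library, with made-up names and a made-up message: the relay in the middle only ever sees ciphertext, but the device holding the recipient’s key recovers the plaintext in full, so encryption in transit can’t protect you from an endpoint you don’t trust.

```python
# A toy sketch of end-to-end encryption (not Signal's real protocol),
# using the PyNaCl library (pip install pynacl). Names are illustrative.
from nacl.public import PrivateKey, Box

# Each endpoint generates its own keypair; only public keys are shared.
alice_key = PrivateKey.generate()
bob_key = PrivateKey.generate()

# Alice encrypts a message that only Bob's private key can open.
message = b"photos from the kids' latest adventure"
ciphertext = Box(alice_key, bob_key.public_key).encrypt(message)

# A server relaying the message sees only opaque bytes (nonce + ciphertext)...
print(len(ciphertext), "bytes of ciphertext on the wire")

# ...but Bob's endpoint recovers the plaintext in full. If the app or
# device holding bob_key belongs to someone you don't trust, the
# encryption in transit hasn't protected the content at all.
plaintext = Box(bob_key, alice_key.public_key).decrypt(ciphertext)
print(plaintext.decode())
```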
Meredith Whittaker and Arvind Narayanan on the implications of LLMs for society
This familiarity with Signal was my primary context going into the LLM forum conversation between Meredith Whittaker and Arvind Narayanan back in October 2023, on the topic of Large Language Models and Society. The conversation between Whittaker and Narayanan was illuminating, entertaining, wide-ranging, and thought-provoking. There were moments when you could tell just about everyone in the room was jotting down a funny or insightful comment.
Throughout the conversation, it was clear to me how much language matters and informs the way we think about technologies. To open the discussion, Whittaker gave us a brief history of how “Artificial Intelligence” became a branding and sales-pitch term, initially used to differentiate academic territory from “cybernetics.” She repeatedly pointed out power dynamics: how the hierarchies of the field, and the structures of academic computer scientists working with industry, are a kind of “capture,” and how data capture and data gathering always enact a power dynamic of some kind. Narayanan, co-author of the book AI Snake Oil, shared his own experience of how difficult it is to work on privacy in computer science, where it sits structurally within the subfield of information security; the ideologies of the people reviewing his work were often quite opposed to it.

Whittaker also discussed AI as surveillance, arguing that instead of “microtargeting,” it would be more accurate to call this kind of marketing “surveillance advertising” (a much more disturbing term!). She noted that the people who build the technology don’t necessarily control it, and emphasized the need for explainers: people don’t understand the technology, but they are afraid to look stupid by asking questions, which is particularly concerning when it comes to government officials making decisions and writing policies related to technology.
During a discussion of open models and open LLMs, Whittaker pushed on the idea of “open” a bit: how open is it, really, if it lives in an ecosystem controlled by big tech, where you’re required to use their tools and frameworks (e.g., PyTorch) even to play? For her, calling these efforts open-source models is a misnomer.
Growing discomfort with privacy and content in online spaces
Whittaker's comments about her work with Signal are the ones that have stayed with me the most. She described how hard Signal works to protect user data from Signal itself, something that is not the default and that goes against the grain of our current surveillance culture. I have access to server logs and analytics for our CDH web applications, so I have a sense of this kind of access, on a much smaller and less critical scale. She also talked about how expensive it is to run Signal (the Signal Foundation is a nonprofit organization), and how they have to be on servers throughout the network in order to get the kind of response time that people expect from a messaging application. As Lauren Klein has written, “all technologies are imbricated in … unequal power” (see Klein’s 2022 essay “Are Large Language Models Our Limit Case?” in Startwords issue 3).
The increased visibility of big tech companies scooping up vast swaths of content from the internet to create large language and image models is changing how we think about and protect our own content. I’ve heard researchers like David Bamman (who gave a talk in February on “The Promise and Peril of LLMs for Cultural Analytics”) make the case that this work is truly transformative and constitutes “fair use”; maybe it’s fair in a technical sense, but it certainly feels uncomfortable. And for me, that discomfort is growing, which is probably a good thing.

The recent change to Dropbox’s terms of service to allow sharing with third-party AI providers prompted a discussion among CDH staff about our Slack usage and data retention. I tend to avoid posting pictures of my children on social media, but until recently I’d felt that Slack was a “safe” place for that content. Several of us joined a livestream of a recent talk Matt Jones gave for a CITP seminar on the history of surveillance at scale, and on how the old assumption that the content of communications should be protected, but not the metadata, is dangerous and outdated in today’s world. Sometime after the LLM forum, a Firefox browser update prompted me to try popular plugins, and I installed the NoScript security plugin; it sometimes makes browsing the web a lot less convenient, but it also makes online tracking much more visible. CDH faculty director Meredith Martin recently shared the Opt Out Project by Janet Vertesi, Associate Professor of Sociology and CDH Executive Committee member. I find her advice encouraging: it’s okay to start with one system at a time. One of my lines has always been Facebook; I’ve never had an account. But I’m deeply embedded in Google systems, both for work and for personal content; maybe it’s time to start assessing and planning where else I personally can opt out.
The stories and pictures of my children that I share with my parents and siblings through Signal messages are precious to me, and I want them preserved. However, it’s becoming clear to me that this data, along with large swaths of my work and communications online, is increasingly sensitive and worth protecting.