Implications of the Bulgarian National Anthem for Information Security

How did the Puerto Rican reggaeton mega-hit “Despacito” become the national anthem of Bulgaria? For at least a few days in October 2017, Apple’s digital assistant Siri offered this Luis Fonsi-Daddy Yankee collaboration as the answer to the query “What’s the national anthem of Bulgaria?” Somewhere deep in the knowledge graph that powers Siri’s “intelligence,” this erroneous key-value pairing was made available to millions of Siri users around the globe. Technical infrastructures of enormous scale and consequence like Siri depend on data of uncertain provenance and quality, but these data assuredly include content from peer production platforms like Wikipedia, Wikidata, and OpenStreetMap. The dependence of other socio-technical infrastructures on peer production platforms’ data, and the interoperability of that data with them, is a vastly under-appreciated threat vector: what goes into Wikipedia is uncritically refracted and amplified through a complex web of seemingly unrelated technologies.

I am writing this in mid-March as revelations about the relationship between Cambridge Analytica, Facebook, and foreign influence in the 2016 U.S. election raise questions about the technical and ethical boundaries of securing users’ private information on massive platforms. Twitter published a request for proposals to improve “conversational health” after admitting its platform has enabled “abuse, harassment, troll armies, manipulation, misinformation campaigns, and echo chambers.” YouTube CEO Susan Wojcicki announced at SXSW that Wikipedia content would appear alongside conspiracy videos as a de-biasing measure. Peer-produced data sources have been a boon for training the complex artificial intelligence infrastructures that power conversational agents like Siri, for deep learning models used to translate text, and for supporting the basic data fusion and labeling tasks of data scientists everywhere. While Wikipedia has stronger norms around neutrality, verifiability, and civility and better governance models than other social platforms, it is far from perfect: its user base is profoundly unrepresentative, which results in content that reproduces biases about gender, ethnicity, geography, and time.

Socio-technical systems like Wikipedia are designed around the assumption that the motivations and contexts for contributions are constant over time. But many social systems exhibit “burstiness,” characterized by short periods of intense activity followed by long periods of low activity. Activity bursts in systems like Wikipedia are not edge cases but are responsible for significant fractions of total information production and consumption. Knowing the who, what, when, where, why, and how of the production and consumption of peer-produced knowledge around information-seeking bursts will enable powerful applications. However, Wikidata can also be manipulated to destabilize the technologies built upon it by injecting false information that is rapidly ingested and propagated. The pervasiveness and interoperability of peer-produced data in other socio-technical infrastructures are introducing new opportunities and risks. Understanding the practices and consequences of high-tempo collaborations will inform the design of “anticipatory infrastructures” that generate resilient responses to the implicit biases of peer production systems as well as to coordinated misinformation campaigns. Luis Fonsi sang, “tú eres el imán y yo soy el metal”; if peer production platforms are a magnet, who are the metals being attracted to them?