Data-Driven Dreams | Brian C. Keegan

What makes misinformation spread? It’s a topic of vital importance, with empirical scholarship going back to the 1940s on how wartime rumors spread. Rumors, gossip, and misinformation are pernicious for many reasons, but because they can reflect deeply held desires or seem reasonably plausible, they are especially hard to stay ahead of or rebut. I have an interest in the spread of misinformation in social media and have published some preliminary research on the topic. So it was fascinating for me to witness misinformation spread like wildfire through my own academic community as it speaks to our data-driven anxieties and dreams.

What we wish to be true

Scientific American published a brief article dated June 1 (but released a week beforehand) titled “Twitter to Release All Tweets to Scientists.” The article claims “[Twitter] will make all its tweets, dating back to 2006, freely available to researchers… everything is up for grabs.” (emphasis added) This claim appears to refer to the Twitter Data Grants initiative announced on February 5th in which they will “give a handful of research institutions access to our public and historical data.” (emphasis added) On April 17, Twitter announced it had received more than 1,300 proposals from more than 60 different countries, but selected only 6 institutions “to receive free datasets.” There have been no subsequent announcements of another Twitter-sponsored data sharing initiative and the Scientific American article refers to a February announcement by Twitter. The semantics of the article’s title and central claim are not technically false, as a corpus of tweets was made available gratis (free as in beer) to a selected set of scientists.

Following in the tradition of popular fact-checking websites such as PolitiFact, I rate this claim MOSTLY FALSE. It is not the case that Twitter has made its historical corpus of tweets available to every qualified researcher. This was the interpretation I suspect many of my friends, collaborators, and colleagues were making. But collectively wishing something to be true doesn’t make it so. In this case, the decision about who does and does not get access to their data was already made on April 17, and Twitter (to my knowledge) has made no public announcements about new initiatives to further open its data to researchers. Nothing appears to have changed in their policies around accessing and storing data through their APIs or purchasing data from authorized resellers such as Gnip. There’s no more Twitter data available to the median scientific researcher now than there was a day, a month, or a year ago.

What we wish could be changed

Indeed, by selecting only 6 proposals out of approximately 1,300 submissions, this proposal process had an acceptance rate of less than 0.5%: lower than most NSF and NIH grant programs (10-20%), lower than the 5.9% of applicants accepted to Harvard’s Class of 2018, lower than Wal-Mart’s 2.6% acceptance rate for its D.C. store, but about average for Google’s hiring. There is obviously a clear and widespread demand by scientific researchers to use Twitter data for a broad variety of topics, many of which are reflected in the selected teams’ interests across public health, disaster response, entertainment, and social psychology. But we shouldn’t cheer 99.5% of interested researchers being turned away from ostensibly public data as a victory for “open science” or data accessibility.
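
The arithmetic behind that figure is easy to check. Here is a minimal Python sketch, assuming exactly 1,300 proposals (Twitter reported “more than 1,300,” so the true rate is, if anything, slightly lower). The function and variable names are my own and purely illustrative.

```python
def acceptance_rate(accepted: int, submitted: int) -> float:
    """Return the acceptance rate as a percentage of submissions."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    return 100.0 * accepted / submitted


# Assumption: 1,300 is used as a round lower bound on the number of proposals.
twitter_rate = acceptance_rate(accepted=6, submitted=1300)
turned_away = 100.0 - twitter_rate

# Comparison figures quoted in the text above (percent).
harvard_rate = 5.9
walmart_rate = 2.6

print(f"Twitter Data Grants acceptance rate: {twitter_rate:.2f}%")
print(f"Share of proposals turned away:      {turned_away:.2f}%")
print(f"Lower than Wal-Mart's D.C. store?    {twitter_rate < walmart_rate}")
print(f"Lower than Harvard's Class of 2018?  {twitter_rate < harvard_rate}")
```

Under that assumption the rate works out to roughly 0.46%, which is where the “less than 0.5%” and “99.5%” figures come from.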

A major part of this attraction is the fact that Twitter data is like an illusory oasis in the desert where the succor seems just beyond the next setback. The data contains the content and social network structure of communication exchanges for millions of users spread across the globe with fine-grained metadata that captures information about audience (followers), impact (retweets/favorites), timestamps, and sometimes locations. Of course researchers from public health, to disaster response, to social psychology want to get their hands on it. With my collaborators at Northeastern University and other institutions, we put in two separate Twitter data grant proposals to study both misinformation propagation and the dynamics around the Boston Marathon bombings. We felt these were worthy topics of research (as have NSF funding panels), but they were unfortunately not selected. So you’re welcome to chalk this post up to my sour grapes, if you’d like.

Twitter’s original call for proposals notes that “it has been challenging for researchers outside the company… to access our public, historical data.” This is the methodological understatement of the decade for my academic collaborators who must either write grants to afford the thousands of dollars in fees resellers charge for this data, build complex computing infrastructures to digest and query the streams of public data themselves, or adopt strategies that come dangerously close to DDoS attacks to get around Twitter’s API rate limits. This is all simply to get the raw material of tweets, which are inputs to still other processes of cleanup, parsing, feature extraction, and statistical modeling. Researchers unfamiliar with JSON, NoSQL, or HDFS must partner with computer and information scientists who go, at considerable peril to their own methods, interests, and infrastructures, into these data mines.

What should be changed

I said before this data is “ostensibly public”, but Twitter has very good reasons to build the walls around this garden ever higher. Let me be clear: Twitter deserves to be lauded for launching such a program in the first place. Hosting, managing, querying, and structuring these data require expensive technical infrastructures (both the physical boxes and the specialized code) as well as expensive professionals to develop, maintain, and use them. Twitter is a private, for-profit company and its users grant Twitter a license to use, store, and re-sell the things they post to it. So it’s well within its rights to charge for or restrict access to this data. Indeed, there are foundational issues of privacy and informed consent that should probably discourage us from making tweets freely available, as there are inevitably quacks hiding among the quantoids like myself submitting proposals.

It’s also important to reflect on issues that Kate Crawford and danah boyd have raised about inequalities and divides in access to data. Twitter did an admirable job in selecting research institutions that are not solely composed of American research universities with super-elite computer science programs. But it doesn’t alter the structural arrangements in which the social, economic, and political interests of those who (borrowing from Lev Manovich) create these data are distinct from the interests of those who collect, analyze, and now share it with other researchers. The latter group obviously sets the terms of who, what, where, when, why, and how this data is used. Twitter would have good reasons to not provide data grants to researchers interested in criticizing Silicon Valley tech culture or identifying political dissenters, yet this model nevertheless privileges Twitter’s interests above others’ for both bad and good reasons.

Some models for big open data

So what might we do going forward? As I said, I think Twitter and other social media companies should be lauded for investing resources into academic research. While it makes for good press and greases the recruitment pipeline, it can still involve non-trivial costs and headaches that can be risky for managers and make investors unhappy. I think there are a few models that Twitter and similar organizations might consider going forward to better engage with the academic community.

First, on the issue of privacy, social media companies have an obvious responsibility to ensure the privacy of their users and to prevent the misuse of their data. This suggests a “visiting researcher” model in which researchers could conduct their research under strict technical and ethical supervision while having privileged access to both public and private metadata. This is a model akin to what the U.S. Census uses as well as what Facebook has been adopting to deal with the very sensitive data they collect.

Second, organizations could create outward facing “data liaisons” who provide a formal interface between academic communities’ interests and internal data and product teams. These liaisons might be some blend of community management, customer service, and ombudsperson who mobilize, respond to, and advocate for the interests of academics. The Wikimedia Foundation is an exemplar of this model as it has staffers and contractors who liaise with the community as well as analytics staff who assist researchers by building semi-public toolservers.

Third, organizations could publish “data dumps” in archives on a rolling basis. These might be large-scale datasets that are useful for basic tasks (e.g., userid to username key-value pairs), anonymized data that could be useful in a particular domain (e.g., evolution of the follower graph), or archives of historical data (e.g., data from 5 years ago). The Wikimedia Foundation and StackOverflow both provide up-to-date data dumps that have been invaluable for academic research by reducing the overhead for researchers to scrape these data together by themselves.

Finally, on the issue of academic autonomy and conflicts of interest, social media companies could adopt a “data escrow” model. Twitter would provide the data to a panel of expert reviewers who could then in turn award it to worthy research teams after peer review. This privileges the academic autonomy and peer review that have underpinned norms of teaching, publishing, funding, and promotion for decades and would prevent Twitter’s conflicts of interest from biasing legitimate academic research interests. Twitter could rotate between different disciplinary themes such as public health or incentivize interdisciplinary themes like crisis informatics. I haven’t seen a model of this in action before, but let me know of any in the comments or… via Twitter 🙂

And stop linking to that goddamned Scientific American article and link to this one instead!