My 15 Minutes of Fame as a B-List Gamergate Celebrity

Culture, Data, Politics

On Monday, October 27 Andy Baio posted an analysis of 72 hours of tweets with the #Gamergate hashtag. With the very best of intentions, he also shared the underlying data containing over 300,000 tweets saved as a CSV file. There are several technical and potential ethical problems with that, which I’ll get to later, but in a fit of “rules are for thee, not for me,” I grabbed this very valuable data while I could, knowing that it wouldn’t be up for long.

I did some preliminary data analysis and visualization of the retweet network using this data in my spare time over the next day. On Wednesday morning, October 29, I tweeted out a visualization of the network describing the features of the visualization and offering a preliminary interpretation, “intensely retweeting and following other pro-#gamergate is core to identity and practice. Anti-GG is focused on a few voices.” I intended this tweet as a criticism of pro-Gamergaters for communicating with each other inside an insular echo chamber, but it was accidentally ambiguous and it left room for other interpretations.

The tweet containing the image has since been retweeted and favorited more than 300 times. I also received dozens of responses, ranging from benign questions about how to interpret the visualization, to more potentially problematic questions about identifying users, to responses that veered into motivated and conspiratorially-flavored misreadings. Examples of the latter are below:

To be clear, I do not share these interpretations and I’ll argue that they are almost certainly incorrect (a good rule of thumb is to always back away slowly from anyone who says “data does not lie“). But I nevertheless feel responsible for injecting information having the veneer of objectivity into a highly charged situation. Baio mentioned in his post that he had a similar experience in posting results to Twitter before writing them up in more detail. A complex visualization like this is obviously ripe for misinterpretation in a polarized context like Gamergate, and I wasn’t nearly clear enough in describing the methods I used or the limitations on the inferences that can be drawn from this approach. I apologize for prematurely releasing information without doing a fuller writeup about what went in and what you should take away.

So let’s get started.

Data collection

I will have to defer to Baio for the specific details on how he collected these original data. He said:

“So I wrote a little Python script with the Twython wrapper for the Twitter streaming API, and started capturing every single tweet that mentioned the #Gamergate and #NotYourShield hashtags from October 21–23.”

This data collection approach is standard, but has some very important limitations. First, this only looked at tweets containing the hashtags “#gamergate” and “#notyourshield.” These hashtags have largely been claimed by the “pro-gamergate” camp, but there are many other tweets on the topic of Gamergate under other partisan hashtags (e.g., “#StopGamerGate2014”) as well as tweets from people speaking on the topic but consciously not using the hashtag to avoid harassment. So the tweets in this sample are very biased towards a particular community and should not be interpreted as representative of the broader conversation. A second and related point is that these data do not include other tweets made by these users. On a topic like Gamergate, users are likely to be involved in many related and parallel conversations, so grabbing all these users’ timelines would ideally give a fuller account of the context of the conversation and other people involved in their mentions and replies.

A third point is that Baio’s data was saved as a comma-separated value (CSV) file, which is a common way of sharing data but a non-ideal way to share textual data. Reading the data back in, many observations end up improperly formatted because errant commas and apostrophes in the tweets break fields prematurely. So much of the analysis involves checking that the values of fields are properly formatted and tossing the entries that aren’t. Out of 307,932 tweets, various stages of data cleanup will toss thousands of rows for being improperly formatted, depending on the kind of analysis I’m focusing on. While this was not a complete census of data to begin with, this is still problematic: the discarded data are likely non-random because they contain mischief-causing commas and apostrophes, which is another important caveat. Please use a format like JSON (ideally) to share textual data like this in the future!
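To make the cleanup step concrete, here is a minimal sketch (with made-up tweet text and a made-up two-field schema, not Baio’s actual columns) of why JSON Lines round-trips messy tweet text exactly, while a naively-written CSV forces you to toss malformed rows:

```python
import csv
import io
import json

# Made-up sample: tweet text full of the commas and quotes that broke
# fields in the shared CSV.
tweets = [
    {"id": 1, "text": "Commas, \"quotes\", and apostrophes don't faze JSON"},
    {"id": 2, "text": "A plain tweet"},
]

# JSON Lines round-trips the text exactly, one tweet per line.
jsonl = "\n".join(json.dumps(t) for t in tweets)
recovered = [json.loads(line) for line in jsonl.splitlines()]
assert recovered == tweets

# Reading a CSV defensively: toss any row without the expected number
# of fields -- the "tossing improperly formatted entries" step above.
EXPECTED_FIELDS = 2
raw_csv = 'id,text\n1,"quoted, safely"\n2,broken, extra, commas\n'
rows = list(csv.reader(io.StringIO(raw_csv)))[1:]  # skip header
clean = [r for r in rows if len(r) == EXPECTED_FIELDS]
print(len(rows), len(clean))  # 2 rows read, only 1 survives cleanup
```

The second CSV row has unquoted commas, so it splits into four fields and gets discarded, which is exactly the kind of non-random data loss described above.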

To review: besides covering only a three-day window, this dataset doesn’t include other Gamergate-related conversations occurring outside of the included hashtags, ignores participating users’ contextual tweets during this timeframe, and throws out data for tweets containing particular grammatical features. With these important caveats now made explicit, let’s proceed with the analysis.

Data analysis

Baio worked with the very awesome Gilad Lotan to do some network analysis of the follower network. I wanted to do something along similar lines, but looking at the user-to-user retweet network to understand how messages are disseminated within the community. For our purposes of looking at the structure of information sharing in GamerGate, we can turn to some really interesting prior scholarship on how retweet networks can be used to understand political polarization [1,2] and what factors influence people to retweet others [3,4]. That work does far more in-depth analysis and modeling than I’ll be able to replicate for a blog post in the current time frame, but I wanted to highlight a few findings. boyd and her coauthors [3] identify a list of uses for retweeting, including amplifying information to new audiences, entertaining a specific audience, making one’s presence as a listener known to the author, or otherwise agreeing with, validating, or demonstrating loyalty to the author. These are obviously not exhaustive of all the uses of the retweet, but they can help frame the goals users have in mind when retweeting.

Using additional metadata in the file about the number of statuses and followers a user has around the time of each tweet, I can create additional variables. One measure is “tweet_delta”, or the difference between the maximum and minimum observed values of a user’s “user_statuses_count” field recorded across their tweets. This ideally captures how many total tweets the user made, including those outside the observations in the dataset. A second, related variable, “tweet_intensity”, is the ratio of tweets in the (cleaned) dataset to the tweet_delta. This value should range between 0 (none of the tweets this user made over this timespan contain #Gamergate/#NotYourShield) and 1 (all of the tweets this user made over this timespan contain #Gamergate/#NotYourShield).
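Both measures fall out of a simple pandas groupby; here is a sketch with hypothetical usernames and counts (the real field name, per the dataset, is “user_statuses_count”):

```python
import pandas as pd

# Hypothetical minimal frame: one row per captured tweet, carrying the
# user's running statuses counter at the time of that tweet.
df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "user_statuses_count": [100, 104, 110, 50, 52],
})

counts = df.groupby("user")["user_statuses_count"]
stats = pd.DataFrame({
    "tweets_in_sample": df.groupby("user").size(),
    # tweet_delta: total tweets the user made over the window, inferred
    # from the change in their statuses counter.
    "tweet_delta": counts.max() - counts.min(),
})
# tweet_intensity: the share of a user's tweeting over the window that
# landed in the hashtag sample (1.0 = every tweet used the hashtags).
stats["tweet_intensity"] = stats["tweets_in_sample"] / stats["tweet_delta"]
print(stats)
```

Here user “a” tweeted 10 times overall but only 3 landed in the sample (intensity 0.3), while user “b” tweeted entirely inside the sample (intensity 1.0).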

A third measure is “friend_delta”, or the difference between the maximum and minimum observed values for the number of other users that a given user follows. Like the “tweet_delta” above, this captures how much a user’s set of friends (I prefer the term “followees”, but “friends” is the official Twitter term) changed over the observation window. A similar measure can be defined for followers. Since you have less control over who or how many people follow you, friends/followees is a better metric for measuring changes in an individual’s behavior, like actively seeking out information by creating new followee links. This value varies between 0 (no change in followees) and n (where n is the largest change in followees observed over these 3 days).

I wrote the data cleanup and analysis in an IPython Notebook using the pandas, networkX, and seaborn libraries and visualized the data using Gephi. While I have posted the code to replicate these analyses to my GitHub for others to inspect or use on other data they’ve collected, I’ve decided not to share the data itself owing to very real concerns I have about how it might be used to target individuals for harassment, in addition to secondary concerns about how far Twitter’s terms of use extend to secondary data and the pragmatic fact that the files are larger than GitHub is willing to host.

The retweet graph visualization

The basic network relationship I captured was whether User A retweeted User B. This network has directed (A retweeting B is distinct from B retweeting A) and weighted (A could retweet many of B’s tweets) connections. Again, to be explicit, the large colored circles are users and the colored lines connecting them indicate that one retweeted the other (read an edge “A points to B” in the clockwise direction). This is not every retweet relationship in the data, but only those nodes belonging to retweet relationships where A retweeted B at least twice. This has the effect of throwing out even more data and structural information (so inferences about the relative size of clusters should reflect that single instances of retweeting have been discarded), but it reveals the core patterns. This is an extremely coarse-grained approach and there are smarter ways to highlight the more important links in complex networks, but this is cheap and easy to do.
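The construction just described can be sketched in a few lines of networkx (the retweeter/retweeted pairs here are invented for illustration):

```python
import networkx as nx

# Hypothetical edge list: (retweeter, retweeted) pairs observed in the
# data; repeated pairs accumulate into edge weights.
retweets = [("a", "b"), ("a", "b"), ("c", "b"), ("a", "d"),
            ("d", "b"), ("d", "b"), ("d", "b")]

G = nx.DiGraph()
for src, dst in retweets:
    if G.has_edge(src, dst):
        G[src][dst]["weight"] += 1
    else:
        G.add_edge(src, dst, weight=1)

# Keep only relationships where A retweeted B at least twice, then drop
# the nodes left isolated -- the same coarse filter described above.
weak = [(u, v) for u, v, w in G.edges(data="weight") if w < 2]
G.remove_edges_from(weak)
G.remove_nodes_from(list(nx.isolates(G)))

# Node size in the figure maps to in-degree: how many distinct users
# retweeted this account within the filtered graph.
print(dict(G.in_degree()))
```

After filtering, the single retweets (c of b, a of d) vanish and “b” remains the hub with in-degree 2, which is the kind of core pattern the visualization surfaces.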

The x and y coordinates don’t have any substantive meaning like in a scatterplot; instead, I used the native ForceAtlas2 force-directed layout algorithm to position nodes relative to each other such that nodes with more similar patterns of connections are closer together. Making this look nice is more art than science and most of you can’t handle all my iterative layout heuristics jelly.

  • I’ve sized the nodes on “in-degree” such that users that are retweeted more by many unique users are larger and users that are retweeted less are smaller.
  • The color of the node corresponds to the “friend_delta” such that “hotter” colors like red and orange are larger changes in users followed and “cooler” colors like blue and teal are 0 or small changes in users followed. Nodes are colored grey if there’s no metadata available for the user.
  • The color of the link corresponds to the “weight” of the relationship, or the number of times A retweeted B. Again hotter colors are more retweets of the user and cooler colors are fewer retweets within the observed data.


Manual inspection of a few of the largest nodes in the larger cluster reveals that these are accounts that I would classify as “pro-Gamergate,” while the largest nodes in the cluster in the lower left I would classify as “anti-Gamergate.” I didn’t look at every node’s tweet history or anything like that, so maybe there are some people on each side being implicated by retweet association. There were a lot of questions about who the large blue anti-GG node is. Taking him at his word as someone who would welcome being targeted, this is the “ChrisWarcraft” account belonging to Chris Kluwe, who tweeted out this (hilarious) widely-disseminated post on October 21, which is during the time window of our data.

Let’s return back to my original (and insufficient) attempt at interpretation:

The technical term that network scientists like myself use for images like the one above is “hairball”: they often offer more sizzle than steak in terms of substantive insights. Eyeballing a diagram is a pretty poor substitute for doing statistical modeling and qualitative coding of the data (much more on this in the next section). Looking at a single visualization of retweet relationships from three days of data on a pair of hashtags can’t tell you a lot about authoritarianism, astroturfing, or other complex issues that others were offering as interpretations. I don’t claim to have the one “right” answer, but let me try to offer a better interpretation.

  • The pro-GG sub-community is marked by high levels of activity across several dimensions. They retweet each other more intensively (their cluster is larger in a network where every edge represents at least 2 retweets). They are actively changing who they follow more than the anti-GG group (though this would need an actual statistical test). It’s certainly the case that participants are highly distributed and decentralized, but as I discuss more below, it also suggests they’re highly insular: retweeting each others’ content is an important part of supporting each other and making sense of outside criticism by intensively sharing information.
  • I suspect the anti-GG sub-community is smaller not because there are fewer people opposed to GG, but that the data analysis and visualization choices Baio and I made included only those people using the hashtag and excluded people who only retweeted once. In other words, one shouldn’t argue there are more Republicans than Democrats by only looking at highly active #tcot users. Ignoring Kluwe’s post as an outlier, the anti-GG sub-community looks smaller but similarly dense.
  • There’s a remarkable absence of retweeting “dialogue” between the two camps, something that’s also seen in other online political topics. Out of the thousands of users in the pro-GG camp, only 2 appear to retweet Kluwe’s rant. So contra the “diversity” argument, there actually appears to be a profound lack of information being exchanged between these camps which suggests they’re both insular. But if #Gamergate is where a lot of the pro-GG discussion happens while anti-GG discussion happens across many other channels not captured here, we can’t say much about anti-GG’s size or structure but we can have more confidence about what pro-GG looks like.
  • The reaction among the pro-GG crowd to my visualization also gives me an unanticipated personal insight into the types of conversations that this image became attached to. The speed and extent to which the visualization spread, the kinds of interpretations it was used to support, and the conversations it sparked all suggested to me that there were many pro-Gamergaters looking for evidence to support their movement, denigrate critics, or delegitimize opponents. My first-hand experience observing these latter two points (the tweets above being a sample) lend further weight to many other critics’ arguments about these and other forms of harassment being part and parcel of tactics used by many pro-GGers.


If anything, I hope this exercise demonstrates that while visualization is an important part of the exploratory data analysis workflow, hairballs will rarely provide definitive conclusions. I already knew this, but as I said before, I should have known better. But to really drive home the point that you might fashion the same hairball visualization to support very different conclusions, here are some more hairballs below from the very same dataset using other kinds of relationships.

First off, here is the mention network where User A is linked to User B if User B’s account is mentioned in User A’s tweet. Now this is a classic hairball (Kluwe is again the isolated-ish green node in the upper left, for those of you keeping score at home). Links of weight 1 are black and higher weights range from cool to hot. Unlike the highly polarized retweet network, here we have an extremely densely-connected core of nodes. I’ve decided to color the nodes by a different attribute than above, specifically the normalized degree difference. This is calculated as (out-degree – in-degree)/(out-degree + in-degree) and varies from -1, where a user receives only mentions but never mentions anyone else (bluer), to 1, where a user only makes mentions but is never mentioned by anyone else (redder). There’s really no discernible structure as far as I can tell, and anti-GG accounts are mixed in with pro-GG accounts and other accounts like Adobe and Gawker that have been caught up.
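The normalized degree difference is straightforward to compute directly; here is a sketch on a tiny hypothetical mention graph (the node names are invented to illustrate the two extremes):

```python
import networkx as nx

# Hypothetical mention graph: an edge A -> B means A mentioned B.
M = nx.DiGraph()
M.add_edges_from([
    ("harasser", "target"), ("harasser", "other"),
    ("chatty", "target"), ("fan", "target"),
])

def norm_degree_diff(g, node):
    """(out - in) / (out + in): -1 means only mentioned by others,
    +1 means only mentions others, 0 means balanced."""
    out_d, in_d = g.out_degree(node), g.in_degree(node)
    total = out_d + in_d
    return 0.0 if total == 0 else (out_d - in_d) / total

print(norm_degree_diff(M, "harasser"))  # 1.0: talks at others only
print(norm_degree_diff(M, "target"))    # -1.0: talked at, never replies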


But the node colors do tell us something about the nature of the conversation: namely, there are very many nodes that appear to be engaged in harassment (red nodes talking at others but not being responded to) and many nodes that are being targeted for harassment (blue nodes being talked at but not responding, like “feliciaday”). Indeed, plotting this relationship out, the more tweets a user makes mentioning another account (x-axis), the lower their normalized degree difference (y-axis). I’ve fit a lowess line in red to clarify this relationship. In other words, we’re capturing one feature of harassment: more tweets mentioning other people buys you more responses from others up to about a dozen tweets, and then continuing to tweet mentioning other people results in fewer people mentioning you in return.


Second, here’s the multigraph containing the intersection of the retweet and mention networks. User A is linked to User B if A both retweeted and mentioned B within the dataset. Unlike the previous plots, I haven’t filtered the data to include only edges and nodes above weight 1, so there are more nodes and weaker links present. I’ve colored the nodes here by account_age, or the number of days the account existed before October 24. Bluer nodes are accounts created in recent weeks, redder nodes are accounts that have existed for years, and grey nodes are ones we have no data on. I’ve left the links as black rather than coloring by weight, but the edges are still weighted to reflect the sum of the number of mentions and the number of retweets. This network shows a similarly polarized structure as the retweet network above. Manual inspection of nodes suggests the large, dense cluster of blue nodes in the upper-right is pro-GG and the less dense cluster of greener nodes in the lower-left is anti-GG. By overlapping the data in this way, we have another perspective on the structure of a highly-polarized conversation. The pro-GG camp looks larger in size, owing to the choice not to discard low-weight links, but anti-GG participation still does not appear as intense and cohesive as in the tightly-connected pro-GG camp, which again suggests more insularity.
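The intersection network can be assembled by keeping only the user pairs present in both edge sets (hypothetical pairs and counts below; for simplicity I use a plain DiGraph with summed weights rather than a true multigraph):

```python
from collections import Counter
import networkx as nx

# Invented counts: how many times A retweeted B, and A mentioned B.
retweet_counts = Counter({("a", "b"): 2, ("c", "b"): 1, ("a", "d"): 1})
mention_counts = Counter({("a", "b"): 3, ("a", "d"): 1, ("b", "a"): 1})

# Keep only pairs where A both retweeted AND mentioned B; weight each
# surviving edge by the sum of retweets and mentions, as in the figure.
G = nx.DiGraph()
for pair in retweet_counts.keys() & mention_counts.keys():
    G.add_edge(*pair, weight=retweet_counts[pair] + mention_counts[pair])

print(sorted(G.edges(data="weight")))
```

Only the (a, b) and (a, d) pairs appear in both relations, so only those edges survive, with weights 5 and 2 respectively.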


It’s also worth noting there are substantially more new accounts in the pro-GG camp than the anti-GG camp. We can examine whether there’s a relationship between the age of the account and the clustering coefficient. The clustering coefficient captures whether one’s friends are also friends with each other: the pro-GG camp appears to have more clustering and more new accounts while the anti-GG camp appears to have less clustering and older accounts. The boxplots below bear this rough relationship out: as the clustering coefficient increases (the other users mentioned by a user also mention each other), the average age of these accounts goes down substantially. This also seems to lend more weight to the echo chamber effect: newly created accounts are talking within dense networks that veer towards pro-GG while older accounts are talking within sparser networks that veer towards anti-GG.
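For the curious, the clustering coefficient itself is one networkx call; a toy example with an invented dense triangle (an “echo chamber”) versus a sparse star:

```python
import networkx as nx

# Hypothetical undirected mention network: a triangle whose members all
# mention each other, plus a hub whose contacts never interact.
G = nx.Graph()
G.add_edges_from([("p1", "p2"), ("p2", "p3"), ("p1", "p3"),   # dense cluster
                  ("hub", "x"), ("hub", "y"), ("hub", "z")])  # sparse star

# clustering(): fraction of a node's neighbor pairs that are themselves
# connected -- 1.0 means everyone you talk to also talks to each other.
cc = nx.clustering(G)
print(cc["p1"], cc["hub"])  # 1.0 for the triangle member, 0.0 for the hub
```

Joining each node’s coefficient to its account_age is then an ordinary merge-and-groupby, which is what the boxplots above summarize.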


Third, here’s the hashtag network where User A (users are blue) is linked to Hashtag B (hashtags are red) if the user mentions the hashtag in a tweet. I’ve intentionally omitted the #Gamergate and #NotYourShield hashtags as one of these would show up in every tweet, so it’s redundant to include them. I’ve also focused only on the giant component, ignoring the thousands of other unconnected hashtags and users in the network. This graph is distinct from the others as it is a bipartite graph containing two types of nodes (hashtags and users) instead of one type of node (only users, in the previous graphs). This graph is also weighted by the number of times the user mentions a hashtag (warmer = more). Some of the noticeable related hashtags are #gamer (top), #fullmcintosh (centerish), and #StopGamergate2014 (bottom right). Interestingly, many of these hashtags appear to be substrings of “gamergate” such as “gamerg”, “gamerga”, “gam”, etc., which is some combination of an artifact of Twitter clients shortening hashtags and improvisation among users to find related backchannels. But a number of anti-GG hashtags are present and connected here, suggesting the discussion isn’t as polarized as the RT graph would suggest. This likely reflects users including hashtags sarcastically, like a pro-GGer including #StopGamergate2014. There are also outwardly unrelated hashtags such as #ferguson, #tcot, and #ebola included.
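A sketch of the bipartite construction, with invented users and a few of the hashtags named above:

```python
import networkx as nx

# Invented user-hashtag incidence; edge weight = how many times the
# user tweeted the tag.
B = nx.Graph()
B.add_nodes_from(["u1", "u2", "u3"], bipartite=0)        # users (blue)
B.add_nodes_from(["#gamer", "#fullmcintosh",
                  "#StopGamergate2014"], bipartite=1)    # hashtags (red)
B.add_weighted_edges_from([
    ("u1", "#gamer", 3), ("u1", "#fullmcintosh", 1),
    ("u2", "#gamer", 1), ("u3", "#StopGamergate2014", 2),
])
assert nx.is_bipartite(B)

# Focus on the giant component, as in the figure, discarding the rest.
giant = max(nx.connected_components(B), key=len)
print(sorted(giant))
```

In this toy graph u3 and #StopGamergate2014 form their own small component, so they fall outside the giant component, just as the thousands of unconnected hashtags and users did in the real network.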



Each of these new networks reveals alternative perspectives on the structure and cohesiveness of Gamergate supporters and opponents. I should have shared all these images from the start, but these latter three required a bit more work to put together over the past few days. My priors about Gamergate being an insular echo chamber are borne out by some of these hairballs and not by others. Each side might divine meaning from these blobs of data to support their case, but in the absence of actual hypothesis testing, statistical modeling, and qualitative coding of data, it’s premature to draw any conclusions. I did some other exploratory data analysis suggesting that features associated with being pro-GG, like highly-clustered networks and authoring many tweets mentioning other users, are associated with potentially harassing behavior, like using newly-created accounts and getting few replies from others.

So where should we go from here? First and most obviously, I hope others are collecting data about how Gamergate has unfolded over a wider window of time than three days and a wider set of hashtags than #gamergate itself. I hope the literature around online political polarization, online mobilization of social movements, and the like is brought to bear on these data. I hope qualitative and quantitative methods are both used to understand how content and structure are interacting to diffuse ideas. I hope researchers are sensitive to the very real ethical issues of collecting data that could be used for targeting and harassment if it fell into the wrong hands. And I hope my 15 minutes of fame as an unexpected B-list celebrity in the pro-GG community doesn’t invite ugly reprisals.

On Starting a New Job

Academics, Data

I am starting a new job in November. This is not a prank like last time. But before the grand reveal of where, first I’ll subject you to a lengthy blog post about my thoughts about the how and why. Hopefully this provides an additional perspective to the excellent posts by Lana Yarosh and Jason Yip on their experiences on the computer/information science academic job market. But those of you who know the rhythms of the academic job market are already realizing that (spoiler alert), I’m not starting a tenure-track faculty role. Instead, I’m going to spend the next few years being a data scientist. But I definitely promise not to be this guy:

This blog post is a mixture of how to get into data science as well as how to leave academia for industry. I want to be clear that this is not my farewell letter to academia, but rather advice to other PhDs—especially in the social sciences—who are considering going into industrial data science. This is the amalgamation of notes I’ve kept, thoughts I’ve restrained myself from tweeting, and lessons from innumerable pep talks from close friends and family who have counseled me through this process. I hope my experience can clarify some of the fuzzy contours of a process that academia leaves you completely unprepared for. But fair warning, this is still a really, really, really long treatise. I’ve tried to make up for that with a liberal application of GIFs.

This story about a boy leaving a plum post-doc at a great lab on good terms for a non-tenure-track data science position is broken into four acts. The first, “The Big Decision”, is about the choice to pursue opportunities outside of the safe confines of academia. The second, “The Search”, is about my experience starting the search outside the Ivory Tower’s ladder. The third, “The Recruitment”, touches on some of the frustrations and anxieties I confronted through the process. And the final act, “The Little Decision”, is about my process of negotiating and choosing an offer.

At the outset, let me admit that I’m writing from a position of relative privilege as a network and computational social scientist who can pass as a “sexy” data scientist rather than, say, a student of literature or biology, who will not be granted the same assumptions about the merit of their interests and the applicability of their training. That said, if you spend 4+ years in graduate school without ever taking classes that demand general programming and/or data analysis skills, I unapologetically believe that your very real illiteracy has held you back from your potential as a scholar and citizen. That’s tough love, but as someone who only started to learn programming via Python in the fourth year of my PhD, I know it can be remedied, often more quickly and easily than you’d believe. The rest of the world thinks this stuff is some arcane dark art, but I guarantee you’ll surprise yourself at how quickly you’ll be reading developer documentation, asking and answering technical questions on StackOverflow, and ultimately able to “pass” while feeling like the imposter that almost every other data scientist dabbling in these magicks feels like too.

Stage 1: The Big Decision

The anchor and the list

Why am I interested in data science if I want to end up in academia? In my case, I am anchored in time and space as my partner still has two years left in her degree program even at the end of my two-year post-doc contract. We both plan on moving when she is done, so it doesn’t make sense for me to find a tenure-track job here. Although graduating up to a soft-money, non-tenure-track “research assistant professor” role was possible to ride out the next two years, I wanted to use this second two-year window to branch out and try something new. In particular, I had never done an internship while in graduate school, nor had I had a “real” job in between undergraduate and graduate school, so I was curious about what life outside the asylum was like. 12 months into my post-doc, I began to scratch the itch of persistent recruiters and to think about what life in an industry research lab, corporate data science group, or start-up setting would be like. And while it is not Bismarck, North Dakota, it is nevertheless the case that the hub of the data science universe is not in Boston, Massachusetts (SF and NYC are). So the requirement to stay here altered the calculus for the kinds of jobs I could consider. These considerations formed the outlines of the pro-con list that I started (but should have actually written down) before the search. I think writing down the pros and cons before starting the search is important, both in terms of documenting what originally motivated you as well as capturing how your thinking evolved. So write a pro-con list before starting.

Skilling up isn’t selling out

Going into this process, the whispers that followed my academic colleagues who had preceded me in going to industry rang loudly in my ears. “He was so talented.” “She had so many best papers.” You can be forgiven if you think these are eulogies for deceased colleagues. Academia has a strange path dependency where venturing off the farm means you can never return. Well not never, but rarely. But my time in grad school and as a post-doc was marked by spending lots of time being on the outside looking in at other people’s cool data. Instead of devising ever-more-clever ways to move already-collected data from someone else’s machine to my own machine, I decided to look to industry as an opportunity to “skill up” and develop technical competencies. I learn best by doing, but there is little reward for “doing” machine learning, map-reduce, or graph databases in the social sciences if they are not in service of a research question. Others have pointed out that going into industry also lets you work on products that are essential infrastructure in the information economy, gives you the flexibility to move between product areas, or provides greater work-life balance. But for me, I very much hope to bring a really interesting battery of tools, skills, and practices back into the academy in a few years’ time.

Stage 2: The Search

Strength of All The Ties

Mark Granovetter’s famous “strength of weak ties” theory was formulated specifically in the context of job searching: strong ties share the same information, but weaker ties provide new information about opportunities. I naively hoped that spamming people I trusted with an out-of-the-blue, informal e-mail with the subject “Looking for job” would lead somewhere. Some of these initial messages received enthusiastic responses within hours, others languished for weeks, and some were likely forgotten. But the point of casting these stones is that you neither know how far their ripples will propagate, nor what surprises they might dislodge. Of course, be judicious in to whom and how you reach out, but also discard the academic job market logic of thinking that the only jobs available are the ones with publicly-posted calls. In fact, spending your time applying through job-listing sites won’t generate the leads you need. Oh, and come to conferences like CSCW and ICWSM where lots of amazing social and computer scientists from academia and industry get together!

Recruiters are important, but they’re not your friend

As I’m sure many others have experienced, I had recruiters contacting me seemingly within minutes of making my first GitHub commit. In addition to spamming colleagues, I also entertained some of the opportunities recruiters sent along. I admit that being a recruiter must be a terribly difficult job of trying to match insatiable demand for unicorns with the fallibility of actual human beings. Recruiters play an important role, not only in exposing you to a wider set of possibilities than your network might offer but also in steeling yourself for the realities of the search ahead. While recruiters are super enthusiastic, they aren’t a new friend trying to hook you up with a job as a favor: they get paid when you take an offer, and they will pressure you to interview repeatedly and take any offer that comes along. This is often hard for academics who have come up through a system that demands deference to others’ agendas under the assumption they have your interests at heart as future advocates. Your days of delayed gratification are behind you. You will need to learn to assert yourself so their interests do not override your own, and to protect your time from distractions like peripherally-related opportunities. These skills, politely practiced on recruiters, will become even more invaluable when you enter the negotiation stage later on.

Growing into the role

By virtue of even being awarded a PhD, you have accomplished something that makes you uniquely expert in the world. You’re going to be hired because your background includes some combination of “harder” technical skills in terms of using tools to perform analyses and “softer” integrative skills in terms of asking the right questions of complex data. But your interests and skills are surely extremely narrow and will need to expand considerably to fit into the nebulous boundaries of the “data scientist” who’s expected to have some combination of hacking, stats, and expertise. If you’re a social scientist like me, you’ve probably never taken machine learning and won’t know what k-means clustering means or why you should prune a decision tree. If you’re a computer scientist and have never taken a statistics course, you may not know how to interpret effects from a regression model or how to design a behavioral experiment.

The intuitions behind these aren’t particularly hard, but you will have to study and practice using them to be conversant in them during interviews. There is no shortage of blog posts on how to break into data science, but I would especially recommend Trey Causey’s. I also strongly recommend Doing Data Science and Data Science for Business as two exemplary introductory books that don’t get lost in the weeds of formalized math or conceptual abstraction. Again, resist the academic urge to understand why these methods work from first principles; focus instead on how you can use them to tackle problems and develop heuristics for the kinds of problems they won’t work on. Part of growing into the data scientist box will involve quickly teaching yourself to do these analyses beyond what the introductory documentation and manuals say. And you should seek out interesting data sets in “pop culture” (think entertainment, sports, geography, etc.), document these analyses in a GitHub repo, participate in Kaggle competitions, or implement a website with some interactive features. After you tweet (surely you already tweet!), go post your analysis to /r/DataIsBeautiful, HackerNews, or DataTau.

Stage 3: The Recruitment

Don’t do coding interviews

I am not a computer scientist or a software engineer, so the experience of the “coding interview” was foreign to me. If you’re not familiar with the exercise: after the initial phone screens with recruiters or managers, you’re put on the phone or in a room with an engineer and asked to program some basic function to “make sure you know how to code” or “see how you think through problems.” You will never actually need to implement a Fibonacci number generator or sorting algorithm in your actual job. But in a coding interview, you will need to demonstrate that you can implement a workable version from scratch within minutes in a closed-book environment while a stranger judges you. This is a terrible approach to recruitment that selects on personality type rather than technical competence. If I had to do it all over again, I would simply refuse to do them, and make that clear ahead of time. If there are concerns about your ability to code, have them do a code review of the analyses you’ve posted to a GitHub repo or Kaggle competition. If they want to see how you think through a data analysis problem, walk through a case study. If some domain expertise needs to be demonstrated, arrange for a “take home” assignment to return within 24 hours. After 4+ years in a PhD program, you’ve earned the privilege of being treated better than the humiliation exercises 20-year-old computer science majors are subjected to for software engineering internships.
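For what it’s worth, the canonical exercise really is this small once you’ve practiced it, which is part of why it’s such a poor signal. A typical closed-book answer looks something like:

```python
def fib(n):
    """Iterative Fibonacci: O(n) time, O(1) space. The kind of
    whiteboard exercise interviews ask for, and the kind of function
    you will almost never write on the job."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

sequence = [fib(i) for i in range(8)]  # -> [0, 1, 1, 2, 3, 5, 8, 13]
```

The point is not that the function is hard; it’s that producing it under surveillance, from memory, measures stage presence more than it measures whether you can analyze data.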

Ignore the hangups of academia

I’ve alluded to this above, but it can’t be reinforced enough: industry is not academia. If you’ve made the choice to go into industry, you need to recalibrate your dystopian comic strip mindset from “PhD” and prepare yourself for “Dilbert”. You’re entering into a new kind of relationship where success is measured by pursuing a very applied research program that will demand flexibility, scalability, and attention to detail. You really need to be fundamentally honest with yourself about this: industry will not be academia by other means. Your manager will provide a different kind of mentorship that is likely both more and less hands-on than you’re used to. Getting things done fast matters more than worrying about novelty. The pace of work revolves around fiscal quarters rather than academic semesters. You will be compensated and promoted for implementing ideas that are “good” because they make or save money. On the flip side, you don’t need to commit to toiling away somewhere for 4+ years. Your work will be used by thousands or millions of people. There shouldn’t be a shortage of extremely motivated people around you with amazing skills and ideas. There will be many fringe benefits that you won’t need “seniority” to take advantage of. You could gross more in your first year as an industrial data scientist than you did in all your years of graduate school combined. You’re right to think many of these differences are deeply unfair, but don’t forgo the privileges that an industry role entitles you to out of deference to academic norms you’ve internalized but that no longer apply. I admit to having many, if not all, of these hangups, but you need to confront and tame them lest they lead you to self-sabotage.

Serenity in the face of chaos

I’m probably not alone in having the tendency to conjure detailed future scenarios well before the prerequisite actions have remotely come to pass: “Wouldn’t it be great to live in X?”, “I can’t wait to work on Y”, “Z is such a brilliant person.” But while you enthusiastically pursue several leads, things well outside of your control will shut them down. Higher-ups might surprise the rest of the group with re-organizations, the politics of hiring decisions might surprise a potential manager, and so on. These aren’t a reflection on your performance in an interview, but the disappointment of having something for which you’re both qualified and recognized nevertheless “taken away” is still very real. So in addition to keeping the fantasies on ice, never stop pursuing other opportunities during the search, no matter how “sure” something feels. You’ll need a backup plan in the worst case and leverage in the best case, so keep other recruiting efforts going even after there are offers on the table.

Stage 4: The Little Decision

Negotiating offers: thar be bigints here

Make sure to take advantage of websites like Glassdoor and PayScale to get a sense of what others at the company or in similar roles earn. Does a median starting salary of $120k seem like an outrageous sum of money when you’d really be happy with just $90k, since all your assistant professor friends at Big State U are making just $60k? It turns out your potential employer would also be happy to pay you less than the market rate! But the starting salary you negotiate becomes the base on which all your raises, bonuses, and future salary negotiations will be built. I made this “mistake”; it is apparently widely acknowledged that you should never disclose your current salary or how much you expect to make. Remember, this is a business negotiation in which they are hiring you to make them a lot of money, some (small) fraction of which they’ll return to you as compensation. If the transactional logic of shaking the most money out of a for-profit corporation that would unflinchingly lay you off in a heartbeat makes you uncomfortable, there are an increasing number of exciting data science opportunities in government and non-profit spaces too. But after an offer has been made, the worst thing an employer can say to your request for what seems like a really big number is “No”. Really, it doesn’t go on your permanent record or anything. Having other offers and using them as leverage in a negotiation is not dishonest, especially if you’re upfront throughout the process about looking at other roles (as you should be). There are lots of other excellent resources out there on negotiating offers; never, ever accept the first one given to you!
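A quick back-of-the-envelope sketch shows why that base matters so much. The numbers here are hypothetical (a flat 3% annual raise over five years), but the compounding logic is the point:

```python
def career_earnings(base, annual_raise=0.03, years=5):
    """Total gross pay over `years`, assuming each year's salary is the
    prior year's salary times (1 + annual_raise). Hypothetical numbers."""
    total, salary = 0.0, base
    for _ in range(years):
        total += salary
        salary *= 1 + annual_raise
    return total

# Accepting $90k instead of negotiating to $120k doesn't cost you $30k once;
# it costs $30k every year, compounded through every subsequent raise.
gap = career_earnings(120_000) - career_earnings(90_000)  # roughly $159k
```

And that gap only widens further when bonuses and future offers are benchmarked against your current salary.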

Equity and other fringe benefits

If you’re going into a start-up environment, issues around equity loom larger than salary, but it’s a complicated game. Remember again: if you’re a data scientist with a PhD, you’re worth more than an entry-level engineer, and you should be asking for something along the lines of what mid- or senior-level engineers get, roughly between 10 and 50 “points” (0.1%–0.5% of equity), though this can vary substantially with the size and stage of the company. But always remember that your stake is likely to get diluted as the company grows, and the employee equity pool is usually last in line to cash out, after the other investors and founders. A fraction of a percent doesn’t sound like a lot, but when $100 million exits aren’t rare, software developers aren’t buying big houses or starting non-profits by dutifully saving up their salaries. You should use sites like AngelList and Wealthfront to get a sense of the going rate for a given industry, role, etc. How you choose to balance the trade-off between more salary and more equity comes down to your tolerance for risk and your faith in the founders’ vision and other investors’ patience. And don’t forget to go back to the pro-con list from the start. If there are other fringe things you would like to keep doing, like attending or submitting to relevant conferences (whether academic-focused like KDD, ICWSM, and CSCW, or industry-focused like Strata, UseR, PyCon), having time to consult for a non-profit, or teaching at a local college, negotiation is the best time to make those expectations clear.
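To see what dilution does to those “points”, here is a deliberately simplified sketch. The 25%-per-round dilution, the number of rounds, and the exit size are all hypothetical illustrations, and real terms (liquidation preferences, vesting schedules, option strike prices, taxes) make the arithmetic much messier:

```python
def exit_payout(stake, exit_usd, dilution_per_round=0.25, rounds=2):
    """Rough value of an employee equity stake at exit, assuming each
    later funding round dilutes existing holders by a fixed fraction.
    Illustrative only -- not how any particular cap table works."""
    diluted = stake * (1 - dilution_per_round) ** rounds
    return exit_usd * diluted

# 0.3% of a $100M exit, diluted through two later rounds:
payout = exit_payout(0.003, 100_000_000)  # before preferences and taxes
```

Under these assumptions the headline 0.3% becomes about 0.17% by exit, which is why the trade-off between salary and equity is ultimately a bet on the founders and investors, not just a multiplication.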

The Reprise of the Pro and Con List

Now you have some offers on the table and the pro-con list you wrote up before starting recruitment. Like mine, your thinking probably changed a lot going through the process. You will have passed through many reality-distortion fields, drunk a good amount of Kool-Aid, and probably seen some sausage-making along the way. This will leave you with cons you hope never to confront again and pros you never dreamed of. Like any good little Bayesian, you should use this new information to update your prior beliefs and come to a better decision. In my case, my priorities started off with wanting privileged access to data, working in an industrial/corporate setting, and developing new analytical skills. After the process and talking with friends and colleagues, I realized that many of these still applied, but I had overlooked how important being able to engage in academic conversations was to me, especially if I wanted to stay competitive on the academic market in the medium term. I can’t stress enough how important it is to have some sort of objective list of criteria, written down or stored in other people’s minds, so that you can ground yourself in these values during a very exciting but disorienting search.

The Grand Reveal

With all of that bluster out of the way, I’m very excited to announce that I will be joining the Harvard Business School as a research associate in November. This is not a tenure-track job, but I will become one of the first data scientists on their HBX platform, their unique MOOC initiative focused on business education. I will be doing a mixture of platform and academic research to understand the factors that contribute to learning and success in these contexts, using both observational and experimental data.

I realize that the bloom is very much off the rose after some very public failures and very justified criticism in the MOOC space. But I also think there are important niches these platforms can fill, even if they can’t and shouldn’t supplant other modes of education. I believe that HBX has identified a really interesting niche and strategy as well as made a big commitment in people and resources, so I’m excited to dig into where and how these approaches are succeeding or faltering. I’ve also been told Harvard employs a number of smart people and has something of a soapbox from which to publicize information. Seriously, I’m thrilled to be working at the intersection of traditional business strategy and education, making data-driven decisions, and collaborating with brilliant HBS faculty like Bharat Anand and HarvardX colleagues like Justin Reich.

“But Brian, you just spent a billion words talking a big game about industrial data science. What gives?” You’re right: I still don’t have any full-time experience working in industrial data science. You’re also right that I’m still in academia. Going back to my pro and con list (which will be different for every person), this role gave me an ideal mix of academia and industry: it is focused on research on learning at scale, but it is work on a product with very real customers and competitors. If I were looking for a longer-term career change, were more willing to relocate, had different skills and interests, or weren’t solving a two-body problem, I would have made very different decisions.

Returning to the question of why I would write up something I’m not actually doing: I wanted to share my perspective on “how I almost went into industry” after four months of interviews with nearly a dozen different companies. Very little in academia prepares you to go on a market like this, but getting social scientists into data science roles is vital to ensuring the right questions are asked and the best inferences are made from many types of data. The recruitment process will be upsetting and disorienting, and the episodes above may or may not resonate with those actually in industry; I hope others will share their stories. But I especially wanted to reach those of you in academia who are considering making the jump: you’re not alone, and you should go for it.

Thanks for making it this far and feel free to get in touch if you have any questions. And many thanks to Alan, Lauren, Michael, Patti, Trey, and Ricarose for super valuable feedback on earlier versions of this post!

Peripherality, mental health, and Hollywood

Data, Wikipedia

I promised to do a bigger tear-down of Wikipedia’s coverage of current events like Robin Williams’ death and the protests in Ferguson, Missouri this week, but I wanted to share a quick result based on some tool-development work I’m doing with the Social Media Research Foundation‘s Marc Smith. We’re developing the next version of WikiImporter to allow NodeXL users to import the many types of networks found in MediaWikis [see our paper].

On Wednesday, we scraped the 1.5-step ego network around the Robin Williams article: the articles his article currently links to, and whether those articles also link to each other. For example, his article links to the Wikipedia articles “Genie (Aladdin)” and “Aladdin (1992 Disney film)”, reflecting one of his most celebrated movie roles. These two articles in turn link to each other because they are closely related.

However, other articles are linked from Williams’s article but do not link to each other. The article “Afghanistan” (where he performed with the USO for troops stationed there) and the article “Al Pacino” (with whom he co-starred in the 2002 movie, Insomnia) are linked from his article but these articles do not link to each other themselves: Al Pacino’s article never mentions Afghanistan and Afghanistan’s article never mentions Al Pacino. In other words, the extent to which Wikipedia articles link to each other provides a coarse measure of how closely related two topics are.

The links between the 276 articles that compose Williams’s hyperlinked article neighborhood show a lot of variability. Some groups, such as those around movies and actors, are densely linked, while articles about the cities he lived in are relatively isolated from the other linked articles. These nodes can be partitioned into groups using a number of bottom-up “community detection” algorithms, where a group is roughly defined as having more ties inside the group than outside of it. We can visualize the resulting graph by breaking the communities apart into sub-hairballs to reveal the extent to which these sub-communities link to each other.
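As a concrete illustration of that definition (this is not the algorithm that produced the figure; NodeXL’s community detection is more sophisticated), here is a tiny sketch that counts, for a hand-made partition of a toy graph, how many edges fall inside each group versus leave it:

```python
def internal_external(edges, groups):
    """For each group, count edges inside the group vs. edges leaving it.
    A 'community' should have more internal than external ties."""
    counts = {g: [0, 0] for g in set(groups.values())}
    for a, b in edges:
        if groups[a] == groups[b]:
            counts[groups[a]][0] += 1   # internal edge
        else:
            counts[groups[a]][1] += 1   # external edge, counted once
            counts[groups[b]][1] += 1   # for each endpoint's group
    return counts

# Toy graph: two triangles joined by a single bridge edge c-d
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
groups = {n: 1 for n in "abc"}
groups.update({n: 2 for n in "def"})
counts = internal_external(edges, groups)  # each triangle: 3 inside, 1 out
```

Each triangle has three internal ties against one external bridge tie, so both qualify as communities under the rough definition above; the real algorithms search for exactly this kind of partition automatically.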


The communities reveal clusters of related topics about various roles, celebrity media coverage, and biographical details about places he’s lived and hobbies he enjoyed. But buried inside the primary community surrounding the “Robin Williams” article are articles like “cocaine dependence“, “depression (mood)“, and “suicide“. While these articles are linked among themselves, reflecting their similarity to each other, they are scarcely linked to any other topics in the network.

To me, this reveals something profound about the way we collectively think about celebrities and mental health. Among all 276 articles and 1,399 connections in this hyperlink network about prominent entertainers, performances in movies and television shows, and related topics, there are only 4 links to cocaine dependence, 5 links to depression, and 13 to suicide. In a very real way, our knowledge about mental health issues is nearly isolated from the entire world of celebrity surrounding Robin Williams. These problems are so peripheral, they are effectively invisible to the ways we talk about dozens of actors and their accomplishments.

In an alternative world in which mental health issues and celebrity weren’t treated as secrets to be hidden, I suspect issues of substance abuse, depression, and other mental health issues would move in from the periphery and become more central as these topics connect to other actors’ biographies as well as being prominently featured in movies themselves.

What network scientists do when they’re bored


Or maybe it’s just me when I’m bored, I haven’t blogged in a while, and the wife is working late. Inspired by this tweet about the distribution of colors in the L.L. Bean Home Fall 2014 catalog, I took it upon myself to analyze the network of color-product relationships.

A color is linked to another color if you can buy a product in both colors. For example, because 240-Thread Count Cotton Sateen Bedding is available in both “Lakeside” and “Pale Moss”, these two colors are linked in the network.
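Mechanically, this is a one-mode projection of a product-color table onto a color-color network. A minimal sketch with made-up catalog entries (only the Lakeside/Pale Moss pairing comes from the actual catalog):

```python
from itertools import combinations

# product -> colors it is available in (hypothetical entries)
products = {
    "sateen bedding": {"Lakeside", "Pale Moss", "White"},
    "flannel sheets": {"Lakeside", "Cream"},
    "wool blanket":   {"Pale Moss", "Cream"},
}

# Project onto a color-color network: two colors are linked
# whenever at least one product comes in both of them.
edges = set()
for colors in products.values():
    for a, b in combinations(sorted(colors), 2):
        edges.add((a, b))
```

The graph file linked below is the large connected component of exactly this kind of projection, built from the full catalog.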

You can find a copy of the graph file here.



The Beneficence of Mobs: A Facebook Apologia

Academics, Data

Last week, the Proceedings of the National Academy of Sciences (PNAS) published a study that conducted a large-scale experiment on Facebook. The authors of the study included an industry researcher from Facebook as well as academics at the University of California, San Francisco and Cornell University. The study employed an experimental design that reduced the amount of positive or negative emotional content in 689,000 Facebook users’ news feeds to test whether emotions are contagious. The study has since spawned a substantial controversy about the methods used, the extent of its regulation by academic institutions’ review boards, the nature of participants’ informed consent, the ethics of the research design itself, and the need for more explicit opt-in procedures.

In the face of a gathering mob, I want to offer a more even-tempered defense of the execution and implications of this study. Others have made similar arguments [1,2,3]; I guess I’m just a slow blogger. At the outset, I want to declare that I have no direct stake in the outcome of this brouhaha. However, I do have professional and personal relationships with several members of the Facebook Data Science team (none of whom are authors on the study), although the entirety of this post reflects only public information and my opinions alone.

First, as is common in initial reporting on scientific findings, some misinformation around the study was greatly magnified. Early criticisms claimed the authors misrepresented the size of the observed effects (they didn’t) or that the research wasn’t reviewed by the academic boards charged with human subjects protection (it was). There is likewise a pernicious tendency for the scientific concept of experimental manipulation to be conflated with the everyday sense of the word implying deception and chicanery: there is nothing inherently malicious in randomly assigning participants to conditions for experimental study. Other reporting on the story sensationalistically implied users were subjected to injections of negative emotional content so their resulting depression could be more fully quantified. In reality, the study only withheld either positive or negative content from users’ feeds, which resulted in users seeing more of the posts they would have seen anyway. In all of these cases, the hysteria of a “Facebook manipulates your emotions” or “Facebook is transmitting anger” story got well ahead of any sober reading of the research reported by the authors in the paper.

Second, on the substance of the research, there are still serious questions about the validity of the methodological tools used, the interpretation of results, and the use of inappropriate constructs. Prestigious and competitive peer-reviewed journals like PNAS are not immune from publishing studies with half-baked analyses. Pre-publication peer review (which this study went through) is important as a check against faulty or improper claims, but post-publication scrutiny from the scientific community, and ideally replication, is an essential part of scientific research. Publishing in PNAS implies the authors were seeking both a wider audience and a heightened level of scrutiny compared with a less prominent outlet. To be clear: this study is not without its flaws, but these debates, in and of themselves, should not be taken as evidence that the study is irreconcilably flawed. If the bar for publication were anticipating every potential objection or addressing every methodological limitation, there would be precious little scholarship for us to discuss. Debates about the constructs, methods, results, and interpretations of a study are crucial for synthesizing research across disciplines and increasing the quality of subsequent research.

Third, I want to move to the issue of epistemology and framing. There is a profound disconnect between how we talk about the ways of knowing how systems like Facebook work and the ways of knowing how people behave. As users, we expect these systems to be responsive, efficient, and useful, so companies employ thousands of engineers, product managers, and usability experts to create seamless experiences. These user experiences require diverse and iterative methods, including A/B testing to compare users’ preferences for one design over another based on how they behave. These tests are pervasive, active, and ongoing across every conceivable online and offline environment, from couponing to product recommendations. Creating experiences that are “pleasing”, “intuitive”, “exciting”, “overwhelming”, or “surprising” reflects the fundamentally psychological nature of this work: every A/B test is a psych experiment.

Somewhere deep in the fine print of every loyalty card’s terms of service or online account’s privacy policy is some language in which you consent to having your data used for “troubleshooting, data analysis, testing, research,” which is to say, you and your data can be subject to scientific observation and experimentation. Whether this consent is “informed,” in the sense of the participant having a conscious understanding of its implications and consequences, is a very different question that I suspect few companies are prepared to defend. But why does a framing of “scientific research” seem so much more problematic than contributing to “user experience”? How is publishing the results of one A/B test worse than knowing nothing of the thousands of invisible tests? They reflect the same substantive ways of knowing “what works” through the same well-worn scientific methods.
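For readers who haven’t seen one, the statistical core of a basic A/B test is small. A sketch with hypothetical click counts, using a pooled two-proportion z-test (real analyses add power calculations, sequential-testing corrections, and judgments about practical significance):

```python
from math import sqrt

def ab_test(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test: how many standard errors apart are the
    two observed rates, under the pooled null of no difference?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p = (successes_a + successes_b) / (n_a + n_b)    # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))     # pooled std. error
    return (p_b - p_a) / se

# Hypothetical: variant A gets 200/2000 clicks (10.0%), B gets 260/2000 (13.0%)
z = ab_test(200, 2000, 260, 2000)
# |z| > 1.96 -> the variants differ at the conventional 5% level
```

Whether the write-up calls the outcome “engagement” or “emotional response”, the machinery of randomization and comparison is the same; only the framing changes.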

Fourth, there has been no substantive discussion of what the design of informed consent should look like in this context. Is it a blanket opt-in/opt-out for all experimentation? Is consent needed for every single A/B iteration, or only for those intended for scientific research? Is this choice buried alongside all the other complex privacy buttons, or are users expected to manage pop-ups requesting their participation? I suspect the omnipresent security dialogues that Windows and OS X have adopted to warn us against installing software have done little to reduce risky behavior. Does adding another layer of complexity around informed consent ease the current anxieties around managing complex privacy settings? How would users differentiate official requests for informed consent from abusive apps, spammers, and spoofers? Who should be charged with enforcing these rules, and who are they in turn accountable to? There has been precious little work on designing informed consent architectures that balance usability, platform affordances, and the needs of researchers.

Furthermore, we might also consider the ethics of this nascent socio-technical NIMBYism. Researchers at Penn State have looked at the design of privacy authorization dialogues for social networks and found that more fine-grained control over disclosure reduced adoption levels. We demand ever more responsive and powerful systems while circumscribing our own contributions and demanding benefits from others’ contributions. I imagine the life of such systems would be poor, nasty, brutish, and short. Do more obtrusive interventions or incomplete data collection in the name of conservative interpretations of informed consent promote better science and other public goods? What are the specific harms we should strive to limit in these systems, and how might we re-tailor 40-year-old policies to these ends?

I want to wrap up by shifting the focus of this conversation from debates about a study that was already done to what should be done going forward. Some of the more extreme calls I’ve seen have advocated for academic societies or institutions to investigate and discipline the authors; others have called for embargoing studies using Facebook data from scholarly publication; still others have encouraged Facebook employees to quit in protest of a single study. All this manning of barricades strikes me as a grave over-reaction that could have calamitously chilling effects on several dimensions. If our overriding social goal is to minimize real or potential harm to participants, what best accomplishes this going forward?

Certainly, expelling Facebook from the “community of scholars” might damage its ability to recruit researchers. But are Facebook users really made safer by replacing its current crop of data scientists, who have superlative social science credentials, with engineers, marketers, and product managers trying to ride methodological bulls they don’t understand? Does Facebook have greater outside institutional accountability if it closes down academic collaborations and shuts papers out of peer review and publication? Are we better able to know the potential influence Facebook wields over our emotions, relationships, and politics by discouraging it from publicly disclosing the tools it has developed? Is raising online mobs to attack industry researchers conducive to starting dialogues that improve their processes for informed consent? Is publicly undermining other scientists the right strategy for promoting evidence-based policy-making in an increasingly hostile political climate?

Needless to say, this episode speaks to the need for rapprochement and sustained engagement between industry and academic researchers. If you care about research ethics, informed consent, and well-designed research, you want companies like Facebook deeply embedded within, and responsible to, the broader research community. You want the values of social scientists to influence the practice of data science, engineering, user experience, and marketing teams. You want the campus to be open to visiting academic researchers to explore, collaborate, and replicate. You want industry research to be held to academia’s more stringent standards of human subjects protection and regularly shared through peer-reviewed publication.

The Facebook emotional contagion study demands a re-evaluation of prevailing research ethics, design values, and algorithmic powers in massive networked architectures. But the current reaction to this study can only chill that debate by removing a unique form of responsible disclosure through academic collaboration and publishing. This study is guaranteed to serve as an important case study in the professionalization of data science. But academic researchers should make sure their reactions do not unintentionally inoculate industry against the values and perspectives of social inquiry.



Data-Driven Dreams

Academics, Data

What makes misinformation spread? It’s a topic of vital importance, with empirical scholarship going back to the 1940s on how wartime rumors spread. Rumors, gossip, and misinformation are pernicious for many reasons, but because they can reflect deeply held desires or seem reasonably plausible, they are hard to stay ahead of or rebut. I have an interest in the spread of misinformation in social media and have published some preliminary research on the topic. So it was fascinating to witness misinformation spread like wildfire through my own academic community, in a way that speaks to our data-driven anxieties and dreams.

What we wish to be true

Scientific American published a brief article dated June 1 (but released a week beforehand) titled “Twitter to Release All Tweets to Scientists.” The article claims “[Twitter] will make all its tweets, dating back to 2006, freely available to researchers… everything is up for grabs.” (emphasis added) This claim appears to refer to the Twitter Data Grants initiative announced on February 5th, in which Twitter said it would “give a handful of research institutions access to our public and historical data.” (emphasis added) On April 17, Twitter announced it had received more than 1,300 proposals from more than 60 different countries, but selected only 6 institutions “to receive free datasets.” There have been no subsequent announcements of another Twitter-sponsored data sharing initiative, and the Scientific American article refers to the February announcement by Twitter. The semantics of the article’s title and central claim are not technically false, as a corpus of tweets was made available gratis (free as in beer) to a selected set of scientists.

Following in the tradition of popular fact-checking websites such as PolitiFact, I rate this claim MOSTLY FALSE. It is not the case that Twitter has made its historical corpus of tweets available to every qualified researcher. This was the interpretation I suspect many of my friends, collaborators, and colleagues were making. But collectively wishing something to be true doesn’t make it so. In this case, the decision about who does and does not get access to their data was already made on April 17, and Twitter (to my knowledge) has made no public announcements about new initiatives to further open its data to researchers. Nothing appears to have changed in their policies around accessing and storing data through their APIs or purchasing data from authorized resellers such as gnip. There’s no more Twitter data available to the median scientific researcher now than there was a day, a month, or a year ago.

What we wish could be changed

Indeed, by selecting only 6 proposals out of approximately 1,300 submissions, this process had an acceptance rate of less than 0.5%: lower than most NSF and NIH grant programs (10-20%), lower than the 5.9% of applicants accepted to Harvard’s Class of 2018, lower than Wal-Mart’s 2.6% acceptance rate for its D.C. store, but about average for Google’s hiring. There is obviously clear and widespread demand from scientific researchers to use Twitter data across a broad variety of topics, many of which are reflected in the selected teams’ interests in public health, disaster response, entertainment, and social psychology. But we shouldn’t cheer 99.5% of interested researchers being turned away from ostensibly public data as a victory for “open science” or data accessibility.

A major part of the attraction is that Twitter data is like an illusory oasis in the desert, where the succor always seems just beyond the next setback. The data contains the content and social network structure of communication exchanges for millions of users spread across the globe, with fine-grained metadata capturing information about audience (followers), impact (retweets/favorites), timestamps, and sometimes locations. Of course researchers from public health, to disaster response, to social psychology want to get their hands on it. With my collaborators at Northeastern University and other institutions, I put in two separate Twitter data grant proposals to study misinformation propagation as well as dynamics around the Boston Marathon bombings. We felt these were worthy topics of research (as have NSF funding panels), but they were unfortunately not selected. So you’re welcome to chalk this post up to sour grapes, if you’d like.

Twitter’s original call for proposals notes that “it has been challenging for researchers outside the company… to access our public, historical data.” This is the methodological understatement of the decade for my academic collaborators, who must either write grants to afford the thousands of dollars in fees resellers charge for this data, build complex computing infrastructures to ingest and query the streams of public data themselves, or adopt strategies that come dangerously close to DDoS attacks to get around Twitter’s API rate limits. All of this is simply to get the raw material of tweets, the input to still other processes of cleanup, parsing, feature extraction, and statistical modeling. Researchers unfamiliar with JSON, NoSQL, or HDFS must partner with computer and information scientists who go, at considerable peril to their own methods, interests, and infrastructures, into these data mines.

What should be changed

I said before this data is “ostensibly public”, but Twitter has very good reasons to build the walls around this garden ever higher. Let me be clear: Twitter deserves to be lauded for launching such a program in the first place. Hosting, managing, querying, and structuring these data require expensive technical infrastructures (both the physical boxes and the specialized code) as well as expensive professionals to develop, maintain, and use them. Twitter is a private, for-profit company and its users grant Twitter a license to use, store, and re-sell the things they post to it. So it’s well within its rights to charge or restrict access to this data. Indeed there are foundational issues of privacy and informed consent that should probably discourage us from making tweets freely available as there are inevitably quacks hiding among the quantoids like myself submitting proposals.

It’s also important to reflect on issues that Kate Crawford and danah boyd have raised about inequalities and divides in access to data. Twitter did an admirable job in selecting research institutions that are not solely composed of American research universities with super-elite computer science programs. But it doesn’t alter the structural arrangements in which the social, economic, and political interests of those who (borrowing from Lev Manovich) create these data are distinct from the interests of those who collect, analyze, and now share it with other researchers. The latter group obviously sets the terms of who, what, where, when, why, and how this data is used. Twitter would have good reasons to not provide data grants to researchers interested in criticizing Silicon Valley tech culture or identifying political dissenters, yet this model nevertheless privileges Twitter’s interests above others’ for both bad and good reasons.

Some models for big open data

So what might we do going forward? As I said, I think Twitter and other social media companies should be lauded for investing resources into academic research. While it makes for good press and greases the recruitment pipeline, it can still involve non-trivial costs and headaches that can be risky for managers and make investors unhappy. I think there are a few models that Twitter and similar organizations might consider going forward to better engage with the academic community.

First, on the issue of privacy, social media companies have an obvious responsibility to ensure the privacy of their users and to prevent the misuse of their data. This suggests a “visiting researcher” model in which researchers could conduct their research under strict technical and ethical supervision while having privileged access to both public and private metadata. This is a model akin to what the U.S. Census uses as well as what Facebook has been adopting to deal with the very sensitive data they collect.

Second, organizations could create outward facing “data liaisons” who provide a formal interface between academic communities’ interests and internal data and product teams. These liaisons might be some blend of community management, customer service, and ombudsperson who mobilize, respond to, and advocate for the interests of academics. The Wikimedia Foundation is an exemplar of this model as it has staffers and contractors who liaise with the community as well as analytics staff who assist researchers by building semi-public toolservers.

Third, organizations could publish “data dumps” in archives on a rolling basis. These might be large-scale datasets that are useful for basic tasks (e.g., userid to username key-value pairs), anonymized data that could be useful in a particular domain (e.g., evolution of the follower graph), or archives of historical data (e.g., data from 5 years ago). The Wikimedia Foundation and StackOverflow both provide up-to-date data dumps that have been invaluable for academic research by reducing the overhead for researchers to scrape these data together by themselves.

Finally, on the issue of academic autonomy and conflicts of interest, social media companies could adopt a “data escrow” model. Twitter would provide the data to a panel of expert reviewers who could then in turn award it to worthy research teams after peer review. This privileges the academic autonomy and peer review that have underpinned norms of teaching, publishing, funding, and promotion for decades and would prevent Twitter’s conflicts of interest from biasing legitimate academic research interests. Twitter could rotate between different disciplinary themes such as public health or incentivize interdisciplinary themes like crisis informatics. I haven’t seen a model of this in action before, but let me know of any in the comments or… via Twitter :)

And stop linking to that goddamned Scientific American article and link to this one instead!

Crime, Time, and Weather in Chicago


Can we attribute a fall in violent crime in Chicago to its new conceal and carry laws? This was the argument many conservative blogs and news outlets were making in early April after statistics were released showing a marked drop in the murder rate in Chicago. These articles attributed “Chicago’s first-quarter murder total [hitting] its lowest number since 1958” to the deterrent effects of Chicago’s conceal and carry permits being issued in late February (RedState). Other conservative outlets latched onto the news to argue the policy “is partly responsible for Chicago’s across-the-board drop in the crime” (TheBlaze) or that the policy contributed to the “murder rate promptly [falling] to 1958 levels” (TownHall).

Several articles hedged about the causal direction of any relationship and pointed out that this change is hard to separate from falling general crime rates as well as the atrocious winter weather this season (PJMedia, Wonkette, HuffPo). The central claim at issue is whether the adoption of the conceal and carry policy in March 2014 contributed to significant changes in crime rates, rather than other social, historical, or environmental factors.

However, an April 7 feature story by David Bernstein and Noah Isackson in Chicago magazine found substantial evidence of violent crimes like homicides, robberies, burglaries, and assaults being reclassified, downgraded to more minor crimes, and even closed as noncriminal incidents. They argue that since Police Superintendent Garry McCarthy arrived in May 2011, crime has improbably plummeted in spite of high unemployment and a significant contraction in the Chicago Police Department’s beat cops. An audit by Chicago’s inspector general into these crime numbers suggests assaults and batteries may have been underreported by more than 24%. This raises a second question: can we attribute the fall in violent crime in Chicago to systematic underreporting of criminal statistics?

In this post, I do four things:

  • First, I demonstrate the relationship crime has with environmental factors like temperature as well as temporal factors like the hour of the day and the day of the week. I use a common technique in signal processing to identify that criminal activity follows not only an annual pattern but also a weekly one.

Homicides by time of day and year

Temperature and crime

  • Second, I estimate a simple statistical model based on the findings above. This model combines temperature, the day of the week, the week of the year, and longer-term historical trends and, despite its simplicity (relative to more advanced types of time series models that could be estimated), does a very good job explaining the dynamics of crime in Chicago over the past 13 years.

Statistical model

  • Third, I use this statistical model to make predictions about crime rates for the rest of 2014. If there’s a significant fall-off in violent crime following the introduction of the conceal and carry policy in March 2014, this could be evidence of its success as a deterrent (or that this is a bad model). But if the actual crime data matches the model’s forecasted trends, it suggests the new conceal and carry policy has had no effect. There are no findings here as yet, but I expect as the data comes in there will be no significant changes after March 2014.

Homicide predictions, 2014

  • Fourth, I find evidence of substantial discrepancies in the reporting of some crime data since 2013. This obviously imperils the findings of the analyses done above, but also replicates the findings reported by Bernstein and Isackson. The statistical model above expected that property crimes such as arson, burglary, theft, and robbery should follow a particular pattern, which the observed data significantly deviates from after 2013. I perform some additional analyses to uncover which crimes and reporting districts are driving this discrepancy as well as how severe this discrepancy is.

Crime rate changes by CPD district

Deviation in reported statistics from model predictions
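As a sketch of the signal-processing step mentioned in the first point above, the following uses a discrete Fourier transform to recover periodicities from a daily count series. The data here is synthetic, a stand-in for the real Chicago series, with weekly and annual cycles baked in:

```python
import numpy as np

# Synthetic daily crime counts with weekly and annual cycles (an illustrative
# stand-in for the Chicago data, which comes from the city's data portal).
rng = np.random.default_rng(42)
days = np.arange(365 * 4)
counts = (100
          + 15 * np.sin(2 * np.pi * days / 365.25)  # summer peak
          + 5 * np.sin(2 * np.pi * days / 7)        # weekly rhythm
          + rng.normal(0, 2, len(days)))

# Peaks in the power spectrum reveal the dominant periodicities.
power = np.abs(np.fft.rfft(counts - counts.mean())) ** 2
freqs = np.fft.rfftfreq(len(days), d=1.0)  # cycles per day

# The two strongest frequencies should sit near 1/7 and 1/365.
top = freqs[np.argsort(power)[-2:]]
periods = sorted(1 / top)
print([round(p, 1) for p in periods])  # [7.0, 365.0]
```

The same spectrum computed on real counts of homicides or thefts would show exactly these two spikes, which is what motivates the day-of-week and week-of-year terms in the model.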

The repo containing the code is available on GitHub.

The complete notebook containing these and related analyses can be found here.

A network analysis of Bob Ross’s paintings


While it makes me look like I’m stalking him, Walt Hickey is doing some of the more fun data analysis on the new FiveThirtyEight. His recent April 13 article examined features from the entire canon of Bob Ross paintings and he happily made this data available within days. The data is stored as a rectangular matrix with episodes or pieces as rows and specific features as columns. A piece like “Mountain-By-The-Sea” from Season 9, Episode 12 is coded as containing 10 features: “beach”, “bushes”, “clouds”, “cumulus”, “deciduous”, “mountain”, “ocean”, “snowy mountain”, “tree”, and “waves”.

To a network scientist like me, this is a common way of storing relationships. We typically imagine social networks as people connected to other people via friendship links, but we can also create many other kinds of networks like people connected to employers or the ingredients contained within recipes. These latter two networks are called “bipartite networks” because they involve two distinct types of nodes (people and employers, ingredients and recipes). We can create a bipartite network of paintings from Hickey’s data where a painting is connected to its features and these features are in turn shared with some other paintings. In the figure below, we can see that some features like trees are highly central (bigger text meaning more paintings having that feature). Conversely, features like “beach”, “ocean”, or “waves” are more peripheral on the upper left.

Bipartite network of features and paintings

For example, while the “Mountain-By-The-Sea” is described by the 10 features mentioned above, other paintings share some of these same features. There are 26 other paintings, such as “Surf’s Up” and “High Tide”, that also contain “beach” features, and these beach-related paintings also have other features that they share in common with still other paintings. Thus “Mountain-By-The-Sea” is indirectly connected to “Mountain In and Oval” via “bushes”.

Performing a mathematical operation (a one-mode projection) to convert the bipartite network into a more traditional network, we can connect these paintings directly to each other if they share a feature in common. In addition to having “bushes” in common, “Mountain-By-The-Sea” and “Mountain In and Oval” also have “clouds”, “cumulus”, “deciduous”, “mountain”, and “tree” in common. The projection assigns the relationship between “Mountain-By-The-Sea” and “Mountain In and Oval” a score of 6 because they share 6 features in common.
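That projection can be sketched in a few lines of plain Python. The feature set for “Mountain-By-The-Sea” is the ten features listed earlier; the set for “Mountain In and Oval” is partly hypothetical, but it includes the six shared features named in the text:

```python
# Feature sets for two paintings. "Mountain In and Oval" is reconstructed
# only partially here; the six shared features are the ones named above.
paintings = {
    "Mountain-By-The-Sea": {"beach", "bushes", "clouds", "cumulus",
                            "deciduous", "mountain", "ocean",
                            "snowy mountain", "tree", "waves"},
    "Mountain In and Oval": {"bushes", "clouds", "cumulus", "deciduous",
                             "mountain", "tree", "cabin"},
}

def project(bipartite):
    """One-mode projection: edge weight = number of shared features."""
    names = list(bipartite)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = bipartite[a] & bipartite[b]
            if shared:
                edges[(a, b)] = len(shared)
    return edges

edges = project(paintings)
print(edges[("Mountain-By-The-Sea", "Mountain In and Oval")])  # 6
```

Libraries like NetworkX provide the same operation for full bipartite graphs, but the logic is exactly this: count the neighbors two nodes have in common.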

While Ross was a prolific painter, we well know that some features like trees occur more often than others. As a result, almost every painting is connected to every other painting by at least one feature, which isn’t terribly interesting to visualize as it would be a big “hairball.” Instead, we can take advantage of the fact that paintings have different levels of similarity to each other. We’ll use the SBV backbone algorithm developed by some of my network science colleagues at Northeastern University to extract only those relationships between paintings that are particularly interesting and then visualize the resulting network.
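For the curious, the SBV backbone is, to my understanding, the Serrano-Boguñá-Vespignani disparity filter. A rough sketch follows; implementations differ in details such as the treatment of degree-1 nodes, so read this as the idea rather than the exact code used here:

```python
def disparity_filter(edges, alpha=0.05):
    """Keep edges significant for at least one endpoint, in the spirit of the
    Serrano-Boguñá-Vespignani disparity filter.

    `edges` maps (u, v) tuples to positive weights.
    """
    strength, degree = {}, {}
    for (u, v), w in edges.items():
        for node in (u, v):
            strength[node] = strength.get(node, 0) + w
            degree[node] = degree.get(node, 0) + 1

    def significance(node, w):
        k = degree[node]
        if k <= 1:
            return 1.0  # a single edge gives no evidence either way
        p = w / strength[node]
        return (1 - p) ** (k - 1)  # p-value under a uniform null model

    return {e: w for e, w in edges.items()
            if significance(e[0], w) < alpha or significance(e[1], w) < alpha}

# A hub with one dominant tie and ten weak ties: only the dominant survives.
edges = {("hub", "a"): 100, **{("hub", f"n{i}"): 1 for i in range(10)}}
backbone = disparity_filter(edges, alpha=0.05)
print(sorted(backbone))  # [('hub', 'a')]
```

The key design choice is that significance is judged locally, relative to each node's own weight distribution, so a weak-but-unusual tie at a small node can survive while a merely average tie at a hub is pruned.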

Using an excellent network visualization tool called Gephi, I laid out the network and colored it below. Again, a painting (represented as a label below) is connected to another painting if they both share features in common. The SBV algorithm removed some links that would otherwise be present, but we can interpret the remaining links as the “most important” ones. I then used Gephi’s modularity function to discover “subcommunities” within this network that are highly similar to each other. These communities are represented by the six colors below.

  • Red These are ocean scenes which are not unusual in Ross’s repertoire — there are many paintings in here — but they share few features with other paintings in his oeuvre. Note that “Mountain by the Sea” is one of the three paintings that connect this group back to the rest.
  • Yellow These are the “waterfall” paintings.
  • Green These are the “mountain” paintings.
  • Purple These are the “snow” paintings.
  • Light blue These are the “homes” paintings.
  • Dark blue Something of a catchall.

Projected painting-to-painting network

We can also compute the “entropy” of all the features Ross used in a given season. Entropy is a way of measuring how disordered a system is; in our case, how evenly the features are distributed across a season. If Ross had used a single feature over and over, the entropy would be 0; if he had used many different features in similar amounts, the entropy would be high. Basically, entropy is another way of measuring how experimental or diverse Ross’s work was in a given season by how many different features he used.
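A minimal sketch of the computation, using made-up feature lists rather than the real season data:

```python
import math
from collections import Counter

def season_entropy(feature_appearances):
    """Shannon entropy (in bits) of how often each feature appears in a season."""
    counts = Counter(feature_appearances)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# One feature used over and over: no diversity, zero entropy.
print(season_entropy(["tree"] * 13))  # 0.0

# Four features used equally often: maximal diversity, log2(4) = 2 bits.
print(season_entropy(["tree", "cloud", "mountain", "lake"] * 13))  # 2.0
```

Applied season by season, this single number is what the entropy-over-time chart below tracks.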

The first two seasons have very low entropy, suggesting a very conservative approach. Seasons 3 and 4 see a drastic departure in style and much higher entropy (more experimentation) before falling again in Seasons 5 through 7 to a more conservative set of features again. Season 8 shows another surge in experimentation — different features being used in different amounts during the season. Season 13 shows some retrenchment back to more conservative habits followed again by a more experimental season. But in general, there’s a pattern of low entropy in the early years, a middle period of high levels of experimentation, followed in the later years by a consistent but moderate level of differentiation.

Entropy over time

The full code for my analysis of Hickey’s data is on GitHub.

The Need for Openness in Data Journalism


Do films that pass the Bechdel Test make more money for their producers? I’ve replicated Walt Hickey’s recent article in FiveThirtyEight to find out. My results confirm his own in part, but also find notable differences that point to the need for clarification at a minimum. While I am far from the first to make this argument, this case is illustrative of a larger need for journalism and other data-driven enterprises to borrow from hard-won scientific practices of sharing data and code as well as supporting the review and revision of findings. This admittedly lengthy post is not only a critique of this particular case but also an attempt to work through what open data journalism could look like.

The Angle: Data Journalism should emulate the openness of science

New data-driven journalism ventures such as FiveThirtyEight have faced criticism from many quarters, particularly around the naïveté of assuming credentialed experts can be bowled over by quantitative analysis as easily as the terrifyingly innumerate pundits who infest our political media [1,2,3,4]. While I find these critiques persuasive, I depart from them here to argue instead that this “new” brand of data journalism is disappointing foremost because it wants to perform science without abiding by scientific norms.

The questions of demarcating what is or is not science are fraught, so let’s instead label my gripe a “failure to be open.” By openness, I don’t mean users commenting on articles or publishing whistleblowers’ documents. I mean “openness” more in the sense of “open source software” where the code is made freely available to everyone to inspect, copy, modify, and redistribute. But the principles of open-source software trace their roots more directly back to norms in the scientific community that Robert Merton identified, which came to be known as the “CUDOS” norms. It’s worth reviewing two of these norms because Punk-ass Data Journalism is very much on the lawn of Old Man Science and therein lie possibilities for exciting adventures.

The first and last elements of Merton’s “CUDOS” norms merit special attention for our discussion of openness. Communalism is the norm that scientific results are shared and become part of a commons that others can build upon — this is the bit about “standing upon the shoulders of giants.” Skepticism is the norm that claims must be subject to organized scrutiny by the community — which typically manifests as peer review. Both of these norms strongly motivated philosophies in the open source movement, and while they are practiced imperfectly in my experience within the social and information sciences (see my colleagues’ recent work on the “Parable of Google Flu“), I nevertheless think data journalists should strive to make them their own practice as well.

  1. Data journalists should be open in making their data and analysis available to all comers. This flies in the face of traditions and professional anxieties surrounding autonomy, scooping, and the protection of sources. But how can claims be evaluated as true unless they can be inspected? If I ask a data journalist for her data or code, is she bound by the same norms as a scientist to share it? Where and how should journalists share and document these code and data?
  2. Data journalists should be open in soliciting and publishing feedback. Sure, journalists are used to clearing their story with an editor, but have they solicited an expert’s evaluation of their claims? How willing are they to publish critiques of, commentary on, or revisions to their findings? If not, what are the venues for these discussions? How should a reporter or editor manage such a system?

The Guardian’s DataBlog and ProPublica have each been doing exemplary work in posting their datasets, code, and other tools for several years. Other organizations like the Sunlight Foundation develop outstanding tools to aid reporters and activists, the Knight Foundation has been funding exciting projects around journalism innovation for years, and the Data Journalism Handbook reviews other excellent cases as well. My former colleague, Professor Richard Gordon at Medill, reminded me that ideas around “computer assisted reporting” have been in circulation in the outer orbits of journalism for decades. For example, Philip Meyer has been (what we would now call) evangelizing since the 1970s for “precision journalism” in which journalists adopt the tools and methods of the social and behavioral sciences as well as its norms of sharing data and replicating research. Actually, if you stopped reading now and promised to read his 2011 Hedy Lamarr Lecture, I won’t even be mad.

The remainder of this post is an attempt to demonstrate some ideas of what an “open collaboration” model for data journalism might look like. To that end, this article tries to do many things for many audiences, which admittedly makes it hard for any single person to read. Let me try to sketch some of these out now and send you off on the right path.

  • First, I use an article Walt Hickey of FiveThirtyEight published on the relationship between the financial performance of films and the extent to which they grant their female characters substantive roles as a case to illustrate some pitfalls in both the practice and interpretation of statistical data. This is a story about having good questions, ambiguous models, wrong inferences, and exciting opportunities for investigation going forward. If you don’t care for code or statistics, you can start reading at “The Hook” below and stop after “The Clip” below.
  • Second, for those readers who are willing to pay what one might call the “Iron Price of Data Journalism”, I go “soup to nuts” and attempt to replicate Hickey’s findings. I document all the steps I took to crawl and analyze this data to illustrate the need for better documentation of analyses and methods. This level of documentation may be excessive or it may yet be insufficient for others to replicate my own findings. But providing this code and data may expose flaws in my technical style (almost certainly), shortcomings in my interpretations (likely), and errors in my data and modeling (hopefully not). I actively invite this feedback via email, tweets, comments, or pull requests and hope to learn from it. I wish new data journalism enterprises adopted the same openness and tentativeness in their empirical claims. You should start reading at “Start Your Kernels…”
  • Third, I want to experiment with styles for analyzing and narrating findings that make both available in the same document. The hope is that motivated users can find the detail and skimmers can learn something new or relevant while being confident they can come back and dive in deeper if they wish. Does it make sense to have the story up front and the analysis “below the fold” or to mix narrative with analysis? How much background should I presume or provide about different analytical techniques? How much time do I need to spend on tweaking a visualization? Are there better libraries or platforms for serving the needs of mixed audiences? This is a meta point as we’re in it now, but it’ll crop up in the conclusion.
  • Fourth, I want to experiment with technologies for supporting collaboration in data journalism by adopting best practices from open collaborations in free software, Wikipedia, and others. For example, this blog post is not written in a traditional content-management system like WordPress, but is an interactive “notebook” that you can download and execute the code to verify that it works. Furthermore, I’m also “hosting” this data on GitHub so that others can easily access the writeup, code, and data, to see how it’s changed over time (and has it ever…), and to suggest changes that I should incorporate. These can be frustrating tools with demoralizing learning curves, but they are incredibly powerful once mastered. Moreover, there are amazing resources and communities who exist to support newcomers and new tools are being released to flatten these learning curves. If data journalists joined data scientists and data analysts in sharing their work, it would contribute to an incredible knowledge commons of examples and cases that is lowering the bars for others who want to learn. This is also a meta point since it exists outside of this story, but I’ll also come back to it in the conclusion.

In this outro to a very unusual introduction, I want to thank Professor Gordon from above, Professor Deen Freelon, Nathan Matias, and Alex Leavitt for their invaluable feedback on earlier drafts of this… post? article? piece? notebook?

The Hook: The Bechdel Test article in FiveThirtyEight

Walt Hickey published an article on April 1 on FiveThirtyEight, titled The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women. The article examines the relationship between movies’ finances and their portrayals of women using a well-known heuristic called the Bechdel test. The test has 3 simple requirements: a movie passes the Bechdel test if there are (1) two women in it, (2) who talk to each other, (3) about something besides a man.

Let me say at the outset, I like this article: It identifies a troubling problem, asks important questions, identifies appropriate data, and brings in relevant voices to speak to these issues. I should also include the disclaimer that I am not an expert in the area of empirical film studies like Dean Keith Simonton or Nick Redfern. I’ve invested a good amount of time in criticizing the methods and findings of this article, but to Hickey’s credit, I also haven’t come across any scholarship that has attempted to quantify this relationship before: this is new knowledge about the world. Crucially, it speaks to empirical scholarship that has exposed how films with award-winning female roles are significantly less likely to win awards themselves [5], older women are less likely to win awards [6], actresses’ earnings peak 17 years earlier than actors’ earnings [7], and differences in how male and female critics rate films [8]. I have qualms about the methods and others may be justified in complaining it overlooks related scholarship like those I cited above, but this article is in the best traditions of journalism that focuses our attention on problems we should address as a society.

Hickey’s article makes two central claims:

  1. We found that the median budget of movies that passed the test…was substantially lower than the median budget of all films in the sample.
  2. We found evidence that films that feature meaningful interactions between women may in fact have a better return on investment, overall, than films that don’t.

I call Claim 1 the “Budgets Differ” finding and Claim 2 the “Earnings Differ” finding. As summarized here, these claims are relatively straightforward to test: is there an effect of Bechdel scores on earnings and budgets after controlling for other explanatory variables?

But before I even get to running the numbers, I want to examine the claims Hickey made in the article. His interpretations of the return on investment findings are particularly problematic readings of basic statistics. Hickey reports the following findings from his models (emphasis added).

We did a statistical analysis of films to test two claims: first, that films that pass the Bechdel test — featuring women in stronger roles — see a lower return on investment, and second, that they see lower gross profits. We found no evidence to support either claim.

On the first test, we ran a regression to find out if passing the Bechdel test corresponded to lower return on investment. Controlling for the movie’s budget, which has a negative and significant relationship to a film’s return on investment, passing the Bechdel test had no effect on the film’s return on investment. In other words, adding women to a film’s cast didn’t hurt its investors’ returns, contrary to what Hollywood investors seem to believe.

The total median gross return on investment for a film that passed the Bechdel test was $2.68 for each dollar spent. The total median gross return on investment for films that failed was only $2.45 for each dollar spent.

…On the second test, we ran a regression to find out if passing the Bechdel test corresponded to having lower gross profits — domestic and international. Also controlling for the movie’s budget, which has a positive and significant relationship to a film’s gross profits, once again passing the Bechdel test did not have any effect on a film’s gross profits.

Both models (whatever their faults, and there are some as we will explore in the next section) apparently produce an estimate that the Bechdel test has no effect on a film’s financial performance. That is to say, the statistical test could not determine with a greater than 95% confidence that the correlation between these two variables was greater or less than 0. Because we cannot confidently rule out the possibility of there being zero effect, we cannot make any claims about its direction.

Hickey argues that passing the test “didn’t hurt its investors’ returns”, which is to say there was no significant negative relationship, but neither was there a significant positive relationship: the model provides no evidence of a positive correlation between Bechdel scores and financial performance. However, Hickey switches gears and, in the conclusion, writes:

…our data demonstrates that films containing meaningful interactions between women do better at the box office than movies that don’t

I don’t know what analysis supports this interpretation. The analysis Hickey just performed, again taking the findings at face value, concluded that “passing the Bechdel test did not have any effect on a film’s gross profits,” not “passing the Bechdel test increased the film’s profits.” While Bayesians will cavil about frequentist assumptions — as they are wont to do — and the absence of evidence is not evidence of absence, the “Earnings Differ” finding is not empirically supported under any appropriate interpretation of the analysis. The appropriate conclusion from Hickey’s analysis is “there is no relationship between the Bechdel test and financial performance,” which he makes… then ignores.
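The decision rule at issue here can be made explicit. This is a generic sketch of frequentist interpretation with made-up numbers, not Hickey's actual estimates:

```python
def directional_claim(b, se, z=1.96):
    """What a 95% confidence interval licenses you to say about a coefficient."""
    lo, hi = b - z * se, b + z * se
    if lo > 0:
        return "significantly positive"
    if hi < 0:
        return "significantly negative"
    return "no claim in either direction"

# A hypothetical coefficient of 0.10 with standard error 0.20: the interval
# [-0.29, 0.49] straddles zero, so neither "helps" nor "hurts" follows.
print(directional_claim(0.10, 0.20))  # no claim in either direction
```

A model whose estimate falls in the third branch supports exactly one sentence, "we cannot distinguish the effect from zero," and that sentence does not become "does better at the box office" in a conclusion.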

What to make of this analysis? In the next section, I summarize the findings of my own analysis of the same data. In the subsequent sections, I attempt to replicate the findings of this article, and in so doing, highlight the perils of reporting statistical findings without adhering to scientific norms.

The Clip: Look in here for what to tweet

I tried to retrieve and re-analyze the data that Hickey described in his article, but came to some conclusions that were the same, others that were very different, and still others that I hope are new.

I was able to replicate some of his findings but not others, because specific decisions had to be made about the data or modeling that dramatically change the results of the statistical models. However, the article provides no specifics, so we’re left to wonder when and where these findings hold, which points to the need for openness in sharing data and code. Specifically, while Hickey found that women’s representation in movies had no significant relationship with revenue, I found a positive and significant relationship.

But the questions and hypotheses Hickey posed about systematic biases in Hollywood were also the right ones. With a reanalysis using different methods as well as adding in new data, I found statistically significant differences in popular ratings also exist. These differences persist in the face of other potential explanations about differences arising because of genres, MPAA ratings, time, and other effects.

In the image below, we see that movies that have non-trivial women’s roles get 24% lower budgets, make 55% more revenue, get better reviews from critics, and face harsher criticism from IMDB users. Bars that are faded out mean my models are less confident about these findings being non-random while bars that are darker mean my models are more confident that this is a significant finding.

Movies passing the Bechdel test (the red bars):

  • …receive budgets that are 24% smaller
  • …make 55% more revenue
  • …are awarded 1.8 more Metacritic points by professional reviewers
  • …are awarded 0.12 fewer stars by IMDB’s amateur reviewers



Read the entire replication here.



The four main findings from this analysis of the effects of women’s roles in movies are summarized in the chart above. These four findings point to a paradox in which movies that pass an embarrassingly low bar for female character development make more money and are rated more highly by critics, but have to deal with lower budgets and more critical community responses. Is this definitive evidence of active discrimination in the film industry and culture? No, but it suggests systemic prejudices are contributing to producers irrationally ignoring significant evidence that “feminist” films make them more money and earn higher praise.

The data that I used here was scraped from public-facing websites, but there may be reasons to think that these data are inaccurate by those who are more familiar with how they’re generated. Similarly, the models I used here are simple Stats 101 ordinary least squares regression models with some minor changes to account for categorical variables and skewed data. There are no Bayesian models, no cross-validation or bootstrapping, and no exotic machine learning methods here. But in making the data available (or at least the process for replicating how I obtained my own data), others are welcome to perform and share the results of such analyses — and this is ultimately my goal in asking data journalism to adopt the norms of open collaboration. When other people take their methodological hammers or other datasets to the finding and still can’t break it, we have greater confidence that the finding is “real”.
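For readers wondering how percentage effects like "24% smaller budgets" fall out of such models: with a log-transformed outcome (the standard fix for skewed financial data), a dummy variable's coefficient is a difference in logs, and the implied percentage effect is exp(b) − 1. The coefficients below are invented for illustration, not taken from my models:

```python
import math

# With a log-transformed outcome, a dummy's coefficient b is a difference in
# logs, so the implied multiplicative effect is exp(b) - 1.
# These coefficients are made up for illustration, not actual estimates.
b_revenue = 0.438   # hypothetical coefficient on "passes Bechdel", log-revenue model
b_budget = -0.274   # hypothetical coefficient, log-budget model

revenue_effect = math.exp(b_revenue) - 1  # about +0.55, i.e., ~55% more revenue
budget_effect = math.exp(b_budget) - 1    # about -0.24, i.e., ~24% smaller budgets

print(f"revenue: {revenue_effect:+.0%}, budget: {budget_effect:+.0%}")
# revenue: +55%, budget: -24%
```

Note the asymmetry: a coefficient of 0.438 does not mean "43.8% more"; the exponentiation matters, especially for large coefficients.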

But the length and technical complexity of this post also raise the question: who is the audience for this kind of work? Journalistic norms emphasize quick summaries, turned around rapidly, with opaque discussions of methods and analysis and definitive claims. Scientific norms emphasize more deliberative and transparent processes that prize abstruse discussions and tentative claims about their “truth”. I am certainly not saying that Hickey should have included the output of regression models in his article; 99% of people won’t care to see that. But in the absence of soliciting peer review of this research, how are we as analysts, scientists, and journalists to evaluate the validity of the claims unless the code and data are made available for others to inspect? Even this is a higher bar than many scientific publications hold their authors to (and I’m certainly guilty of not doing more to make my own code and data available), but it should be the standard, especially for a genre of research like data journalism where the claims reach such large audiences.

However, there are exciting technologies for supporting this kind of open documentation and collaboration. I used an “open notebook” technology called IPython Notebook to write this post so that the text, code, and figures I generated are all stitched together into one file. You’re likely reading this post on a website that lets you view any such notebook on the web, where other developers and researchers share code for doing all manner of data analysis. Unfortunately, the notebook was never intended as a word processing or blogging tool, so the lack of features such as dynamic layout options or spell-checking will frustrate many journalists (apologies for the typos!). However, there are tools for customizing the CSS so that it plays well (see here and here). The code and data are hosted on GitHub, which is traditionally used for software collaboration, but its features for letting others discuss problems in my analysis (the issue tracker) or propose changes to my code (pull requests) promote critique, deliberation, and improvement. I have no idea how these will work in the context of a journalistic project, and to be honest, I’ve never used them this way before, but I’d love to try and see what breaks.

Realistically, practices only change if there are incentives to do so. Academic scientists aren’t awarded tenure for writing well-trafficked blogs or high-quality Wikipedia articles; they are promoted for publishing rigorous research in competitive, peer-reviewed outlets. Likewise, journalists aren’t promoted for providing meticulously documented supplemental material or replicating other analyses instead of contributing to coverage of a major news event. Amid contemporary anxieties about information overload as well as the weaponization of fear, uncertainty, and doubt tactics, data-driven journalism could serve a crucial role in empirically grounding our discussions of policies, economic trends, and social changes. But unless the field’s new leaders set and enforce standards that emulate the scientific community’s norms, data-driven journalism risks falling into traps that undermine both the public’s and the scientific community’s trust.

This suggests several models going forward:

  • Open data. Data-driven journalists could share their code and data on open repositories like GitHub for others to inspect, replicate, and extend. But as any data librarian will rush to tell you, there are non-trivial standards for ensuring that data are documented, complete, well-formatted, and in non-proprietary formats.
  • Open collaboration. Journalists could collaborate with scientists and analysts to pose questions that they jointly analyze and then write up as articles or features, as well as submitting the work for academic peer review. But peer review takes time, and publishing results in advance of this review, even when working with credentialed experts, doesn’t guarantee their reliability.
  • Open deliberation. Organizations that practice data-driven journalism (to the extent this differs from other flavors of journalism) should invite and respond to empirical critiques of their analyses and findings. Making well-documented data available or finding the right experts to collaborate with is extremely time-intensive, but if you’re going to publish original empirical research, you should accept and respond to legitimate critiques.
  • Data ombudsmen. Data-driven news organizations might consider appointing independent advocates to represent public interests and promote scientific norms of communalism, skepticism, and empirical rigor. Such a position would serve as a check against authors making sloppy claims, using improper methods, analyzing proprietary data, or acting for their personal benefit.

I have very much enjoyed thinking through many of these larger issues and confronting the challenges of the critiques I’ve raised. I look forward to your feedback, and I very much hope this drives conversations about what kind of science data-driven journalism hopes to become.


Does Wikipedia editing activity forecast Oscar wins?

Data, Wikipedia

The Academy Awards just concluded, and much will be said about Ellen DeGeneres’s most-retweeted tweet (my coauthors and I have posted an analysis here showing that these “shared media” or “livetweeting” events disproportionately award attention to already elite users on Twitter). I thought I’d use the time to try to debug some code I’m using to retrieve editing activity information from Wikipedia.

A naive but simple theory I wanted to test was whether editing activity could reliably forecast Oscar wins. Academy Awards are selected from approximately 6,000 ballots and the process is known for intensive lobbying campaigns to sway voters as well as tapping into the zeitgeist about larger social and cultural issues.

I assume that some of this lobbying and zeitgeist activity would manifest, in the aggregate, in edits to the English Wikipedia articles about the nominees. In particular, I measure two quantities: (1) the number of changes (revisions) made to an article and (2) the number of new editors making those revisions. The hypothesis is simply that the nominees whose articles have the most revisions and the most new editors should win. I look specifically at the window between the announcement of the nominees in early January and March 1 (an arbitrary cutoff).
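To make the measurement concrete, here is a minimal standard-library sketch of counting revisions and editors through the MediaWiki API. The timestamps and the use of distinct editors as a stand-in for “new” editors are my illustrative assumptions, not necessarily what my actual (buggy, as noted below) code does.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def fetch_revisions(title, start, end):
    """Fetch all revisions of an article between two timestamps
    (oldest first), following the API's continuation protocol."""
    revisions, params = [], {
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "user|timestamp", "rvlimit": "max",
        "rvstart": start, "rvend": end, "rvdir": "newer",
        "format": "json",
    }
    while True:
        url = API + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        revisions.extend(page.get("revisions", []))
        if "continue" not in data:
            break
        params.update(data["continue"])
    return revisions

def summarize(revisions):
    """Return (revision count, distinct editor count). Distinct editors
    in the window are a proxy for "new" editors; finding truly
    first-time contributors would also require the earlier history."""
    return len(revisions), len({r.get("user") for r in revisions})

# e.g. summarize(fetch_revisions("Gravity (2013 film)",
#     "2014-01-01T00:00:00Z", "2014-03-01T00:00:00Z"))
```

Running this for each nominee’s article and picking the nominee with the highest counts is the whole forecasting heuristic.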

I’ve only run the analysis on the nominees for Best Picture, Best Director, Best Actor, Best Actress, and Best Supporting Actress (the Best Supporting Actor nominees were throwing some unusual errors, but I’ll update). The results below show that Wikipedia editing activity forecast the wins in Best Actor, Best Actress, and Best Supporting Actress, but not in Best Picture or Best Director. This is certainly better than chance, and I look forward to expanding the analysis to other categories and prior years.

Best Picture

“The Wolf of Wall Street” showed remarkably strong growth in the number of new edits and editors after January 1. However, “12 Years a Slave”, which ranked 5th by the end, actually won the award. A big miss.

[Figures: new editors and new edits for the Best Picture nominees]

Best Director

The Wikipedia activity here showed strong growth for Steve McQueen (“12 Years a Slave”), but Alfonso Cuaron (“Gravity”) took the award despite coming in 4th on both metrics. Another big miss.

[Figures: new editors and new edits for the Best Director nominees]

Best Actor

The counts of new edits and new editors are highly correlated because new editors necessarily show up as new edits. However, we see an interesting and very close race between Chiwetel Ejiofor (“12 Years a Slave”) and Matthew McConaughey (“Dallas Buyers Club”) on edits, with McConaughey holding a stronger lead among new editors. This suggests older editors were responsible for pushing Ejiofor higher (and he was leading early on), but McConaughey took the lead and ultimately won. Wikipedia won this one.


Best Actress

Poor Judi Dench: she appeared not even to be in the running on either metric. Wikipedia activity forecast a Cate Blanchett (“Blue Jasmine”) win, although the race appeared close among several candidates, if the construct is to be believed. Wikipedia won this one.

[Figures: new editors and new edits for the Best Actress nominees]

Best Supporting Actress

Lupita Nyong’o (“12 Years a Slave”) accumulated a huge lead over the other nominees in Wikipedia activity and won the award.



Other Categories and Future Work

I wasn’t able to run the analysis for Best Supporting Actor because the Wikipedia API seemed to poop out on the Bradley Cooper queries, though it may be a deeper bug in my code. This analysis could certainly be extended to the “non-marquee” nominee categories as well, but I didn’t feel like typing that much.

I will extend this analysis to both other categories and prior years’ awards to see if there are any discernible patterns for forecasting. There may be considerable variance between categories in the reliability of this simple heuristic; Director and Picture may be more politicized than the rest, if I wanted to defend my simplistic model. This type of approach might also be used to compare different awards shows to see if some diverge more than others from aggregate Wikipedia preferences. The hypothesis here is a simple descriptive heuristic, and more extensive statistical models that incorporate features such as revenue, critics’ scores, and nominees’ award histories (“momentum”) may produce more reliable results as well.
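As one direction, here is a hedged sketch of how such features might be folded into a simple classifier. The feature values are invented for illustration, and scikit-learn’s LogisticRegression is just one plausible modeling choice, not the method used above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-nominee features for one category; a real model
# would combine Wikipedia activity counts with revenue, critics'
# scores, and prior award "momentum", pooled over several years.
X = np.array([
    # new_edits, new_editors, metacritic
    [310, 140, 88],
    [120,  60, 75],
    [400, 190, 91],   # the winner in this toy example
    [260, 110, 80],
    [ 90,  35, 79],
])
won = np.array([0, 0, 1, 0, 0])  # one winner per category

# Regularized logistic regression of winning on the features.
clf = LogisticRegression(max_iter=1000).fit(X, won)

# Forecast: the nominee with the highest predicted win probability.
probs = clf.predict_proba(X)[:, 1]
print(probs.argmax())
```

Comparing each category’s predicted winner against the actual winner across prior years would give a proper error rate for the model versus the raw edit-count heuristic.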


Wikipedia editing activity over the two months leading up to the 2014 Academy Awards accurately forecast the winners of the Best Actor, Best Actress, and Best Supporting Actress categories but missed the winners of Best Picture and Best Director. These results suggest that differences in editing behavior may, in some cases, reflect collective attention to and aggregate preferences for some nominees over others. Because Wikipedia is a major clearinghouse for individuals who both seek and shape popular perceptions, these behavioral traces may have significant implications for forecasting other types of popular preference aggregation, such as elections.