The News on Wikipedia in 2014

It’s that time of year when everyone writes either a year-in-review article or a predictions-for-next-year article. The Wikimedia Foundation offered one of its own that showed the remarkable capacity of Wikipedia to support collaboration around current events.

As many of you know, my dissertation examined how Wikipedia covers breaking news events. I’ve remained interested in this topic and I was very excited to see the Wikimedia Foundation prominently acknowledge this use case.

In this post I want to go deeper and look at what happened on Wikipedia around current events this year. These data are so rich and multifaceted that they can speak to all kinds of research questions that are beyond the scope of a dissertation, much less a blog post, to fully explore. Such questions can look at several different kinds of data at several different levels of analysis: changes in an article’s text over time, differences in contribution patterns between articles, and differences between different languages’ articles about the same events. The precipitating events and consequences following any given news story are complex, but there are nevertheless patterns in our attention to these events and Wikipedia’s responses to them. A central question I wanted to explore was how different news events translated into different levels of editorial and popular attention on Wikipedia: as judged by the “wisdom of crowds” of Wikipedia editors across 19 languages, what were the top stories in 2014?

The data

As remarkable as Wikipedia is as a collaborative endeavor, it’s likewise peerless as a data source. Every single change to every single article going back to 2002 is not only public, but also easily accessible through a powerful API with very generous permissions and rate limits compared to the ever-shrinking scope of what other platforms like Twitter and Facebook provide. Couple this with the length of that record and you have a social media data source that was not only internationally popular a half-decade before Twitter was a household name, but has continued to remain relevant (though facing strong headwinds on many fronts). As a result, a Wikipedia researcher analyzing a breaking event is likely to be much less stressed than a Twitter researcher, for whom the data quickly becomes difficult to access post hoc.

Take, for example, how perceptions of presumptive U.S. presidential candidate Hillary Rodham Clinton have changed over time. On Wikipedia, you can look at every change made to her article since it was created in 2001. In addition to this longitudinal textual data, there’s also metadata about who has edited what and when as well as “paradata” about the larger context of discussions and regulations of editing behavior. Also throw in the fact that there are different language editions of Wikipedia, each of which has a different take on her. On Twitter you can try to track tweets that match keywords, but getting historical data will be technically and/or financially expensive.

For this analysis I used a pure-Python workflow to scrape, structure, clean, and visualize the data. On top of Anaconda, I used the python-wikitools library as a Pythonic wrapper around the Wikipedia API, pandas to handle data formatting and manipulation, seaborn to prettify the default Matplotlib visualizations and do LOWESS fitting, and Gephi for the social network visualizations. All the code is available as an IPython Notebook (view it here) with some additional diagnostics and documentation included and the data I’ve scraped for this analysis is available on my GitHub repo for this project. Personally, my big win from this project was getting pretty dangerous at manipulating and customizing Matplotlib figures that are spit out of plotting functions in pandas and seaborn. Sure it’s mundane, but I think data viz is an important skill and hopefully you find the figures here interesting :)
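To give a flavor of what pulling revision history looks like, here is a minimal sketch that talks to the MediaWiki API with only the standard library. It is not the notebook's code: I used python-wikitools, whereas this sketch swaps in plain `urllib`, and the helper names (`revision_query_params`, `parse_revisions`, `fetch_revisions`) and the User-Agent string are made up for illustration. It also fetches a single batch and ignores continuation, so a heavily edited article would need a loop.

```python
import json
import urllib.parse
import urllib.request

# English edition endpoint; other language editions swap the subdomain.
API_URL = "https://en.wikipedia.org/w/api.php"


def revision_query_params(title, limit=500):
    """Build query parameters for one batch of an article's revision history."""
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|size",
        "rvlimit": str(limit),
        "format": "json",
    }


def parse_revisions(response):
    """Flatten an API response dict into (timestamp, user, size) tuples."""
    rows = []
    for page in response.get("query", {}).get("pages", {}).values():
        for rev in page.get("revisions", []):
            rows.append((rev.get("timestamp"), rev.get("user"), rev.get("size")))
    return rows


def fetch_revisions(title):
    """Fetch one batch of revisions over the network (no continuation handling)."""
    url = API_URL + "?" + urllib.parse.urlencode(revision_query_params(title))
    # Hypothetical User-Agent; Wikimedia asks clients to identify themselves.
    request = urllib.request.Request(url, headers={"User-Agent": "zeitgeist-sketch/0.1"})
    with urllib.request.urlopen(request) as handle:
        return parse_revisions(json.load(handle))
```

Calling `fetch_revisions("Malaysia Airlines Flight 370")` would return the most recent batch of edits as tuples that are easy to hand to pandas.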

Cross-language comparisons

I began by scraping all the “zeitgeist” report content for January through October 2014 across 19 language editions of Wikipedia. This zeitgeist ranking sorts articles based on the number of unique editors they had in each month. I then took the zeitgeist rankings for the 18 Wikipedia languages with the most pageviews after English (Russian, Spanish, German, Japanese, French, Chinese, Italian, Polish, Portuguese, Dutch, Turkish, Arabic, Swedish, Indonesian, Korean, Czech, Farsi, and Ukrainian). The articles listed in the table below are sorted by the total number of editors across months in the top 25 multiplied by the number of months in the zeitgeist list for January through October.

Language | 1 | 2 | 3
Arabic | كريستيانو رونالدو | ريال مدريد | السعودية
Chinese | 世間情 | 太陽花學運 | 马来西亚航空370号班机空难
Czech | Válka na východní Ukrajině | Euromajdan | Minecraft
Dutch | Lijst van personen overleden in 2014 | Malaysia Airlines-vlucht 17 | Eurovisiesongfestival 2014
English | Deaths in 2014 | Malaysia Airlines Flight 370 | Islamic State of Iraq and the Levant
Farsi | دولت اسلامی عراق و شام | ایل ملکشاهی | مهران مدیری
French | État islamique (organisation) | Manuel Valls | Dieudonné
German | Krise in der Ukraine 2014 | Alternative für Deutschland | Fußball-Weltmeisterschaft 2014
Indonesian | JKT48 | NET. | Joko Widodo
Italian | Juventus Football Club | Campionato mondiale di calcio 2014 | Serie A 2013-2014
Japanese | 仮面ライダー鎧武/ガイム | 烈車戦隊トッキュウジャー | ハピネスチャージプリキュア!
Korean | 대한민국 | 일베저장소 | 세월호 침몰 사고
Polish | Robert Lewandowski | 2014 | Euromajdan
Portuguese | Em Família (telenovela) | Copa do Mundo FIFA de 2014 | Campeonato Brasileiro de Futebol de 2014 – Sér…
Russian | Список умерших в 2014 году | Вооружённый конфликт на востоке Украины (2014) | Донецкая Народная Республика
Spanish | Copa Mundial de Fútbol de 2014 | Podemos (partido político) | Copa Sudamericana 2014
Swedish | Sverigedemokraterna | Avlidna 2014 | Feministiskt initiativ
Turkish | Türkiye | Recep Tayyip Erdoğan | Mustafa Kemal Atatürk
Ukrainian | Війна на сході України | Небесна сотня | Ленінопад
Top Zeitgeist Articles in 2014 across 19 languages

I don’t read or speak the vast majority of these other languages, so trying to rank the top articles across languages would seem to require passing all of these titles into Google Translate and then trying to manually map the translations onto similar topics, a messy and culturally fraught process. Alternatively, I could just adapt a method my former Northwestern labmates Patti Bao and Brent Hecht used to translate Wikipedia articles by following Wikipedia’s inter-language links. Scott Hale at the Oxford Internet Institute has also done some brilliant work around multilingual Wikipedians. Basically, multilingual Wikipedia editors can link from the English article about the “2014 FIFA World Cup” to the Spanish article “Copa Mundial de Fútbol de 2014”, which covers the same topic and concept but follows a different naming convention.

For each of the over 100 top articles for each of the 18 languages I crawled the inter-language links to try to connect each language’s most-edited articles to articles in other languages. Unfortunately these links across language versions are incomplete (an article about the topic/concept doesn’t exist in other languages) or imperfect (language A links to the concept in language B but B doesn’t link back to A). In the figure below, each row corresponds to the top articles for that language and each column is the rank of an article. Cells are red when the article in that language has a link to an article in English and purple when there’s no English-language article for that language’s top article. I’ve ranked the rows based on how much English coverage they have: Italian has English-language articles for 95 articles in its top 100 list while Indonesian only has English-language articles for 55 articles in its top 100 list.
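Following inter-language links can be sketched roughly as below. This is an illustration under assumptions rather than the notebook's code: the helper names are hypothetical, the parsing assumes the API's default JSON layout (where a linked title sits under the `"*"` key), and a real crawl would batch titles and handle continuation.

```python
def langlink_query_params(titles, target_lang="en"):
    """Build query parameters asking for each title's link into target_lang."""
    return {
        "action": "query",
        "prop": "langlinks",
        "titles": "|".join(titles),  # the API caps how many titles fit in one request
        "lllang": target_lang,
        "lllimit": "max",
        "format": "json",
    }


def parse_langlinks(response):
    """Map each source title to its linked title, or None when no link exists."""
    mapping = {}
    for page in response.get("query", {}).get("pages", {}).values():
        links = page.get("langlinks", [])
        mapping[page["title"]] = links[0]["*"] if links else None
    return mapping


def coverage_fraction(mapping):
    """Fraction of source titles that have a counterpart in the target language."""
    if not mapping:
        return 0.0
    linked = sum(1 for target in mapping.values() if target is not None)
    return linked / len(mapping)
```

Under this sketch, a language whose top 100 list yields 95 non-`None` targets has a coverage fraction of 0.95, which is the quantity used to order the rows in the figure.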


Fraction of articles in each language with corresponding English version


Note that this is not saying that English and Italian have similar top 100 lists, only that the concepts in the Italian Wikipedia’s most-edited articles of 2014 also tend to have corresponding English Wikipedia articles. In other words, English and Italian have similar coverage in the topics and concepts they talk about while English and Indonesian have very different coverage. As this is an admittedly Anglo-centric perspective, we could measure whether the concepts in the Italian zeitgeist list also have articles in Farsi, Ukrainian, or any of the other 18 languages here. Instead of showing all of these charts, I’m going to aggregate them all together into one image where the cell values correspond to the number of other languages that also have an article about that topic/concept. The rows have been ranked by how many languages have articles about their concepts: concepts in the Spanish and Italian Wikipedias’ 2014 zeitgeist lists also have corresponding articles in more other languages than the concepts in the Indonesian and Chinese 2014 zeitgeist lists. Put another way, Spanish editors were contributing to articles with greater international appeal than Indonesian editors.


Number of other languages having a version of a language’s top Zeitgeist article


Reading the chart from left to right, there’s a general trend for the top zeitgeist articles in a given language to be present across many other languages and lower zeitgeist articles to be rarer. This is borne out more clearly in the chart below, where the top-ranked article across all languages had coverage across 15 other languages on average but the 100th-ranked article had coverage across only 12 other languages on average. This suggests that the rankings of top articles in each language are meaningful in terms of differentiating stories of narrow versus broad interest.


Correlation between article rank and language coverage


Having “translated” these top zeitgeist articles by following inter-language links, I can reproduce the top 3 zeitgeist table from above, but using English-language concepts. There’s some really fascinating variance in terms of the topics that attracted the most attention across each language: the Arabic Wikipedia appears consumed with Cristiano Ronaldo and Real Madrid rather than current events while Russian Wikipedia is focused on the events in Ukraine. There are likewise local topics, like the 4chan-like website “Ilbe Storehouse” in Korean or the “Sunflower Student Movement” in Chinese, that will likely be omitted from most Anglophone year-in-review lists.

Language | 1 | 2 | 3
Arabic | Cristiano Ronaldo | Real Madrid C.F. | Saudi Arabia
Chinese | NaN | Sunflower Student Movement | Malaysia Airlines Flight 370
Czech | War in Donbass | Euromaidan | Minecraft
Dutch | Deaths in 2014 | Malaysia Airlines Flight 17 | Eurovision Song Contest 2014
English | Deaths in 2014 | Malaysia Airlines Flight 370 | Islamic State of Iraq and the Levant
Farsi | Islamic State of Iraq and the Levant | NaN | Mehran Modiri
French | Islamic State of Iraq and the Levant | Manuel Valls | Dieudonné M’bala M’bala
German | War in Donbass | Alternative for Germany | 2014 FIFA World Cup
Indonesian | NaN | NaN | Joko Widodo
Italian | Juventus F.C. | 2014 FIFA World Cup | 2013–14 Serie A
Japanese | Kamen Rider Gaim | Ressha Sentai ToQger | HappinessCharge PreCure!
Korean | South Korea | Ilbe Storehouse | Sinking of the MV Sewol
Polish | Robert Lewandowski | 2014 | Euromaidan
Portuguese | Em Família (telenovela) | 2014 FIFA World Cup | 2014 Campeonato Brasileiro Série A
Russian | Deaths in 2014 | War in Donbass | Donetsk People’s Republic
Spanish | 2014 FIFA World Cup | Podemos (Spanish political party) | 2014 Copa Sudamericana
Swedish | Sweden Democrats | Deaths in 2014 | Feminist Initiative (Sweden)
Turkish | Turkey | Recep Tayyip Erdoğan | Mustafa Kemal Atatürk
Ukrainian | War in Donbass | List of people killed during Euromaidan | NaN

Now that (most) concepts across different languages have been mapped to English-language (or any other language) concepts, we can begin to rank them. The first ranking sorts articles based on (1) appearing in some language’s zeitgeist list and (2) on the number of languages having an article about that topic (for brevity, the figure below only includes articles that appear in 5 or more languages). Again, it’s important to remember that the method I’ve used here is not to select articles that were created in 2014, but rather only to look at those articles that were most heavily edited across languages, which reflects a really profound and interesting bias towards current events.


Rank of zeitgeist topics by occurrence in other languages


Rather than looking at the breadth of coverage across language versions of Wikipedia, the level of activity on these pages can be used as a metric for ranking. The particular metric I’ll use is the combination of the number of users who edited an article across the months it was in the zeitgeist list, the number of months it was in the zeitgeist list, and the number of languages having the article in its zeitgeist list. So basically (users * months * languages), which produces a large number that can break ties were we to use any single one of the other metrics (like number of languages in the previous figure). The ranking based on this combined activity metric is below and reveals some topics like the World Cup and the deaths of notable people as being consistently interesting topics across time and language (for more on how Wikipedia covers the death of people, see my forthcoming CSCW 2015 paper with Jed Brubaker). The ongoing conflict in Ukraine is likewise mobilizing sustained and widespread attention focused across several articles (War in Donbass, Ukraine, 2014 Crimean crisis, Euromaidan, 2014 pro-Russian unrest in Ukraine, etc.). How to aggregate the activity on these different articles related to a similar concept together could be interesting future work.
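The combined metric is just a product, which a short sketch makes concrete. The function name and the record layout are hypothetical, and I am assuming here that "users" means editor counts summed over every month and language in which the topic appeared; the notebook may aggregate differently.

```python
from collections import defaultdict


def activity_scores(records):
    """Score each topic as users * months * languages.

    records: iterable of (topic, language, month, unique_editors) tuples,
    one per appearance of a topic in some language's monthly zeitgeist list.
    """
    users = defaultdict(int)
    months = defaultdict(set)
    languages = defaultdict(set)
    for topic, language, month, n_editors in records:
        users[topic] += n_editors      # assumption: sum editors over all appearances
        months[topic].add(month)
        languages[topic].add(language)
    scores = {t: users[t] * len(months[t]) * len(languages[t]) for t in users}
    # Highest activity first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With toy numbers, a topic listed in two languages over two months with 900 editors in total scores 900 * 2 * 2 = 3600, far ahead of a topic with 50 editors in one language for one month.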


Rank of articles based on combinations of number of editors, months in zeitgeist ranking, and number of languages


Now that we have mapped articles to the same topics, we can finally compare languages for the similarity in ranks of their top articles. Iterating over every pair of languages, I compare how similar the ranks of the same articles were: if “Robin Williams” and “2014 FIFA World Cup” were 1 and 2 in English (hypothetically), were they also 1 and 2 in Italian? I use a cosine similarity metric to perform this comparison over the vector of the top 100 articles to produce a symmetric contingency table showing the similarity in rankings for every pair of languages for the two ranking metrics we used above. The columns are ordered by average similarity score: English appears to have the most similar rankings with every other language, followed by Russian and Polish, while Turkish and Indonesian have very different rankings for their top articles from the rest of the languages.
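For readers who haven't met cosine similarity, here is a small sketch of the comparison. How to turn a ranked list into a vector is a modeling choice: this version assumes each language is a dict of topic scores and treats a topic missing from a language's list as zero, which may not match the notebook exactly. The function names are illustrative.

```python
import math


def cosine_similarity(scores_a, scores_b):
    """Cosine similarity between two {topic: score} dicts over their shared topic space."""
    topics = set(scores_a) | set(scores_b)
    dot = sum(scores_a.get(t, 0.0) * scores_b.get(t, 0.0) for t in topics)
    norm_a = math.sqrt(sum(v * v for v in scores_a.values()))
    norm_b = math.sqrt(sum(v * v for v in scores_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty list is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(scores_by_language):
    """Pairwise similarities keyed by (language_a, language_b); symmetric by construction."""
    langs = sorted(scores_by_language)
    return {
        (a, b): cosine_similarity(scores_by_language[a], scores_by_language[b])
        for a in langs
        for b in langs
    }
```

Two languages with identical scores for identical topics come out at 1, and two languages with no topics in common come out at 0.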


Cosine similarity between languages’ ranking of zeitgeist articles


The table below ranks the similarity of language pairs based on the editor-language-month activity score. My folk hypothesis going into this was that English and German would have the highest similarities in rankings, reflecting their shared roles as economic and policy agenda setters in North America and Europe respectively as well as their leadership in many Wikipedia-related matters. But Farsi and Dutch leap out of nowhere to seize a clear lead in having the greatest similarity in rankings, and I don’t have a good explanation for this. Inspecting their respective top lists, it appears both are more cosmopolitan in their orientation to global rather than local news, especially on topics like ISIL, and pay less attention to the Ukrainian crisis. Ignoring the Indonesian Wikipedia’s dissimilarity from every other language, we see that language combinations like Ukrainian and Swedish or Japanese and Farsi are very dissimilar in their rankings of top stories.

Rank | Highest similarities: Language 1 | Language 2 | Cosine similarity | Lowest similarities: Language 1 | Language 2 | Cosine similarity
1 | Farsi | Dutch | 0.747684 | Ukrainian | Swedish | 0.094924
2 | French | English | 0.639927 | Japanese | Farsi | 0.115249
3 | Czech | Polish | 0.609603 | Japanese | English | 0.136084
4 | Polish | English | 0.593232 | Chinese | Czech | 0.160647
5 | German | English | 0.592578 | Ukrainian | Dutch | 0.169325
6 | Farsi | Russian | 0.587311 | Turkish | Arabic | 0.174961
7 | Portuguese | Farsi | 0.583782 | Japanese | Spanish | 0.175202
8 | Chinese | Spanish | 0.580538 | Turkish | Ukrainian | 0.177230
9 | Farsi | Italian | 0.570912 | Ukrainian | Farsi | 0.181022
10 | French | Polish | 0.569966 | Turkish | German | 0.186484


Comparing across (English) articles

Informed by the “international consensus” of top articles from the analysis above as well as my very imperfect editorial judgment about other topics that have happened since October (remember, the zeitgeist rankings are only available through October at the time of this writing owing to the backlog in parsing the gigantic dump files), I created a list of 23 articles about events in 2014 that includes a mix of international news, domestic US news, as well as science and technology news. My naive goal was to simplify complex events like the Ukrainian crisis by capturing a single article while broadening the scope of the news to include potentially under-represented international events (Chibok kidnappings, Sewol sinking, Soma disaster), more recent domestic events (Ferguson protests, Gamergate controversy, US elections), and science & technology stories (Heartbleed bug, Minecraft purchase, Rosetta mission). I’ve also made the decision to exclude deaths (Robin Williams, Philip Seymour Hoffman, etc.) and planned domestic events (Academy Awards, Super Bowl, etc.) since these articles either already existed, are of limited international interest, or are more predictable. The figure below provides some summary statistics on the number of unique users and revisions made to each of these 23 articles, ranking on the total number of revisions made to these articles between January 1, 2014 and December 22, 2014.


Ranking of 23 news topics by number of total revisions in 2014


There are a number of features that we can look at for each of these articles to understand how these collaborations unfolded over time.

  • The number of users editing on each day reflects both the interest among the editor community as well as the intensity of the coordination challenges the article faces as dozens or hundreds of editors attempt to make changes in the space of a few hours. This is shown in the top subplot in the figure below with articles like Malaysia Airlines Flight 370 and 17 showing clear peaks and rapid fall-offs.
  • The second subplot from the top shows the total number of revisions made each day; again the two MA flights show the strongest peaks. Although the number of users and number of revisions track each other reasonably well in the bursts around a single event, the magnitudes are very different: in the extreme cases, several hundred editors make over 1,000 contributions within a single calendar day.
  • This invites questions around how equitably the editing work in these collaborations is distributed. I measure this by calculating the changes in the cumulative Gini coefficient of the distribution of revisions across editors: values closer to 0 imply all editors are making an equal number of revisions and values closer to 1 imply one editor is making nearly all the contributions. All these articles show a remarkable and almost immediate centralization of effort, and what’s striking to me is that there’s no “relaxation”: this centralization never diminishes over time. In effect, breaking news articles look like a handful of editors do most of the work and, after they move on, editing patterns remain centralized in other editors.
  • Finally, the daily cumulative changes in the article size capture how much content has been added to (or removed from) an article on a given day. There’s something surprising to me here because despite the bursts of activity in the top two subplots, there’s no corresponding burst of content generation. This suggests that much of the work in the early hours of these article collaborations is very incremental: changing a few characters rather than authoring whole new sentences or paragraphs. In fact, the peaks of content generation appear to not be strongly correlated with peaks of user or revision activity. This makes some sense: it’s hard to add an entire paragraph to an article when the story is still unfolding and when there are a hundred other editors trying to make changes at the same time.
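The cumulative Gini coefficient described in the list above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: it uses the standard rank-weighted formula for the Gini coefficient, and the function names and the `(day, user)` input layout are assumptions.

```python
from collections import Counter


def gini(counts):
    """Gini coefficient of non-negative counts: 0 means equal shares, near 1 means one dominant editor."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def cumulative_gini(revisions):
    """Gini of revisions-per-editor at the end of each day.

    revisions: iterable of (day, user) pairs, already sorted by day.
    Returns a list of (day, gini) pairs using all revisions up to that day.
    """
    per_editor = Counter()
    series = []
    current_day = None
    for day, user in revisions:
        if current_day is not None and day != current_day:
            series.append((current_day, gini(per_editor.values())))
        per_editor[user] += 1
        current_day = day
    if current_day is not None:
        series.append((current_day, gini(per_editor.values())))
    return series
```

Note that with n editors the coefficient tops out at (n - 1) / n, so a brand-new article with only a few editors cannot reach the values near 0.8 seen later in the year.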

Changes in features for 23 articles over the year


Below is a table, sorted by revisions, summarizing the cumulative number of revisions and users as well as the most recent Gini coefficients and article sizes for 2014.

Article | Revisions | Users | Gini | Length
Malaysia Airlines Flight 370 | 10336 | 2145 | 0.72 | 224.72
Ebola virus epidemic in West Africa | 7818 | 1374 | 0.77 | 213.27
Islamic State of Iraq and the Levant | 7545 | 1127 | 0.80 | 232.12
2014 Israel–Gaza conflict | 6550 | 857 | 0.79 | 247.04
Malaysia Airlines Flight 17 | 5202 | 796 | 0.71 | 140.85
2014 Crimean crisis | 4271 | 933 | 0.69 | 204.95
2014 Hong Kong protests | 3852 | 417 | 0.83 | 179.21
Gamergate controversy | 3304 | 233 | 0.82 | 106.95
2014 FIFA World Cup | 3107 | 2307 | 0.61 | 136.72
Indian general election, 2014 | 2923 | 482 | 0.72 | 189.40
Scottish independence referendum, 2014 | 2374 | 550 | 0.65 | 231.30
Eurovision Song Contest 2014 | 1963 | 277 | 0.70 | 163.46
2014 Winter Olympics | 1793 | 337 | 0.58 | 96.31
Sinking of the MV Sewol | 1692 | 420 | 0.63 | 150.99
Ice Bucket Challenge | 1647 | 539 | 0.55 | 50.68
2014 Ferguson unrest | 1483 | 380 | 0.62 | 170.76
Felipe VI of Spain | 1231 | 750 | 0.48 | 34.54
Rosetta spacecraft | 1152 | 218 | 0.50 | 79.15
Minecraft | 900 | 1353 | 0.59 | 111.47
Chibok schoolgirl kidnapping | 865 | 202 | 0.49 | 45.20
United States elections, 2014 | 840 | 156 | 0.44 | 29.64
Soma mine disaster | 634 | 159 | 0.55 | 29.07
Cuba–United States relations | 579 | 573 | 0.44 | 58.26

Information consumption patterns

The data above all pertain to the information production side of the equation, but we can also analyze data about information consumption patterns by downloading the pageview information about each of these articles. The figure below plots out the daily pageviews for each of the articles. The large production-side peaks around the two MA flights are less pronounced here and are replaced with peaks for the Winter Olympics, World Cup, and Ice Bucket Challenge. This suggests that there may be a mismatch between the demand for information and supply of volunteers to generate content on the site in response to new events, a question that I’ll get to in a bit.


Daily pageviews for 23 news topics


The cumulative number of pageviews gives us another metric for ranking the biggest news stories of the year. Note that the x-axis is log-scaled and these pageviews actually vary by more than two orders of magnitude: the World Cup article received a total of 15 million pageviews in English alone while the Chibok kidnapping article received only a total of 108,700 pageviews.


Ranking by cumulative pageviews over 2014


These pageview peaks show some different patterns of attention reflecting the unique nature of each event. The “World Cup” article in purple shows a sudden onset when the event starts in June and sustained attention followed by a sudden drop-off after the final matches in July. Contrast this with the MA17 article (yellow), which has rapid onset and fall-off with no sustained attention, and the ISIL article (green), which is characterized by a repeated series of bursts and slower fall-off in attention than the other articles. Finally, the “Minecraft” article only shows a small burst of attention in September following the announcement of its purchase by Microsoft but actually has larger daily levels of attention on average than any of the other articles.


Examples of different classes of pageview bursts


Related to this idea of different classes of burstiness, another metric I can use is what fraction of the pageviews over the year occurred on the day with the maximum pageviews. In other words, which articles were “one hit wonders” that received a lot of attention on one day but no other days? Almost a quarter of the pageviews that the Soma mine disaster article received since the May accident happened on a single day, whereas the Minecraft article received only 2% of its total pageviews this year on its peak day following the announcement of its purchase.
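This burstiness measure is simple enough to state in a few lines. The function name is made up for illustration, and the input is assumed to be a plain list of daily pageview counts for one article.

```python
def peak_share(daily_views):
    """Fraction of an article's total pageviews that fell on its single busiest day."""
    total = sum(daily_views)
    if total == 0:
        return 0.0
    return max(daily_views) / total


def rank_by_burstiness(views_by_article):
    """Sort articles from most to least 'one hit wonder'."""
    shares = {name: peak_share(views) for name, views in views_by_article.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)
```

An article viewed evenly every day of the year would score about 1/365, so values like a quarter indicate attention that is overwhelmingly concentrated on one day.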


Ranking articles by burstiness of attention


This variability in the “spikiness” of attention to different kinds of news events raises the question of how efficiently Wikipedia editors supply revisions in response to demand for information. I’m using the number of revisions as a coarse measure of information production despite the fact that our analyses above showed that many of the revisions during the biggest editing days are very small contributions. However, if Wikipedia is able to meet demand for information by mobilizing editors to supply additional revisions, I would expect that the distribution of points should fall along a diagonal: a standard deviation more in pageviews is matched with a standard deviation more of revisions. The figure below is a scatter plot of all the (standardized) daily pageviews and revision counts for each article on each day of this year. In the upper-right are the articles at times they’re “under stress” with large deviations in attention and collaboration and in the lower-left are the articles under “normal behavior.”
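Standardizing here means converting each article's daily series to z-scores so that articles with very different baselines share one scale. The sketch below assumes each article is standardized against its own mean and population standard deviation; whether the notebook used the population or sample form is not something this sketch settles, and the function names are illustrative.

```python
import statistics


def standardize(series):
    """Convert a series to z-scores using its own mean and population standard deviation."""
    mean = statistics.mean(series)
    sd = statistics.pstdev(series)
    if sd == 0:
        return [0.0] * len(series)  # a flat series has no deviations to report
    return [(x - mean) / sd for x in series]


def supply_demand_points(daily_pageviews, daily_revisions):
    """Pair standardized pageviews (demand) with standardized revisions (supply), day by day."""
    if len(daily_pageviews) != len(daily_revisions):
        raise ValueError("both series must cover the same days")
    return list(zip(standardize(daily_pageviews), standardize(daily_revisions)))
```

If supply matched demand perfectly, every point returned by `supply_demand_points` would sit on the diagonal where the two z-scores are equal.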

I’ve fit LOWESS lines (solid colored lines) to the points to capture different kinds of variation throughout the distribution, but despite the apparent dispersion in the data, there is a strong relationship along the diagonal across many articles. But several articles (US general elections, MV Sewol, and Chibok kidnapping) show a different pattern where revisions to the article out-pace attention to the article. This suggests that the matching hypothesis holds for most articles and Wikipedia is able to meet increased “demand” for information with proportional increases in the “supply” of revisions. Furthermore, using the diagonal as a crude boundary to split this sample of articles into two classes, there’s some interesting variation in articles’ post-burst behavior. Regression lines falling above the diagonal reflect articles where there’s more post-burst attention (pageviews) than post-burst production (revisions). Conversely, regression lines falling below the diagonal reflect articles where there’s more post-burst information production than demand. The articles about the MA flights and Ice Bucket Challenge fall into the former class of articles where Wikipedians are unable to meet demand while most other articles reflect Wikipedians lingering on articles after popular attention has moved on. If Wikipedia contains a set of “ambulance chasing” editors who move from news story to news story for the thrill or attention, who — if anyone — stays behind to steward these articles?

Collaboration network

The data about who edited what articles can be converted into a bipartite network. In this network there are editor nodes, article nodes, and links when an editor made a contribution to an article. The large red nodes are the articles and the smaller blue nodes are the editors. What’s remarkable is that despite the diversity in topics and these events unfolding at different times over an entire year, this network forms a single giant component in which every node is at least indirectly connected to every other node. Each breaking news collaboration is not an island of isolated efforts, but rather draws on editors who have made or will make contributions to other articles about current events. In effect, Wikipedia does in fact have a cohort of “ambulance chasers” (pictured at the center) that move between and edit many current events articles.
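Checking for a giant component does not require a graph library. The sketch below builds the editor-article network as an adjacency dict and finds connected components with a breadth-first search; the function names and the `(editor, article)` input layout are assumptions for illustration, and the figures themselves were drawn in Gephi.

```python
from collections import defaultdict, deque


def build_bipartite(edits):
    """Build an undirected editor-article graph from (editor, article) pairs."""
    graph = defaultdict(set)
    for editor, article in edits:
        # Tag each node with its type so an editor and article can share a name.
        editor_node = ("editor", editor)
        article_node = ("article", article)
        graph[editor_node].add(article_node)
        graph[article_node].add(editor_node)
    return dict(graph)


def connected_components(graph):
    """Return the graph's connected components as sets of nodes, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)
```

The network is a single giant component exactly when `connected_components` returns one set containing every node.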


Editor-article coauthorship network across all 2014 revisions


The analysis above pointed out that editors’ contributions to these articles are extremely unequal: Gini coefficients are near 0.8 for many articles, which reflects an incredible level of centralization. Below is a list of editor-article pairs with the greatest number of revisions. Keep in mind that Wikipedia editors follow the 80-20 rule, with the vast majority of editors making fewer than 10 contributions and the top editors making hundreds of thousands of contributions. This level of activity on a single article within the course of a single year reflects a remarkable investment of effort, which may become problematic if these editors believe they “own” an article and act to sideline others’ contributions.

Index | Article | User | Revisions
0 | Islamic State of Iraq and the Levant | P-123 | 2385
1 | Ebola virus epidemic in West Africa | BrianGroen | 1374
2 | 2014 Hong Kong protests | Signedzzz | 893
3 | Malaysia Airlines Flight 370 | Ohconfucius | 773
4 | Ebola virus epidemic in West Africa | Gandydancer | 709
5 | 2014 Hong Kong protests | Ohconfucius | 668
6 | Gamergate controversy | Ryulong | 565
7 | Indian general election, 2014 | Lihaas | 505
8 | Gamergate controversy | NorthBySouthBaranof | 496
9 | Scottish independence referendum, 2014 | Jmorrison230582 | 473

But this looks at all the editing activity across the entire year — what about focusing on editing happening around the event itself? Are these collaborations just as cohesive and indirectly connected in the 24 hours preceding and 24 hours following the day with the peak amount of pageview activity in the year? Very much yes: these collaborations are still connected together by a cadre of editors who respond to breaking news events within days, or even hours. Previous work I’ve done has shown that these giant components emerge within hours as “ambulance chasing” editors show up to coordinate work on current events articles. The “aftermath network” below has 2158 nodes and 2508 connections in contrast to the 6171 nodes and 7420 edges in the complete collaboration network from above, so it’s smaller but still indirectly tied together through editors brokering between multiple article collaborations.


Editor-article coauthorship network for revisions in 72 hour period around article’s peak pageview activity.


Some of these highly-connected users in the center are actually automated scripts called bots that specialize in copyediting or vandalism fighting, but these aren’t the only multi-article editors either. It’s also not the case that every editor is contributing to every article in this corpus. I can also plot the similarity between articles based on the overlaps in editors who show up to edit them. The figure below has two other contingency tables that plot the fraction of overlapping editors between each pair of articles for the complete (left) and aftermath (right) coauthorship graphs. The y-axis is ranked by articles that show the largest average overlap with MA Flight 370 having more of its editors overlapping on other articles and the US elections article having negligible overlap. The two MA flights show nearly 25% of their editor cohorts overlap, with some sizable overlaps with the Sewol sinking.

But comparing these similarities to the aftermath authorship, a different story emerges. Take the two MA flight articles, for example. Around 25% of the MA17 editors overlapped with the MA370 editors by the end of the year, but 48 hours after MA17 was shot down, there was only around 10% overlap in their respective populations of ambulance chasers. This suggests the migration of editors from one article to another may actually be relatively slow as the emergence of overlaps between these articles happens outside of this aftermath window. Rather the overlaps in the aftermath window seem to be partially attributable to the coincidence of these events within weeks of each other: ambulance chasing editors in March are still chasing ambulances in April, but not in September as they move onto other topics or stop editing Wikipedia.
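The overlap comparison can be sketched with plain sets. The exact definition behind the figure is not spelled out in this post, so this sketch assumes a directional measure: the share of one article's editors who also edited the other article. The function names are hypothetical.

```python
def editor_overlap(editors_a, editors_b):
    """Share of article A's editors who also edited article B (assumed definition)."""
    a, b = set(editors_a), set(editors_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def overlap_table(editors_by_article):
    """Directional overlap for every ordered pair of distinct articles."""
    return {
        (first, second): editor_overlap(editors_by_article[first], editors_by_article[second])
        for first in editors_by_article
        for second in editors_by_article
        if first != second
    }
```

Running the same function once on editors from the whole year and once on editors from the window around each article's peak gives the two tables being compared.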


Overlap in articles’ editors for all 2014 revisions (left) and 72-hour peak pageview window (right)



What were the top news stories in 2014 as judged by Wikipedia editors and readers? Looking across multiple languages as well as a variety of metrics of information production and consumption, no single topic stands out but a few consistently appear at the top. The World Cup, Malaysia Airlines flights, Ukrainian crises, and ISIS all had wide coverage and high levels of activity across languages, dominating other major stories like the West Africa Ebola outbreak, Israel-Gaza conflict, or protests in Hong Kong. However, there was substantial variation in the editorial attention to these topics as well as the existence of articles about these topics across many languages. But topics like the World Cup and Ukrainian crises had several related articles in the lead, which leads me to believe these were the two biggest stories of 2014. Focusing on the English Wikipedia, the production of information was highly concentrated in time and among editors, but the growth of these breaking news articles happened well after peak attention to them. Despite the variability in attention to these articles over the year, Wikipedia appears to be able to mobilize editors to make revisions at a rate that scales with the demand for information, and the editors writing these articles have substantial overlap with one another.

The question of “what is news?” has consumed professional newspaper editors and journalism scholars for decades. One attempt at an answer is a framework called “news values” that can be traced back to a 1965 paper by Johan Galtung and Mari Ruge that explored how 12 factors influence whether foreign events are picked up by domestic news sources. Galtung and Ruge argue that events most likely to become news are (1) more immediate (frequency), (2) more intense (threshold), (3) less ambiguous (equivocality), (4) more relatable to the audience (meaningfulness), (5) more foreshadow-y (consonance), (6) more rare (unexpectedness), (7) less temporary (continuity), (8) more related to other issues (composition), (9 & 10) more oriented to the powerful (elite reference), (11) more person-level, and (12) more negative. Other journalism scholars have since followed up on this study and offered alternative factors like entertainment. I’ve also written about how Galtung and Ruge’s news values map onto the Wikipedia policies governing notability and what kinds of events can or cannot have articles. I leave it as an exercise to the reader to map these values onto the top stories this year.

There are many ways by which the most “important” news of this year can be measured using behavioral data from Wikipedia editors and readers. The top stories identified by the various methods above vary in how they map to these various news features: the Malaysia Airlines flights had immediate consequences, intense loss of life, were very unexpected, and negative. MA17 brought the on-going Ukrainian conflict back into the news as well as invoking the loss of elite AIDS researchers, but MA370’s location and cause of disappearance remain highly uncertain. Many of these topics may be “pseudo-events” with little lasting historical impact: the 2011 Royal Wedding was one of the biggest stories using the zeitgeist rankings, but has marginal enduring importance. It’s also important to remember that Wikipedia’s coverage of these topics is filtered through the existing news values and gatekeeping processes, as editors have to source their content to coverage from traditional news organizations.

This analysis is far from the final word on the subject of Wikipedia and current events. There are alternative identification strategies that could be employed to select news articles: the Current Events portal, In the News template, 2014 category, and pageview dumps each would uncover alternative news topics. I also haven’t performed any content-level analyses of the kinds of content, topics, or other textual features that changed in these articles over time. News events involving topics that already have existing articles also provide a nice source of exogenous variation for natural experiments around peer effects in attention, for example. Certainly the biases and differences in behavior between these countries cannot be definitively attributed to features of a particular national culture, only the peculiarities and biases of each respective Wikipedia community. Given the difficulties Wikipedia faces in recruiting and retaining new editors, these current events might also provide a template for understanding how to match novice and expert users to tasks. Finally, the correspondence between Wikipedians’ activity on articles and the major news stories of the year suggests that Wikipedia article behavior might have predictive validity for box office revenues, disease forecasting, and potentially many other topics. All without having to go begging organizations to share their data ;-)

My 15 Minutes of Fame as a B-List Gamergate Celebrity

Culture, Data, Politics

On Monday, October 27, Andy Baio posted an analysis of 72 hours of tweets with the #Gamergate hashtag. With the very best of intentions, he also shared the underlying data containing over 300,000 tweets saved as a CSV file. There are several technical and potential ethical problems with that, which I’ll get to later, but in a fit of “rules are for thee, not for me,” I grabbed this very valuable data while I could, knowing that it wouldn’t be up for long.

I did some preliminary data analysis and visualization of the retweet network using this data in my spare time over the next day. On Wednesday morning, October 29, I tweeted out a visualization of the network describing the features of the visualization and offering a preliminary interpretation, “intensely retweeting and following other pro-#gamergate is core to identity and practice. Anti-GG is focused on a few voices.” I intended this tweet as a criticism of pro-Gamergaters for communicating with each other inside an insular echo chamber, but it was accidentally ambiguous and it left room for other interpretations.

The tweet containing the image has since been retweeted and favorited more than 300 times. I also received dozens of responses ranging from benign questions about how to interpret the visualization, more potentially problematic questions about identifying users, and finally responses that veered into motivated and conspiratorially-flavored misreadings. Examples of the latter are below:

To be clear, I do not share these interpretations and I’ll argue that they are almost certainly incorrect (a good rule of thumb is to always back away slowly from anyone who says “data does not lie”). But I nevertheless feel responsible for injecting information having the veneer of objectivity into a highly charged situation. Baio mentioned in his post that he had a similar experience in posting results to Twitter before writing them up in more detail. A complex visualization like this is obviously ripe for misinterpretation in a polarized context like Gamergate, and I wasn’t nearly clear enough in describing the methods I used or the limitations on the inferences that can be drawn from this approach. I apologize for prematurely releasing information without doing a fuller writeup about what went in and what you should take away.

So let’s get started.

Data collection

I will have to defer to Baio for the specific details on how he collected these original data. He said:

“So I wrote a little Python script with the Twython wrapper for the Twitter streaming API, and started capturing every single tweet that mentioned the #Gamergate and #NotYourShield hashtags from October 21–23.”

This data collection approach is standard, but has some very important limitations. First, this only looked at tweets containing the hashtags “#gamergate” and “#notyourshield.” These hashtags have largely been claimed by the “pro-gamergate” camp, but there are many other tweets on the topic of Gamergate under other partisan hashtags (e.g., “#StopGamerGate2014”) as well as tweets from people speaking on the topic but consciously not using the hashtag to avoid harassment. So the tweets in this sample are very biased towards a particular community and should not be interpreted as representative of the broader conversation. A second and related point is that these data do not include other tweets made by these users. On a topic like Gamergate, users are likely to be involved in many related and parallel conversations, so grabbing all these users’ timelines would ideally give a fuller account of the context of the conversation and other people involved in their mentions and replies.

A third point is that Baio’s data was saved as a comma-separated value (CSV) file, which is a common way of sharing data, but is a non-ideal way to share textual data. Reading the data back in, many observations end up being improperly formatted because errant commas and apostrophes in the tweets break fields prematurely. So much of the analysis involves checking to make sure the values of fields are properly formatted and tossing those entries that are improperly formatted. Out of 307,932 tweets, various stages of data cleanup will toss thousands of rows of data for being improperly formatted, depending on the kind of analysis I’m focusing on. While this was not a complete census of data to begin with, this is still problematic because the discarded rows are likely non-random: they are exactly the tweets containing some combination of mischief-causing commas and apostrophes, which is another important caveat. Please use formats like JSON (ideally) to share textual data like this in the future!
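To make the JSON suggestion concrete, here is a minimal sketch using only Python's standard library. The records and field names are made up for illustration and are not taken from Baio's file. It writes one JSON object per line and shows that commas, apostrophes, quotes, and even newlines inside the tweet text survive a round trip, whereas a naive comma split produces extra fields.

```python
import json

# Hypothetical tweet records; the field names are illustrative assumptions
tweets = [
    {"user": "example_user", "text": "Commas, apostrophes' and \"quotes\"\nall survive"},
    {"user": "another_user", "text": "plain text"},
]

# One JSON object per line; json.dumps escapes newlines inside the text
serialized = "\n".join(json.dumps(t) for t in tweets)

# Reading the lines back recovers every field exactly
restored = [json.loads(line) for line in serialized.splitlines()]
assert restored == tweets

# A naive comma split, by contrast, breaks one tweet into extra fields
naive_fields = "example_user,Commas, apostrophes' and more".split(",")
print(len(naive_fields))  # 3 fields instead of the intended 2
```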

To review, besides only being a three-day window, this dataset doesn’t include other Gamergate-related conversations occurring outside of the included hashtags, ignores participating users’ contextual tweets during this timeframe, and throws out data for tweets containing particular grammatical features. With these important caveats now made explicit, let’s proceed with the analysis.

Data analysis

Baio worked with the very awesome Gilad Lotan to do some network analysis of the follower network. I wanted to do something along similar lines, but looking at the user-to-user retweet network to understand how messages are disseminated within the community. For our purposes of looking at the structure of information sharing in GamerGate, we can turn to some really interesting prior scholarship that’s looked at how retweet networks can be used to understand political polarization [1,2] and what are the factors that influence people to retweet others [3,4]. Their work does far more in-depth analyses and modeling than I’ll be able to replicate for a blog post in the current time frame, but I wanted to highlight a few. boyd and her coauthors [3] identify a list of uses for retweeting, including amplifying information to new audiences, entertaining a specific audience, making one’s presence as a listener known to the author, or to otherwise agree, validate, or demonstrate loyalty. These are obviously not exhaustive of all the uses of the retweet, but they can help frame the goals users have in mind when retweeting.

Using additional metadata in the file about the number of statuses and followers a user has around the time of each tweet, I can create additional variables. One measure is “tweet_delta”, or the difference between the maximum and minimum observed value for a user’s “user_statuses_count” field recorded for each of their tweets. This ideally captures how many total tweets the user made during the window, whether or not they appear in the dataset. A second related variable, “tweet_intensity”, is the ratio of tweets in the (cleaned) dataset to the tweet_delta. This value should range between 0 (none of the tweets this user made over this timespan contain #Gamergate/#NotYourShield) and 1 (all of the tweets this user made over this timespan contain #Gamergate/#NotYourShield).

A third measure is “friend_delta”, or the difference between the maximum and minimum observed value for the number of other users that a given user follows. Like the “tweet_delta” above, this is computed from how many friends (I prefer the term “followees”, but “friends” is the official Twitter term) a user has at the time of each tweet. A similar measure can be defined for followers. Since you have less control over who or how many people follow you, friends/followees is a better metric for measuring changes in an individual’s behavior like actively seeking out information by creating new followee links. This value varies from 0 (no change in followees) to n (where n is the largest change observed over these 3 days).
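The three measures described above can be sketched in plain Python as follows. This is not the notebook code: the `user` and `user_friends_count` column names and the sample rows are assumptions, and only `user_statuses_count` is named in the text.

```python
from collections import defaultdict

# Hypothetical rows, one per observed tweet; values are made up
rows = [
    {"user": "a", "user_statuses_count": 100, "user_friends_count": 50},
    {"user": "a", "user_statuses_count": 104, "user_friends_count": 53},
    {"user": "a", "user_statuses_count": 110, "user_friends_count": 52},
    {"user": "b", "user_statuses_count": 20, "user_friends_count": 10},
]


def user_measures(rows):
    """Compute tweet_delta, tweet_intensity and friend_delta per user."""
    statuses = defaultdict(list)
    friends = defaultdict(list)
    for row in rows:
        statuses[row["user"]].append(row["user_statuses_count"])
        friends[row["user"]].append(row["user_friends_count"])

    measures = {}
    for user, counts in statuses.items():
        tweet_delta = max(counts) - min(counts)
        observed = len(counts)  # tweets by this user in the cleaned data
        measures[user] = {
            "tweet_delta": tweet_delta,
            # Undefined when the status count never changes (e.g. one tweet)
            "tweet_intensity": observed / tweet_delta if tweet_delta else None,
            "friend_delta": max(friends[user]) - min(friends[user]),
        }
    return measures


measures = user_measures(rows)
print(measures["a"])
# {'tweet_delta': 10, 'tweet_intensity': 0.3, 'friend_delta': 3}
```

One caveat with this simple version: because the delta only spans a user's first and last observed tweets, the ratio can drift above 1 for users with very few observations. How the original analysis handled that edge case is not stated.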

I wrote the data cleanup and analysis in an IPython Notebook using the pandas, NetworkX, and seaborn libraries and visualized the data using Gephi. While I have posted the code to replicate these analyses to my GitHub for others to inspect or use on other data they’ve collected, I’ve decided not to share the data itself owing to very real concerns I have about how it might be used to target individuals for harassment, in addition to secondary concerns about how far Twitter’s terms of use extend to secondary data and the pragmatic fact that the files are larger than GitHub is willing to host.

The retweet graph visualization

The basic network relationship I captured was whether User A retweeted User B. This network has directed (A retweeting B is distinct from B retweeting A) and weighted (A could retweet many of B’s tweets) connections. Again, to be explicit, the large colored circles are users and the colored lines connecting them indicate whether one retweeted the other (read an edge “A points to B” in the clockwise direction). This is not every retweet relationship in the data, but only those nodes belonging to retweet relationships where A retweeted B at least twice. This has the effect of throwing out even more data and structural information (so inferences about the relative size of clusters should reflect that single instances of retweeting have been discarded), but reveals the core patterns. This is an extremely coarse-grained approach and there are smarter ways to highlight the more important links in complex networks, but this is cheap and easy to do.
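The thresholding step can be sketched in a few lines. This illustrates the idea rather than reproducing the notebook code: the `(retweeter, original_author)` pairs are made up, and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical (retweeter, original_author) pairs, one per retweet
retweets = [
    ("a", "b"), ("a", "b"),
    ("b", "a"),
    ("c", "b"), ("c", "b"), ("c", "b"),
    ("d", "a"),
]


def retweet_edges(pairs, min_weight=2):
    """Directed, weighted edges; keep only those with weight >= min_weight."""
    weights = Counter(pairs)  # (A, B) -> number of times A retweeted B
    return {edge: w for edge, w in weights.items() if w >= min_weight}


edges = retweet_edges(retweets)
print(edges)  # {('a', 'b'): 2, ('c', 'b'): 3}
```

Note that user "d" disappears entirely: with a threshold of 2, anyone who only ever retweeted once drops out of the picture, which is the data loss the paragraph above warns about.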

The x and y coordinates don’t have any substantive meaning like in a scatterplot; instead I used the native ForceAtlas2 force-directed layout algorithm to position nodes relative to each other such that nodes with more similar patterns of connections are closer together. Making this look nice is more art than science and most of you can’t handle all my iterative layout heuristics jelly.

  • I’ve sized the nodes on “in-degree” such that users that are retweeted by more unique users are larger and users that are retweeted by fewer are smaller.
  • The color of the node corresponds to the “friend_delta” such that “hotter” colors like red and orange are larger changes in users followed and “cooler” colors like blue and teal are 0 or small changes in users followed. Nodes are colored grey if there’s no metadata available for the user.
  • The color of the link corresponds to the “weight” of the relationship, or the number of times A retweeted B. Again hotter colors are more retweets of the user and cooler colors are fewer retweets within the observed data.
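For the node sizing in the first bullet, in-degree here means the number of distinct users who retweeted an account, regardless of how many times each did so. A small sketch, with a made-up edge dictionary in the `(retweeter, author) -> weight` shape assumed for illustration:

```python
from collections import defaultdict

# Hypothetical weighted edges: (retweeter, author) -> number of retweets
edges = {("a", "b"): 2, ("c", "b"): 3, ("b", "a"): 2}


def in_degree(edges):
    """Count the unique users pointing at each node, ignoring edge weight."""
    sources = defaultdict(set)
    for retweeter, author in edges:
        sources[author].add(retweeter)
    return {user: len(who) for user, who in sources.items()}


print(in_degree(edges))  # {'b': 2, 'a': 1}
```

Summing the weights instead would give the total number of retweets received, which is a different quantity from the one used for node size.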


Manual inspection of a few of the largest nodes in the larger cluster reveals that these are accounts that I would classify as “pro-Gamergate” while the largest nodes in the cluster in the lower left I would classify as “anti-Gamergate.” I didn’t look at every node’s tweet history or anything like that, so maybe there are some people on each side being implicated by retweet association. There were a lot of questions about who the large blue anti-GG node is. Taking him at his word as someone who would welcome being targeted, this is the “ChrisWarcraft” account belonging to Chris Kluwe, who tweeted out this (hilarious) widely-disseminated post on October 21, which is during the time window of our data.

Let’s return to my original (and insufficient) attempt at interpretation:

The technical term that network scientists like myself use for images like the one above is a “hairball,” and hairballs often offer more sizzle than steak in terms of substantive insights. Eyeballing a diagram is a pretty poor substitute for doing statistical modeling and qualitative coding of the data (much more on this in the next section). Looking at a single visualization of retweet relationships from three days of data on a pair of hashtags can’t tell you a lot about authoritarianism, astroturfing, or other complex issues that others were offering as interpretations. I don’t claim to have the one “right” answer, but let me try to offer a better interpretation.

  • The pro-GG sub-community is marked by high levels of activity across several dimensions. They retweet each other more intensively (larger in a network where all edges are at least 2 retweets). They are actively changing who they follow more than the anti-GG group (this would need an actual statistical test). It’s certainly the case that participants are highly distributed and decentralized, but as I discuss more below, it also suggests they’re highly insular and retweeting each other’s content is an important part of supporting each other and making sense of outside criticism by intensively sharing information.
  • I suspect the anti-GG sub-community is smaller not because there are fewer people opposed to GG, but that the data analysis and visualization choices Baio and I made included only those people using the hashtag and excluded people who only retweeted once. In other words, one shouldn’t argue there are more Republicans than Democrats by only looking at highly active #tcot users. Ignoring Kluwe’s post as an outlier, the anti-GG sub-community looks smaller but similarly dense.
  • There’s a remarkable absence of retweeting “dialogue” between the two camps, something that’s also seen in other online political topics. Out of the thousands of users in the pro-GG camp, only 2 appear to retweet Kluwe’s rant. So contra the “diversity” argument, there actually appears to be a profound lack of information being exchanged between these camps which suggests they’re both insular. But if #Gamergate is where a lot of the pro-GG discussion happens while anti-GG discussion happens across many other channels not captured here, we can’t say much about anti-GG’s size or structure but we can have more confidence about what pro-GG looks like.
  • The reaction among the pro-GG crowd to my visualization also gives me an unanticipated personal insight into the types of conversations that this image became attached to. The speed and extent to which the visualization spread, the kinds of interpretations it was used to support, and the conversations it sparked all suggested to me that there were many pro-Gamergaters looking for evidence to support their movement, denigrate critics, or delegitimize opponents. My first-hand experience observing these latter two points (the tweets above being a sample) lends further weight to many other critics’ arguments about these and other forms of harassment being part and parcel of tactics used by many pro-GGers.


If anything, I hope this exercise demonstrates that while visualization is an important part in the exploratory data analysis workflow, hairballs will rarely provide definitive conclusions. I already knew this, but as I said before, I should have known better. But to really drive the point home that you might fashion the same hairball visualization to support very different conclusions, here are some more hairballs below from the very same dataset using other kinds of relationships.

First off, here is the mention network where User A is linked to User B if User B’s account is mentioned in User A’s tweet. Now this is a classic hairball (Kluwe is again the isolated-ish green node in the upper left, for those of you keeping score at home). Links of weight 1 are black and higher weights range from cool to hot. Unlike the highly polarized retweet network, here we have an extremely densely-connected core of nodes. I’ve decided to color the nodes by a different attribute than above, specifically normalized degree difference. This is calculated as (out-degree – in-degree)/(out-degree + in-degree) and varies from -1 where a user receives only mentions but never mentions anyone else (bluer) to 1 where a user only makes mentions but is never mentioned by anyone else (redder). There’s really no discernible structure as far as I can tell, and anti-GG accounts are mixed in with pro-GG accounts and other accounts like Adobe and Gawker that have been caught up.
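The normalized degree difference formula is simple enough to write out directly. In this sketch the function name is hypothetical, and returning `None` for an account with no mentions in either direction is my assumption, since the formula is undefined there.

```python
def normalized_degree_difference(out_degree, in_degree):
    """(out - in) / (out + in): -1 = only mentioned, +1 = only mentions others."""
    total = out_degree + in_degree
    if total == 0:
        return None  # undefined for a node with no mentions at all
    return (out_degree - in_degree) / total


print(normalized_degree_difference(5, 0))  # 1.0  (talks at others, never mentioned)
print(normalized_degree_difference(0, 5))  # -1.0 (mentioned, never mentions anyone)
print(normalized_degree_difference(3, 1))  # 0.5
```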


But the node colors do tell us something about the nature of the conversation, namely, there are very many nodes that appear to be engaged in harassment (red nodes talking at others but not being responded to) and many nodes that are being targeted for harassment (blue nodes being talked at but not responding, like Feliciaday). Indeed, plotting this relationship out, the more tweets a user makes mentioning another account (x-axis), the lower their normalized degree difference (y-axis). I’ve fit a lowess line in red to clarify this relationship. In other words, we’re capturing one feature of harassment where more tweets mentioning other people buys you more responses from others up to about a dozen tweets, and then continuing to tweet mentioning other people results in fewer people mentioning you in return.


Second, here’s the multigraph containing the intersection of the retweet and mention networks. User A is linked to User B if A both retweeted and mentioned B within the dataset. Unlike the previous plots, I haven’t filtered the data to include only edges and nodes above weight 1, so there are more nodes and weaker links present. I’ve colored the nodes here by account_age, or the number of days the account existed before October 24. Bluer nodes are accounts created in recent weeks, redder nodes are accounts that have existed for years, and grey nodes are ones we have no data on. I’ve left the links as black rather than coloring by weight, but the edges are still weighted to reflect the sum of the number of mentions and number of retweets. This network shows a similarly polarized structure as the retweet network above. Manual inspection of nodes suggests the large, dense cluster of blue nodes in the upper-right is pro-GG and the less dense cluster of greener nodes in the lower-left is anti-GG. By overlapping the data in this way, we have another perspective on the structure of a highly-polarized conversation. The pro-GG camp looks larger in size, owing to the choice not to discard low-weight links, which suggests that anti-GG participation is not as intense and cohesive as the tightly-connected pro-GG camp, whose structure suggests more insularity.
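A minimal sketch of how such an intersection could be built, assuming both networks are stored as dictionaries mapping a directed `(A, B)` pair to a count. The data and names are illustrative, not the notebook's.

```python
# Hypothetical directed edge weights: (A, B) -> count
retweet_weights = {("a", "b"): 2, ("c", "b"): 1, ("d", "a"): 4}
mention_weights = {("a", "b"): 3, ("d", "a"): 1, ("b", "c"): 5}


def intersect_edges(retweets, mentions):
    """Keep pairs where A both retweeted and mentioned B; sum the two counts."""
    shared = retweets.keys() & mentions.keys()
    return {edge: retweets[edge] + mentions[edge] for edge in shared}


combined = intersect_edges(retweet_weights, mention_weights)
print(combined == {("a", "b"): 5, ("d", "a"): 5})  # True
```

Direction matters here: `("b", "c")` in the mention network does not match `("c", "b")` in the retweet network, so that pair is dropped.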


It’s also worth noting there are substantially more new accounts in the pro-GG camp than the anti-GG camp. We can examine whether there’s a relationship between the age of the account and the clustering coefficient. The clustering coefficient captures whether one’s friends are also friends with each other: the pro-GG camp appears to have more clustering and more new accounts, and the anti-GG camp appears to have less clustering and older accounts. The boxplots below bear this rough relationship out: as the clustering coefficient increases (the other users mentioned by a user also mention each other), the average age of these accounts goes down substantially. This also seems to lend more weight to the echo chamber effect: newly created accounts are talking within dense networks that veer towards pro-GG, while older accounts are talking within sparser networks that veer towards anti-GG.
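For readers unfamiliar with the clustering coefficient, here is a small standard-library sketch of the local version for an undirected graph: the fraction of pairs of a node's neighbors that are themselves connected. Treating the network as undirected is an assumption made for simplicity, since the post does not say which variant was computed, and the adjacency data is made up.

```python
from itertools import combinations

# Hypothetical undirected graph as symmetric adjacency sets
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}


def clustering_coefficient(adjacency, node):
    """Share of neighbor pairs of `node` that are linked to each other."""
    neighbors = adjacency[node] - {node}
    k = len(neighbors)
    if k < 2:
        return 0.0  # no pairs of neighbors to compare
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return linked / (k * (k - 1) / 2)


print(clustering_coefficient(adjacency, "b"))  # 1.0 (both neighbors know each other)
print(clustering_coefficient(adjacency, "d"))  # 0.0 (only one neighbor)
```

Node "a" has three neighbors but only one of the three possible pairs is connected, so its coefficient is one third.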


Third, here’s the hashtag network where User A (users are blue) is linked to Hashtag B (hashtags are red) if the user mentions the hashtag in the tweet. I’ve intentionally omitted the #Gamergate and #NotYourShield hashtags as one of these would show up in every tweet, so it’s redundant to include them. I’ve also focused only on the giant component, ignoring the thousands of other unconnected hashtags and users in the network. This graph is distinct from the others as it is a bipartite graph containing two types of nodes (hashtags and users) instead of one type of node (users, in the previous graphs). This graph is also weighted by the number of times the user mentions a hashtag (warmer = more). Some of the noticeable related hashtags are #gamer (top), #fullmcintosh (centerish), and #StopGamergate2014 (bottom right). Interestingly, many of these hashtags appear to be substrings of “gamergate” such as “gamerg”, “gamerga”, “gam”, etc., which is some combination of an artifact of Twitter clients shortening hashtags and improvisation among users to find related backchannels. But a number of anti-GG hashtags are present and connected here, suggesting the discussion isn’t as polarized as the RT graph would suggest. This likely reflects users including hashtags sarcastically, like a pro-GG user including #StopGamergate2014. There are also outwardly unrelated hashtags such as #ferguson, #tcot, and #ebola included.
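The two preprocessing choices described here (dropping the two collection hashtags and keeping only the giant component) can be sketched with a plain breadth-first search. The input pairs, the lowercasing of hashtags, and the function name are all assumptions for illustration.

```python
from collections import defaultdict, deque

EXCLUDED = {"gamergate", "notyourshield"}  # present in every tweet by design

# Hypothetical (user, hashtag) pairs extracted from tweets
pairs = [
    ("u1", "Gamergate"), ("u1", "gamer"),
    ("u2", "gamer"), ("u2", "StopGamergate2014"),
    ("u3", "ebola"),
]


def giant_component(user_hashtags):
    """Build the bipartite user-hashtag graph and return its largest component."""
    graph = defaultdict(set)
    for user, tag in user_hashtags:
        tag = tag.lower()  # assumption: treat hashtags case-insensitively
        if tag in EXCLUDED:
            continue
        u, h = ("user", user), ("hashtag", tag)  # tag node types to keep them apart
        graph[u].add(h)
        graph[h].add(u)

    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from this starting node
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


component = giant_component(pairs)
print(len(component))  # 4: u1, u2, #gamer and #stopgamergate2014
```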



Each of these new networks reveals alternative perspectives about the structure and cohesiveness of Gamergate supporters and opponents. I should have shared all these images from the start, but these latter three required a bit more work to put together over the past few days. My priors about it being an insular echo chamber are borne out by some evidence and not by other hairballs. Each side might divine meaning from these blobs of data to support their case, but in the absence of actual hypothesis testing, statistical modeling, and qualitative coding of data, it’s premature to draw any conclusions. I did some other exploratory data analysis that suggests features associated with being pro-GG, like highly-clustered networks and authoring many tweets mentioning other users, are associated with potentially harassing behavior like using newly-created accounts and getting few replies from others.

So where should we go from here? First and most obviously, I hope others are collecting data about how Gamergate has unfolded over a wider range of time than three days and a wider set of hashtags than #gamergate itself. I hope that literature around online political polarization, online mobilization of social movements, and the like is brought to bear on these data. I hope qualitative and quantitative methods are both used to understand how content and structure are interacting to diffuse ideas. I hope researchers are sensitive to the very real ethical issues of collecting data that can be used for targeting and harassment if it fell into the wrong hands. I hope my 15 minutes of fame as an unexpected B-list celebrity in the pro-GG community doesn’t invite ugly reprisals.

On Starting a New Job

Academics, Data

I am starting a new job in November. This is not a prank like last time. But before the grand reveal of where, first I’ll subject you to a lengthy blog post about my thoughts about the how and why. Hopefully this provides an additional perspective to the excellent posts by Lana Yarosh and Jason Yip on their experiences on the computer/information science academic job market. But those of you who know the rhythms of the academic job market are already realizing that (spoiler alert), I’m not starting a tenure-track faculty role. Instead, I’m going to spend the next few years being a data scientist. But I definitely promise not to be this guy:

This blog post is a mixture of how to get into data science as well as how to leave academia for industry. I want to be clear that this is not my farewell letter to academia, but rather advice to other PhDs (especially in the social sciences) who are considering going into industrial data science. This is the amalgamation of notes I’ve kept, thoughts I’ve restrained myself from tweeting, and lessons from innumerable pep-talks from close friends and family who have counseled me through this process. I hope my experience can clarify some of the fuzzy contours of a process that academia leaves you completely unprepared for. But fair warning, this is still a really, really, really long treatise. I’ve tried to make up for that with a liberal application of GIFs.

This story about a boy leaving a plum post-doc at a great lab on good terms for a non-tenure-track data science position is broken into four acts. The first, “The Big Decision”, is about the choice to pursue opportunities outside of the safe confines of academia. The second, “The Search”, is about my experience starting the search outside the Ivory Tower’s ladder. The third, “The Recruitment”, touches on some of the frustrations and anxieties I confronted through the process. And the final act, “The Little Decision”, is about my process of negotiating and choosing an offer.

At the outset, let me admit that I’m writing from a position of relative privilege as a network and computational social scientist who can pass as a “sexy” data scientist rather than, say, students of literature or biology, who will not be granted the same assumptions about the merit of their interests and applicability of their training. That said, if you spend 4+ years in graduate school without ever taking classes that demand general programming and/or data analysis skills, I unapologetically believe that your very real illiteracy has held you back from your potential as a scholar and citizen. That’s tough love, but as someone who only started to learn programming via Python in the fourth year of my PhD, I can say it can be remedied, often more quickly and easily than you’d believe. The rest of the world thinks this stuff is some arcane dark art, but I guarantee you’ll surprise yourself at how quickly you’ll be reading developer documentation, be able to ask and answer technical questions on StackOverflow, and ultimately be able to “pass” as the imposter that almost every other data scientist dabbling in these magicks feels like too.

Stage 1: The Big Decision

The anchor and the list

Why am I interested in data science if I want to end up in academia? In my case, I am anchored in time and space as my partner still has two years left in her degree program even at the end of my two-year post-doc contract. We both plan on moving when she is done, so it doesn’t make sense for me to find a tenure-track job here. Although graduating up to a soft-money, non-tenure-track “research assistant professor” role was possible to ride out the next two years, I wanted to use this second two-year window to branch out and try something new. In particular, I had never done an internship while in graduate school, nor had I had a “real” job in between undergraduate and graduate school, so I was curious about what life outside the asylum was like. 12 months into my post-doc, I began to scratch the itch of persistent recruiters and to think about what life in an industry research lab, corporate data science group, or start-up setting would be like. And while it is not Bismarck, North Dakota, it is nevertheless the case that the hub of the data science universe is not in Boston, Massachusetts (SF and NYC are). So the requirement to stay here altered the calculus for the kinds of jobs I could consider. But these formed the outlines of some of the things that I put on a pro-con list that I started (but should have actually written down) before the search. I think writing down the pros and cons before starting the search could be important, both in terms of documenting what originally motivated you as well as capturing how your thinking evolved. So write a pro-con list before starting.

Skilling up isn’t selling out

Going into this process, the whispers that followed my academic colleagues who had preceded me in going to industry rang loudly in my ears. “He was so talented.” “She had so many best papers.” You can be forgiven if you think these are eulogies for deceased colleagues. Academia has a strange path dependency where venturing off the farm means you can never return. Well not never, but rarely. But my time in grad school and as a post-doc was marked by spending lots of time being on the outside looking in at other people’s cool data. Instead of devising ever-more-clever ways to move already-collected data from someone else’s machine to my own machine, I decided to look to industry as an opportunity to “skill up” and develop technical competencies. I learn best by doing, but there is little reward for “doing” machine learning, map-reduce, or graph databases in the social sciences if they are not in service of a research question. Others have pointed out that going into industry also lets you work on products that are essential infrastructure in the information economy, gives you the flexibility to move between product areas, or provides greater work-life balance. But for me, I very much hope to bring a really interesting battery of tools, skills, and practices back into the academy in a few years’ time.

Stage 2: The Search

Strength of All The Ties

Mark Granovetter’s famous “strength of weak ties” theory was formulated specifically in the context of job searching: strong ties share the same information but weaker ties provide new information about opportunities. I naively hoped that spamming people I trusted would lead them to return an informal e-mail with the subject “Looking for job” from out of the blue. Some of these initial messages received enthusiastic responses within hours, others languished for weeks, and some were likely forgotten. But the point of casting these stones is you neither know how far their ripples will propagate, nor what surprises they might dislodge. Of course be judicious in to whom and how you reach out, but also discard the academic job market logic of thinking that the only jobs available are the ones with publicly-posted calls. In fact, spending your time applying through sites like won’t generate the leads you need. Oh, and come to conferences like CSCW and ICWSM where lots of amazing social and computer scientists from academia and industry get together!

Recruiters are important, but they’re not your friend

As I’m sure many others have experienced, I had recruiters contacting me seemingly within minutes of making my first GitHub commit. In addition to spamming colleagues, I also entertained some of the opportunities recruiters sent along. I admit that being a recruiter must be a terribly difficult job of trying to match insatiable demand for unicorns with the fallibility of actual human beings. Recruiters play an important role, not only in exposing you to a wider set of possibilities than your network might offer but also in steeling you for the realities of the search ahead. While recruiters are super enthusiastic, they aren’t a new friend trying to hook you up with a job as a favor: they get paid when you take an offer and they will pressure you to interview repeatedly and take any offer that comes along. This is often hard for academics who have come up through a system that demands deference to others’ agendas under the assumption they have your interests at heart as future advocates. Your days of delayed gratification are behind you. You will need to learn to assert yourself so their interests do not override your own and protect your time from distractions like peripherally-related opportunities. These skills, politely practiced on recruiters, will become even more valuable when you enter the negotiation stage later on.

Growing into the role

By virtue of even being awarded a PhD, you have accomplished something that makes you uniquely expert in the world. You’re going to be hired because your background includes some combination of “harder” technical skills in terms of using tools to perform analyses and “softer” integrative skills in terms of asking the right questions of complex data. But your interests and skills are surely extremely narrow and will need to expand considerably to fit into the nebulous boundaries of the “data scientist” who’s expected to have some combination of hacking, stats, and expertise. If you’re a social scientist like me, you’ve probably never taken machine learning and won’t know what k-means clustering means or why you should prune a decision tree. If you’re a computer scientist and have never taken a statistics course, you may not know how to interpret effects from a regression model or how to design a behavioral experiment.

The intuitions behind these aren’t particularly hard, but you will have to study and practice using these to be conversant in them during interviews. There is no shortage of blog posts on how to break into data science, but I would especially recommend Trey Causey’s. I also strongly recommend Doing Data Science and Data Science for Business as two exemplary introductory books that don’t get lost in the weeds of formalized math or conceptual abstraction. Again, resist the academic urge to understand why they work from first principles, but focus instead on how you can use them to tackle problems and develop heuristics for the kinds of problems they won’t work on. Part of growing into the data scientist box will involve quickly teaching yourself to do these analyses beyond what the introductory documentation and manuals say. And you should seek out interesting data sets in “pop culture” (think entertainment, sports, geography, etc.), document these analyses on a GitHub repo, participate in Kaggle competitions, or implement a website with some interactive features. After you tweet (surely you already tweet!), go post your analysis to /r/DataIsBeautiful, HackerNews, or DataTau.

Stage 3: The Recruitment

Don’t do coding interviews

I am not a computer scientist nor am I a software engineer, so the experience of the “coding interview” was foreign to me. If you’re not familiar with the exercise, after the initial phone screens with recruiters or managers, you’re put on the phone or in a room with an engineer and asked to program some basic function to “make sure you know how to code” or “see how you think through problems.” You will never ever actually need to implement a Fibonacci number generator or sorting algorithm in your actual job. But in a coding interview, you will need to be able to demonstrate you can implement a workable version from scratch within minutes in a closed-book environment while a stranger judges you. This is a terrible approach to recruitment that selects on personality types rather than technical competence. If I had to do it all over again, I would simply refuse to do them, and make that clear ahead of time. If there are concerns about your ability to code, have them do a code review on the analyses you’ve posted to a GitHub repo or Kaggle competition. If they want to see how you think through a data analysis problem, walk through a case study. If some domain expertise needs to be demonstrated, arrange for a “take home” assignment to return after 24 hours. After 4+ years in a PhD program, you’ve earned the privilege to be treated better than the humiliation exercises 20-year-old computer science majors are subjected to for software engineering internships.

Ignore the hangups of academia

I’ve alluded to this above, but it can’t be reinforced enough: industry is not academia. If you’ve made the choice to go into industry, you need to re-calibrate your dystopian comic strip mindset from “PhD” and prepare yourself for “Dilbert”. You’re entering into a new kind of relationship where success is measured by pursuing a very applied research program that will demand flexibility, scalability, and attention to details. You really need to be fundamentally honest with yourself about this: industry will not be academia by another means. Your manager will provide a different kind of mentorship that is likely both more and less hands-on than you’re used to. Getting things done fast matters more than worrying about novelty. The pace of work revolves around fiscal quarters rather than academic semesters. You will be compensated and promoted for implementing ideas that are “good” because they make or save money. On the flip side, you don’t need to commit to toiling away somewhere for 4+ years. Your work will be used by thousands or millions of people. There shouldn’t be a shortage of extremely motivated people who have amazing skills and ideas. There will be many fringe benefits that you won’t need “seniority” to take advantage of. You could gross more in your first year as an industrial data scientist than you did in all your years in graduate school, combined. You’re right to think many of these are deeply unfair, but don’t forgo the privileges that an industry role entitles you to out of deference to academic norms that you’ve internalized but that no longer apply. I admit to having many, if not all, of these hangups, but you need to confront and tame them lest they lead you to self-sabotage.

Serenity in the face of chaos

I’m probably not alone in having the tendency to conjure detailed future scenarios far before the prerequisite actions have remotely come to pass: “Wouldn’t it be great to live in X”, “I can’t wait to work on Y”, “Z is such a brilliant person.” But while you are enthusiastically pursuing several leads, things well outside of your control will shut them down. Higher-ups might surprise the rest of the group with re-organizations, the politics of hiring decisions might surprise a potential manager, etc. These aren’t a reflection on your performance in an interview, but the disappointment you’ll feel when something you wanted, and for which you were both qualified and recognized, is nevertheless “taken away” is still very real. So in addition to keeping the fantasies on ice, never stop pursuing other opportunities during the search, no matter how “sure” something feels. You’ll need a backup plan in the worst case and leverage in the best case, so keep other recruiting efforts going even after there are offers on the table.

Stage 4: The Little Decision

Negotiating offers: thar be bigints here

Make sure to take advantage of websites like Glassdoor and PayScale to get a sense for what others at the company or in similar roles earn. Does a median starting salary of $120k seem like an outrageous sum of money when you’d really be happy with just $90k when all your assistant professor friends at Big State U are just making $60k? It turns out your potential employer would also be happy to pay you less than the market rate! But that starting salary you negotiate becomes the base on which all your raises, bonuses, and future salary negotiations will be built. I made this “mistake”, but apparently the widely acknowledged advice is never to disclose your current salary or how much you expect to make. Remember, this is a business negotiation where they are hiring you to make them a lot of money, some (small) fraction of which they’ll return to you as compensation. If the transactional logic of shaking the most money out of a for-profit corporation that will unflinchingly lay you off in a heartbeat makes you uncomfortable, there are an increasing number of exciting data science opportunities in government and non-profit spaces too. But after an offer has been made, the worst thing an employer can say to your salary request for what seems like a really big number is “No”. Really, it doesn’t go on your permanent record or anything. Having other offers and using them as leverage in a negotiation is not dishonest, especially if you’re upfront throughout the process about looking at other roles (as you should be doing). There are lots of other excellent resources out there on negotiating offers, and never ever accept the first one given to you!
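
To make the "base for all your raises" point concrete, here is a rough back-of-the-envelope sketch in Python. The $90k and $120k figures come from the paragraph above; the 3% annual raise and the five-year horizon are purely illustrative assumptions of mine, and the function names are hypothetical. It ignores bonuses, promotions, and taxes.

```python
def salary_path(starting_salary, annual_raise=0.03, years=5):
    """Salary in each year, assuming the same percentage raise every year."""
    return [starting_salary * (1 + annual_raise) ** year for year in range(years)]


def cumulative_gap(low_start, high_start, annual_raise=0.03, years=5):
    """Total extra pay earned over `years` by starting at the higher salary."""
    low = salary_path(low_start, annual_raise, years)
    high = salary_path(high_start, annual_raise, years)
    return sum(high) - sum(low)


if __name__ == "__main__":
    # Assumed scenario: settle for $90k versus negotiate to $120k.
    gap = cumulative_gap(90_000, 120_000, annual_raise=0.03, years=5)
    print(f"Cumulative difference after 5 years: ${gap:,.0f}")
```

Because raises are percentages of the base, the $30k gap in year one grows every year rather than staying fixed, which is why the starting number matters so much.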

Equity and other fringe benefits

If you’re going into a start-up environment, issues around equity loom larger than salary, but it’s a complicated game. Remember again, if you’re a data scientist with a PhD, you’re worth more than an entry-level engineer and you should be asking along the lines of what other mid or senior-level engineers get: something between 10 and 50 “points” (0.1% – 0.5% of equity), which may vary substantially depending on the size and stage of the company. But always remember that your stake is likely to get diluted down as the company grows and the employee equity pool is usually last in line to cash out after the other investors and founders. A fraction of a percent doesn’t sound like a lot, but when $100 million exits aren’t rare, software developers aren’t buying big houses or starting non-profits by dutifully saving up their salaries. You should use sites like AngelList and Wealthfront to get a sense for what the going rate in an industry, role, etc. is. How you choose to balance the trade-off between more salary and more equity comes down to your tolerance for risk and your faith in the founders’ vision and other investors’ patience. And don’t forget to go back to that pro-con list at the start. If there are other fringe things you would like to keep doing like attending/submitting to relevant conferences (whether academic-focused like KDD, ICWSM, and CSCW, or industry-focused like Strata, UseR, PyCon), having time to consult for a non-profit, teaching at a local college, etc., negotiation is the best time to make those expectations clear.
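
The arithmetic behind "points" and dilution can be sketched in a few lines of Python. The 10 to 50 point range and the $100 million exit come from the paragraph above; the 20% dilution per funding round is an illustrative assumption of mine, and the function names are hypothetical. This is a deliberately naive calculation: it ignores liquidation preferences (the "last in line" problem), vesting, strike prices, and taxes, all of which can shrink the real number considerably.

```python
def points_to_fraction(points):
    """One 'point' is 0.01% of the company, so 10 points is 0.1%."""
    return points / 10_000


def diluted_fraction(fraction, round_dilutions):
    """Shrink an ownership fraction by the dilution taken in each funding round."""
    for dilution in round_dilutions:
        fraction *= 1 - dilution
    return fraction


def naive_payout(exit_value, points, round_dilutions=()):
    """Exit value times the (diluted) ownership fraction, with nothing else deducted."""
    return exit_value * diluted_fraction(points_to_fraction(points), round_dilutions)


if __name__ == "__main__":
    exit_value = 100_000_000  # the $100 million exit mentioned above
    for points in (10, 50):
        undiluted = naive_payout(exit_value, points)
        # Assumption: two later funding rounds, each diluting holders by 20%.
        diluted = naive_payout(exit_value, points, [0.20, 0.20])
        print(f"{points} points: ${undiluted:,.0f} undiluted, ${diluted:,.0f} after dilution")
```

Even under these generous simplifications, two rounds of dilution cut the stake by roughly a third, which is worth keeping in mind when weighing equity against salary.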

The Reprisal of the Pro and Con list

Now you have some offers on the table and the pro-con list you wrote up before starting recruitment. Like me, your thinking probably changed a lot going through the process. You will have gone through many reality-distortion fields, drunk a good amount of Kool-Aid, and probably seen some sausage-making throughout the process. This will lead to you coming up with cons you hope to never have to confront again and pros you never dreamed of. Like any good little Bayesian, you should use this new information to update your prior beliefs to come to a better decision. In my case, my priorities started off with wanting privileged access to data, working in an industrial/corporate setting, and developing new analytical skills. After the process and talking with friends and colleagues, I realized that many of these still applied, but I had overlooked how important being able to engage in the academic conversations was to me, especially if I wanted to stay competitive on the academic market for the medium term. But I can’t stress enough how important it is to have some sort of objective list of criteria that you write down or store in other people’s minds so that you can ground yourself on these values during a very exciting but disorienting search.

The Grand Reveal

With all of that bluster out of the way, I’m very excited to announce that I will be joining the Harvard Business School as a research associate in November. This is not a tenure-track job but I will become one of the first data scientists on their HBX platform, which is their unique MOOC initiative focused on business education. I will be doing a mixture of both platform and academic research to understand the factors that contribute to learning and success in these contexts using both observational and experimental data.

I realize that the bloom is very much off the flower after some very public failures and very justified criticism in the MOOC space. But I also think there are important niches these can fill, even if they can’t and shouldn’t supplant other modes of education. I believe that HBX has identified a really interesting niche and strategy as well as made a big commitment in people and resources, so I’m excited to dig into where and how these approaches are succeeding or faltering. I’ve also been told Harvard employs a number of smart people and has something of a soapbox from which to publicize information. Seriously, I’m thrilled to be at the intersection of traditional business strategy and education, data-driven decision making, and collaborating with brilliant HBS faculty like Bharat Anand and HarvardX colleagues like Justin Reich.

“But Brian, you just spent a billion words talking a big game about industrial data science — what gives?” You’re right, I still don’t have any full-time experience working in industrial data science. You’re also right that I’m still in academia. Going back to my pro and con list — which is going to be different for every person — this role gave me an ideal mix between academia and industry: it is focused on research on learning at scale but it’s working on a product with very real customers and competitors. If I was looking for a longer-term career change, was more willing to relocate, had different skills and interests, or wasn’t solving a two-body problem, I would have made very different decisions.

Returning to the question of why write up something I’m not actually doing, I wanted to share my perspective of “how I almost went into industry” after four months of interviews with nearly a dozen different companies. Very little in academia prepares you to go on a market like this, but getting social scientists into data science roles is vital to ensure the right questions are being asked and the best inferences are being made from many types of data. The recruitment process will be upsetting and disorienting and the episodes from above may or may not resonate among those actually in industry. And I hope others will share their stories. But I wanted to especially target those of you who are in academia and considering making the jump: you’re not alone and you should go for it.

Thanks for making it this far and feel free to get in touch if you have any questions. And many thanks to Alan, Lauren, Michael, Patti, Trey, and Ricarose for super valuable feedback on earlier versions of this post!