HGSE Guest Lecture

Many thanks to Professor Karen Brennan for inviting me to speak to her class “Teacher Learning and Technology” about “Data.” I’ve posted the very brief slides I used here. Certainly nothing new under the sun, but hopefully above the median level of discourse about “big data.”

The ensuing discussion with the students was excellent and we covered a broad range of topics such as surveillance and privacy, mixed methods research, reducing the learning curve for analysis tools, and the importance of formulating research questions.


Does Wikipedia editing activity forecast Oscar wins?

The Academy Awards just concluded and much will be said about Ellen DeGeneres’s most-retweeted tweet (my coauthors and I have posted an analysis here that shows these “shared media” or “livetweeting” events disproportionately award attention to already elite users on Twitter). I thought I’d use the time to try to debug some code I’m using to retrieve editing activity information from Wikipedia.

A naive but simple theory I wanted to test was whether editing activity could reliably forecast Oscar wins. Academy Awards are selected from approximately 6,000 ballots and the process is known for intensive lobbying campaigns to sway voters as well as tapping into the zeitgeist about larger social and cultural issues.

I assume that some of this lobbying and zeitgeist activity would manifest in the aggregate in edits to the English Wikipedia articles about the nominees. In particular, I measure two quantities: (1) the number of changes (revisions) made to the article and (2) the number of new editors making revisions. The hypothesis is simply that articles about nominees with the most revisions and the most new editors should win. I look specifically at the time between the announcement of the nominees in early January and March 1 (an arbitrary cutoff).
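The two quantities can be sketched in a few lines of Python. This is only an illustration, not the code I actually ran: the function name `activity_in_window`, the toy revision history, and the January 1 window start are all hypothetical, and I am assuming a “new editor” means an account whose first-ever edit to that article falls inside the window.

```python
from datetime import datetime

def activity_in_window(revisions, start, end):
    """Count revisions and first-time editors for one article in [start, end).

    `revisions` is a list of (timestamp, username) pairs covering the
    article's full history (hypothetical input format).
    """
    # Record each editor's earliest edit to this article.
    first_edit = {}
    for ts, user in sorted(revisions):
        first_edit.setdefault(user, ts)

    n_revisions = sum(1 for ts, _ in revisions if start <= ts < end)
    # Assumption: "new" means the editor's first edit falls inside the window.
    n_new_editors = sum(1 for ts in first_edit.values() if start <= ts < end)
    return n_revisions, n_new_editors

# Made-up revision history for a single article.
history = [
    (datetime(2013, 11, 2), "Alice"),
    (datetime(2014, 1, 20), "Alice"),
    (datetime(2014, 1, 21), "Bob"),
    (datetime(2014, 2, 3), "Bob"),
    (datetime(2014, 2, 10), "Carol"),
]

counts = activity_in_window(history, datetime(2014, 1, 1), datetime(2014, 3, 1))
print(counts)  # (4, 2): four revisions in the window, two first-time editors
```

Alice edited before the window opened, so she contributes a revision but not a new editor.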

I’ve only run the analysis on the nominees for Best Picture, Best Director, Best Actor, Best Actress, and Best Supporting Actress (the nominees for Best Supporting Actor were throwing some unusual errors, but I’ll update). The results below show that Wikipedia editing activity forecast the wins in Best Actor, Best Actress, and Best Supporting Actress, but did not do so for Best Picture or Best Director. This is certainly better than chance and I look forward to expanding the analysis to other categories and prior years.
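As a rough check on “better than chance,” here is a back-of-the-envelope sketch. It assumes a naive baseline that guesses uniformly at random within each category, and it assumes nine Best Picture nominees and five nominees in each of the other four categories; the helper `prob_at_least` is a hypothetical name of my own.

```python
from itertools import product

def prob_at_least(k, hit_probs):
    """Probability of k or more correct picks among independent guesses."""
    total = 0.0
    # Enumerate every hit/miss pattern across the categories.
    for outcome in product([0, 1], repeat=len(hit_probs)):
        if sum(outcome) >= k:
            p = 1.0
            for hit, q in zip(outcome, hit_probs):
                p *= q if hit else 1 - q
            total += p
    return total

# Assumed chance of a random correct pick per category:
# Picture (1 of 9), then Director, Actor, Actress, Supporting Actress (1 of 5).
hit_probs = [1 / 9, 1 / 5, 1 / 5, 1 / 5, 1 / 5]
p_chance = prob_at_least(3, hit_probs)
print(round(p_chance, 3))  # 0.044
```

Under these assumptions, getting three or more of the five categories right by luck would happen roughly 4% of the time, though five categories is a very small sample.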

Best Picture

“The Wolf of Wall Street” showed remarkably strong growth in the number of new edits and editors after January 1. However, “12 Years a Slave,” which ranked 5th by the end, actually won the award. A big miss.

[Figures: new editors and new edits for the Best Picture nominees]

Best Director

The Wikipedia activity here showed strong growth for Steve McQueen (“12 Years a Slave”), but Alfonso Cuarón (“Gravity”) took the award despite coming in 4th in both metrics. Another big miss.

[Figures: new editors and new edits for the Best Director nominees]

Best Actor

The Wikipedia activity for new edits and new editors is highly correlated because new editors necessarily show up as new edits. However, we see an interesting and very close race here between Chiwetel Ejiofor (“12 Years a Slave”) and Matthew McConaughey (“Dallas Buyers Club”) for edits, with McConaughey holding a stronger lead among new editors. This suggests older editors were responsible for pushing Ejiofor higher (and he was leading early on), but McConaughey took the lead and ultimately won. Wikipedia won this one.


Best Actress

Poor Judi Dench, she appeared to not even be in the running in either metric. Wikipedia activity forecast a Cate Blanchett (“Blue Jasmine”) win, although this appeared to be close among several candidates if the construct is to be believed. Wikipedia won this one.

[Figures: new editors and new edits for the Best Actress nominees]

Best Supporting Actress

Lupita Nyong’o (“12 Years a Slave”) accumulated a huge lead over the other nominees by Wikipedia activity and won the award.



Other Categories and Future Work

I wasn’t able to run the analysis for Supporting Actor because the Wikipedia API seemed to poop out on Bradley Cooper queries, but it may be a deeper bug in my code too. This analysis can certainly be extended to the “non-marquee” nominee categories as well, but I didn’t feel like typing that much.

I will extend this analysis to both other categories and prior years’ awards to see if there are any discernible patterns for forecasting. There may be considerable variance between categories in the reliability of this simple heuristic: Director and Picture may be more politicized than the rest, if I wanted to defend my simplistic model. This type of approach might also be used to compare different awards shows to see if some diverge more than others from aggregate Wikipedia preferences. The hypothesis here is a simple descriptive heuristic, and more extensive statistical models that incorporate features such as revenue, critics’ scores, and nominees’ award histories (“momentum”) may produce more reliable results as well.


Wikipedia editing activity over the two months leading up to the 2014 Academy Awards accurately forecast the winners of the Best Actor, Best Actress, and Best Supporting Actress categories but significantly missed the winners of the Best Picture and Best Director categories. These results suggest that differences in editing behavior may, in some cases, reflect collective attention to and aggregate preferences for some nominees over others. Because Wikipedia is a major clearinghouse for individuals who both seek and shape popular perceptions, these behavioral traces may have significant implications for forecasting other types of popular preference aggregation such as elections.

Boston Marathon bombing

The Boston Marathon bombing occurred less than a mile from where I work and only 300 feet away from my first apartment after graduating from college. Fortunately, all my kith and kin are safe and well. The proximity and severity of this event have motivated me to expand a bit upon the prior analysis I’ve done of other current events on Wikipedia to examine a new type of data: pageviews. The Wikimedia Foundation makes data available about the number of times an article was requested every hour going back to late 2007. For most purposes, these data can be aggregated at the daily level and accessed via another service made by User:Henrik. It is important to keep in mind that pageviews are requests, not necessarily unique viewers, although the two are obviously highly correlated.

A characteristic feature of the pageviews around a Wikipedia article for a current news event is a peak followed by a decay. Here is an example from May 2011 about Osama bin Laden. On May 1, there were 7,557 views on the article; after the announcement of his death late on May 1, the article was viewed more than 4.5 million times on May 2, and by May 31, daily pageviews had dropped back to 23,795. (NOTE: The dates in the chart appear to be off by a day since the announcement happened late EDT on May 1 when it was May 2 in UTC.) The vast majority of the 7,557 views on May 1 occurred without any knowledge of his death and are similar to the pageview activity over the entire month of April (between ~6,000 and ~11,000 per day). The magnitude of the “burst” of pageview activity on May 2 is obviously indicative of a major event that drove many people to seek information about bin Laden in a narrow period of time.

[Figure: daily pageviews for the “Osama bin Laden” article, May 2011]

Similar patterns of pageview bursts are also found on articles related to bin Laden such as “Abbottabad” or “United States Naval Special Warfare Development Group” which are clearly related to the events of that day. Other articles such as “Saudi royal family” also exhibited characteristic bursts of activity around May 2 while articles such as “Bill Clinton” had no characteristic burst. This suggests that some pages were more related to the events of that day because they received similar types of intense attention versus other articles. In other words, the size of the pageview activity burst for an article on the day reflects users suddenly seeking information about a current news event.

Turning our attention away from this Osama bin Laden example and back to the Boston Marathon bombings, the bursts of pageview activity on a set of articles could reveal information about the event itself. Using the “Boston Marathon bombing” article as a seed, I extracted the 140 other articles the bombing article linked to. Of course, the text of this article is highly unstable and some of these links are likely to come and go. Nevertheless, I will use this list of 140 other articles to examine which received the largest bursts of activity. To quantify the magnitude of the pageview bursts across these articles, I simply took the median number of daily pageviews for each article over the 6 week period from March 1 through April 14 as a baseline. Then I took the maximum number of pageviews on either April 15 or April 16 (the most recent dates available). The ratio of these pageviews (maximum during the days following the event divided by the median over the days preceding the event) gives us some idea of which articles saw the greatest increases in pageview activity.

  1. Ground stop: 329.0
  2. Boylston Street: 268.73
  3. Google Person Finder: 237.22
  4. Patriots’ Day: 201.32
  5. Copley Square: 171.53
  6. Controlled explosion: 168.25
  7. The Lenox Hotel: 116.0
  8. Pressure cooker: 83.98
  9. Massachusetts Emergency Management Agency: 83.5
  10. Boston Police Special Operations Unit: 78.43
  11. BB (ammunition): 59.41

This list excludes a number of articles like “Edward F. Davis” and “Pressure cooker bomb” that did not exist before April 15. However, the size of the bursts of pageview activity on Wikipedia articles (linked from the bombing article itself) conveys a surprising amount of information about the more salient details of the location, timing, cause, and effects of this story.
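The burst ratio described above (maximum pageviews after the event divided by the median before it) could be sketched like this. The article names and pageview counts are made up for illustration, and `burst_ratio` is a hypothetical helper rather than the code behind the list.

```python
from statistics import median

def burst_ratio(baseline_views, event_views):
    """Max daily views after the event divided by median daily views before it."""
    base = median(baseline_views)
    if base == 0:
        return None  # ratio is undefined for articles with no baseline traffic
    return max(event_views) / base

# Made-up daily pageviews: a rarely visited article and a popular one.
views = {
    "Article A": {"baseline": [10, 12, 8, 11, 9], "event": [900, 3290]},
    "Article B": {"baseline": [5000, 5200, 4900, 5100, 5050], "event": [5600, 6060]},
}

ratios = {
    title: burst_ratio(v["baseline"], v["event"]) for title, v in views.items()
}
# Rank articles from largest to smallest burst.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
for title, ratio in ranked:
    print(f"{title}: {ratio:.2f}")
```

Using the median rather than the mean for the baseline keeps a single odd day in March from distorting the denominator. It also means rarely visited articles can post enormous ratios from modest absolute traffic.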

Each of these articles’ time series of pageview data from March 1 through April 17 can be correlated with the others. For example, the correlation between pageviews for “Ball (bearing)” and “Brigham and Women’s Hospital” is 0.99, which strongly suggests that traffic to the two articles rises and falls together. Conversely, the correlation between “Ball (bearing)” and “USA Today” is only 0.13, suggesting the viewing activity for the two articles is generally unrelated. These correlations can be computed for every pair of articles to establish the relationship between their pageview activity. Thresholding these correlations at the 0.5 level, the resulting relationships can be represented as a correlation network. Here is the network below:


This image also tells a variety of stories despite the hairballness of the network. There are two distinct clusters of nodes. The bluish cluster corresponds to articles highly correlated with each other as they deal with topics pertaining to the bombing itself. These are the infrequently trafficked articles that all of a sudden attracted attention together because of the bombings. The greenish cluster on the lower right contains articles that are linked from the bombing article and aren’t tightly correlated with the bombing topics but are correlated with each other. These articles are more frequently trafficked, less closely related to the events themselves, and pertain to major social institutions like newspapers, government agencies, and financial markets. Their clustering together suggests that, despite being only loosely related to the bombing itself, they nevertheless remain closely related to each other over time. Thus, this network suggests at least two distinct patterns of on-going Wikipedia use: abrupt information seeking about topics that are suddenly in the news versus on-going information seeking about institutions that are regularly in the news.
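For the curious, the construction of a correlation network like this one can be sketched in plain Python. This is a minimal illustration under my own assumptions: the series are toy data, `pearson` and `correlation_network` are hypothetical helper names, and I am assuming an edge is kept when the Pearson correlation itself (not its absolute value) is at least 0.5.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a flat series as uncorrelated with everything
    return cov / sqrt(var_x * var_y)

def correlation_network(series, threshold=0.5):
    """Return (article, article, r) edges for pairs with r >= threshold."""
    edges = []
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges

# Toy daily pageviews: A and B spike on the same day, C does not.
series = {
    "Article A": [1, 2, 50, 3],
    "Article B": [2, 3, 80, 4],
    "Article C": [9, 7, 8, 9],
}

edges = correlation_network(series)
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r:.3f}")
```

Only the pair that spikes together survives the threshold here. The resulting edge list can be handed to any graph layout tool to draw the clusters.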

As always, this is simply a first cut of the analysis and I’m working on some other analyses that look at the pageview data at an hourly level of resolution and expand the corpus of articles from simply the articles linked from the bombing article to all other English Wikipedia articles. So stay tuned for more.