Harvard CRCS talk

Today, I gave a talk in the lunch seminar series at Harvard’s Center for Research on Computation and Society (CRCS) about my dissertation-related work on socio-technical trajectories. Slides are here and the talk was recorded — I’ll post a link to that as soon as it’s available since I need to take some notes on some excellent ideas audience members had about how to extend this research!

EDIT: Link to the video of the presentation.

Boston Marathon bombing

The Boston Marathon bombing occurred less than a mile from where I work and only 300 feet away from my first apartment after graduating from college. Fortunately, all my kith and kin are safe and well. The proximity and severity of this event have motivated me to expand a bit upon the prior analysis I’ve done of other current events on Wikipedia to examine a new type of data: pageviews. The Wikimedia Foundation makes data available about the number of times an article was requested every hour going back to late 2007. For most purposes, these data can be aggregated at the daily level and accessed via another service made by User:Henrik. It is important to keep in mind that pageviews are requests, not necessarily unique viewers, although the two are obviously highly correlated.

A characteristic feature of the pageviews around a Wikipedia article for a current news event is a peak followed by a decay. Here is an example from May 2011 about Osama bin Laden. On May 1, there were 7,557 views on the article. After the announcement of his death late on May 1, the article was viewed more than 4.5 million times on May 2, and by May 31 the pageviews had dropped back to 23,795. (NOTE: The dates in the chart appear to be off by a day since the announcement happened late EDT on May 1 when it was May 2 in UTC). The vast majority of the 7,557 views on May 1 occurred without any knowledge of his death and are similar to the pageview activity over the entire month of April (between ~6,000 and ~11,000). The magnitude of the “burst” of pageview activity on May 2 is obviously indicative of a major event that drove many people to seek information about bin Laden in a narrow period of time.

[Figure: daily pageviews for the “Osama bin Laden” article, May 2011]

Similar patterns of pageview bursts are also found on articles related to bin Laden such as “Abbottabad” or “United States Naval Special Warfare Development Group”, which are clearly related to the events of that day. Other articles such as “Saudi royal family” also exhibited characteristic bursts of activity around May 2, while articles such as “Bill Clinton” had no characteristic burst. This suggests that some pages were more related to the events of that day than others because they received similarly intense attention. In other words, the size of the pageview activity burst for an article on the day reflects users suddenly seeking information about a current news event.

Turning our attention away from this Osama bin Laden example and back to the Boston Marathon bombings, the bursts of pageview activity on a set of articles could reveal information about the event itself. Using the “Boston Marathon bombing” article as a seed, I extracted the 140 other articles the bombing article linked to. Of course, the text of this article is highly unstable and some of these links are likely to come and go. Nevertheless, I will use this list of 140 other articles to examine which received the largest bursts of activity. To quantify the magnitude of the pageview bursts across these articles, I simply took the median number of daily pageviews for each article over the 6-week period from March 1 through April 14 as a baseline. Then I took the maximum number of pageviews on either April 15 or April 16 (the most recent dates available). The ratio of these pageviews (maximum during the days following the event divided by the median over the days preceding the event) gives us some idea of which articles saw the greatest increases in pageview activity.

  1. Ground stop: 329.0
  2. Boylston Street: 268.73
  3. Google Person Finder: 237.22
  4. Patriots’ Day: 201.32
  5. Copley Square: 171.53
  6. Controlled explosion: 168.25
  7. The Lenox Hotel: 116.0
  8. Pressure cooker: 83.98
  9. Massachusetts Emergency Management Agency: 83.5
  10. Boston Police Special Operations Unit: 78.43
  11. BB (ammunition): 59.41

This list excludes a number of articles like “Edward F. Davis” and “Pressure cooker bomb” that did not exist before April 15. However, the size of the bursts of pageview activity on Wikipedia articles (linked from the bombing article itself) conveys a surprising amount of information about the more salient details of the location, timing, cause, and effects of this story.
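The burst ratio described above (maximum daily pageviews after the event divided by the median daily pageviews before it) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the script used for the numbers in this post: the function name `burst_ratio`, the article labels, and all the daily counts are made up for the example.

```python
from statistics import median

def burst_ratio(baseline_views, event_views):
    """Max daily views after the event divided by median daily views before it."""
    base = median(baseline_views)
    if base == 0:
        # Assumption: an article with a zero baseline gets an infinite ratio.
        return float("inf")
    return max(event_views) / base

# Invented daily counts, for illustration only (not real Wikipedia data).
# Each entry is (views on the days before the event, views on the days after).
toy = {
    "Quiet article": ([10, 12, 8, 11, 9], [3290, 2100]),
    "Busy article": ([5000, 5200, 4800, 5100, 4900], [6000, 5500]),
}

ratios = {title: burst_ratio(before, after) for title, (before, after) in toy.items()}

# Rank articles from largest to smallest burst.
ranked = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
for title, ratio in ranked:
    print(f"{title}: {ratio:.2f}")
```

Using the median rather than the mean for the baseline keeps a single unusual day in March or April from distorting the denominator, which matters most for articles that normally get only a handful of views.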

Each of these articles’ time series of pageview data from March 1 through April 17 can be correlated with the others. For example, the correlation between pageviews for “Ball (bearing)” and “Brigham and Women’s Hospital” is 0.99, which strongly suggests that traffic to the two articles rises and falls together. Conversely, the correlation between “Ball (bearing)” and “USA Today” is only 0.13, suggesting the viewing activity for the two articles is generally unrelated. These correlations can be computed for every pair of articles to establish the relationship between their pageview activity. Thresholding these correlations at the 0.5 level, the resulting relationships can be represented as a correlation network. Here is the network below:

[Figure: correlation network of pageviews for articles linked from the bombing article]

This image (click to embiggen) also tells a variety of stories despite the hairballness of the network. There are two distinct clusters of nodes. The bluish cluster corresponds to articles highly correlated with each other as they deal with topics pertaining to the bombing itself. These are the infrequently trafficked articles that all of a sudden attracted attention together because of the bombings. The greenish cluster on the lower right contains articles that are linked from the bombing article and aren’t tightly correlated with the bombing topics but are correlated with each other. These articles are more frequently trafficked, less closely related to the events themselves, and pertain to major social institutions like newspapers, government agencies, and financial markets. Their clustering together suggests that, despite being only loosely related to the bombing itself, they nevertheless remain closely related to each other over time. Thus, this network suggests at least two distinct patterns of ongoing Wikipedia use: abrupt information seeking about topics that are suddenly in the news versus ongoing information seeking about institutions that are regularly in the news.
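For readers who want to try something similar, here is a minimal sketch of how a thresholded correlation network like the one above could be built from daily pageview series. It assumes Pearson correlation and keeps only pairs with a correlation of at least 0.5; the post does not say which correlation measure was used or whether negative correlations were considered, so treat both as assumptions. The helper names and the toy series are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        # Assumption: a perfectly flat series is treated as uncorrelated.
        return 0.0
    return cov / (sd_x * sd_y)

def correlation_edges(series, threshold=0.5):
    """Return (article_a, article_b, r) for every pair with r >= threshold."""
    edges = []
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges

# Invented daily pageviews: A and B spike together, C is steady throughout.
toy_series = {
    "A": [10, 11, 9, 500, 300],
    "B": [20, 19, 21, 900, 610],
    "C": [1000, 990, 1010, 1005, 995],
}

edges = correlation_edges(toy_series)
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r:.2f}")
```

The resulting edge list can be loaded into any graph tool for layout and clustering. With 140 articles there are 9,730 pairs to check, which is small enough for this brute-force loop.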

As always, this is simply a first cut of the analysis and I’m working on some other analyses that look at the pageview data at an hourly level of resolution and expand the corpus of articles from simply the articles linked from the bombing article to all other English Wikipedia articles. So stay tuned for more.

Co-authorship patterns around Pope Francis

A little late in coming, but here’s a pretty picture based on a conference submission I’m preparing.

  1. Take the revision history of all 607 unique editors who contributed to the article on Pope Francis after 1 Jan 2013.
  2. Get all the other 22,225 articles they revised since the beginning of the year.
  3. From this two-mode network, project a one-mode article-article network where one article is linked to another article if they share an editor in common.
  4. Filter out all the edges where there is only a single editor in common, leaving articles that have been edited by two or more editors in common, and remove the resulting isolates.
  5. Identify the largest connected component consisting of 2,671 articles and 3,144 edges.
  6. Visualize! Nodes are sized based on degree and colored based on modularity class. Data (including GraphML files for both the complete graph and LCC, a larger PNG, and a SVG) available here.
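Steps 3 through 5 above can be sketched with nothing but the Python standard library. This is a rough illustration under stated assumptions rather than the code behind the figure: the function names `project_articles` and `largest_component` are hypothetical, and the editor-to-article lists are invented toy data.

```python
from collections import defaultdict
from itertools import combinations

def project_articles(editor_articles, min_shared=2):
    """One-mode projection: weight each article pair by its number of shared editors."""
    weights = defaultdict(int)
    for articles in editor_articles.values():
        # Every pair of articles touched by the same editor gains one shared editor.
        for a, b in combinations(sorted(set(articles)), 2):
            weights[(a, b)] += 1
    # Drop pairs with fewer than min_shared editors in common.
    return {pair: w for pair, w in weights.items() if w >= min_shared}

def largest_component(edges):
    """Return the node set of the largest connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Invented toy data: editor -> articles revised (not the real revision histories).
toy_editors = {
    "editor1": ["Pope Francis", "List of popes", "Papal conclave, 2013"],
    "editor2": ["Pope Francis", "List of popes", "Papal conclave, 2013"],
    "editor3": ["Pope Francis", "Club season", "Cup season"],
    "editor4": ["Club season", "Cup season"],
}

weights = project_articles(toy_editors)
lcc = largest_component(weights)
print(sorted(lcc))
```

Because the component search only sees articles that still have at least one edge after filtering, isolates drop out automatically. In the toy data, "Club season" and "Cup season" stay linked to each other but are cut off from "Pope Francis", since only one editor connects them to it.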

article-coauthorship-lcc_pope_20130101_4096

There’s a lot going on there and much more to see by looking around the full image, but I’ll give a few highlights.

The articles with the strongest tie (most editors in common)? A lot of ties between Pope Francis and other papal and Catholic-related articles round out the top 10 as one would expect, but there are some interesting outliers as well: Pier Luigi Bersani and Italian general election, 2013, with 42 editors in common, actually take first, and 2013 Malmö FF season and 2012–13 Svenska Cupen come in 4th. That is to say, these seemingly random articles shared at least 2 editors with the Pope Francis article but were themselves the subject of intense co-editing.

(u'Pier Luigi Bersani', u'Italian general election, 2013', {'weight': 42}),
 (u'List of popes', u'Pope Francis', {'weight': 37}),
 (u'Papal conclave, 2013', u'Pope Francis', {'weight': 31}),
 (u'2013 Malm\xf6 FF season', u'2012\u201313 Svenska Cupen', {'weight': 26}),
 (u'Pope Benedict XVI', u'Pope Francis', {'weight': 24}),
 (u'Papal conclave, 2013', u'Pope Benedict XVI', {'weight': 22}),
 (u'Pope Benedict XVI', u'Resignation of Pope Benedict XVI', {'weight': 22}),
 (u'Papal conclave, 2013',u'Resignation of Pope Benedict XVI',{'weight': 20}),
 (u'South American dreadnought race',u'Argentine\u2013Chilean naval arms race',{'weight': 18}),
 (u'Timeline of Vietnamese history',u'First Chinese domination of Vietnam',{'weight': 18})

Of course, a lot of co-authorship was around other Catholic topics: the papal conclave, Pope Benedict XVI and his resignation, and other cardinal electors:

catholic

There is a lot of co-authorship around other topics that were also in the news:

breaking_news

Other current-events topics that show up in these coauthorship patterns include updates to Swedish football club rosters as well as editing of articles about members of the Baathist regime in Syria. Strangely, these two disparate topics are clustered together (both by modularity and by layout), suggesting they draw from similar communities of editors.

football and syria

If you want to know more, hopefully our paper will be accepted and I can share it 🙂