NOTE: The images associated with this post were lost in a migration.
The Boston Marathon bombing occurred less than a mile from where I work and only 300 feet away from my first apartment after graduating from college. Fortunately, all my kith and kin are safe and well. The proximity and severity of this event have motivated me to expand a bit upon the prior analysis I’ve done of other current events on Wikipedia to examine a new type of data: pageviews. The Wikimedia Foundation makes data available about the number of times an article was requested every hour going back to late 2007. For most purposes, these data can be aggregated at the daily level and accessed via another service made by User:Henrik. It is important to keep in mind that pageviews are requests, not necessarily unique viewers, although the two are obviously highly correlated.
A characteristic feature of the pageviews around a Wikipedia article for a current news event is a peak followed by a decay. Here is an example from May 2011 about Osama bin Laden. On May 1, there were 7,557 views on the article. After the announcement of his death late on May 1, the article was viewed more than 4.5 million times on May 2, and by May 31, the pageviews had dropped back to 23,795. (NOTE: The dates in the chart appear to be off by a day since the announcement happened late EDT on May 1 when it was May 2 in UTC.) The vast majority of the 7,557 views on May 1 occurred without any knowledge of his death and are similar to the daily pageview activity over the entire month of April (between ~6,000 and ~11,000). The magnitude of the “burst” of pageview activity on May 2 is obviously indicative of a major event that drove many people to seek information about bin Laden in a narrow period of time.
Similar patterns of pageview bursts are also found on articles related to bin Laden such as “Abbottabad” or “United States Naval Special Warfare Development Group”, which are clearly related to the events of that day. Other articles such as “Saudi royal family” also exhibited characteristic bursts of activity around May 2, while articles such as “Bill Clinton” had no characteristic burst. This suggests that some pages were more related to the events of that day than others because they received similar types of intense attention. In other words, the size of the pageview activity burst for an article on the day reflects users suddenly seeking information about a current news event.
Turning our attention away from this Osama bin Laden example and back to the Boston Marathon bombings, the bursts of pageview activity on a set of articles could reveal information about the event itself. Using the “Boston Marathon bombing” article as a seed, I extracted the 140 other articles the bombing article linked to. Of course, the text of this article is highly unstable and some of these links are likely to come and go. Nevertheless, I will use this list of 140 other articles to examine which received the largest bursts of activity. To quantify the magnitude of the pageview bursts across these articles, I simply took the median number of daily pageviews for each article over the 6-week period from March 1 through April 14 as a baseline. Then I took the maximum number of pageviews on either April 15 or April 16 (the most recent dates available). The ratio of these pageviews (maximum during the days following the event divided by the median over the days preceding the event) gives us some idea of which articles saw the greatest increases in pageview activity.
- Ground stop: 329.0
- Boylston Street: 268.73
- Google Person Finder: 237.22
- Patriots’ Day: 201.32
- Copley Square: 171.53
- Controlled explosion: 168.25
- The Lenox Hotel: 116.0
- Pressure cooker: 83.98
- Massachusetts Emergency Management Agency: 83.5
- Boston Police Special Operations Unit: 78.43
- BB (ammunition): 59.41
This list excludes a number of articles like “Edward F. Davis” and “Pressure cooker bomb” that did not exist before April 15. However, the size of the bursts of pageview activity on Wikipedia articles (linked from the bombing article itself) conveys a surprising amount of information about the more salient details of the location, timing, cause, and effects of this story.
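The burst ratio described above can be sketched in a few lines of Python. This is only an illustration and not the code behind the numbers in this post: the function names, the dictionary layout, and the toy counts are all assumptions. Articles whose baseline median is zero (for instance, because they did not exist yet) are skipped, in the same spirit as the exclusions mentioned above.

```python
from statistics import median


def burst_ratio(daily_views, baseline_days, event_days):
    """Peak views on the event days divided by the median baseline views.

    daily_views maps a date string to the view count for ONE article.
    Returns None when the baseline median is zero, since the ratio is
    undefined (e.g. the article did not exist before the event).
    """
    baseline = median(daily_views.get(day, 0) for day in baseline_days)
    peak = max(daily_views.get(day, 0) for day in event_days)
    if baseline == 0:
        return None
    return peak / baseline


def rank_bursts(views_by_article, baseline_days, event_days):
    """Return (title, ratio) pairs sorted from largest burst to smallest."""
    ratios = {}
    for title, daily_views in views_by_article.items():
        ratio = burst_ratio(daily_views, baseline_days, event_days)
        if ratio is not None:
            ratios[title] = ratio
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data only: made-up counts for hypothetical articles.
    baseline_days = ["day1", "day2", "day3"]
    event_days = ["event1", "event2"]
    views_by_article = {
        "Quiet article": {"day1": 10, "day2": 20, "day3": 30,
                          "event1": 400, "event2": 100},
        "Busy article": {"day1": 900, "day2": 1000, "day3": 1100,
                         "event1": 1500, "event2": 1200},
        "New article": {"event1": 5000, "event2": 7000},
    }
    for title, ratio in rank_bursts(views_by_article, baseline_days, event_days):
        print(f"{title}: {ratio:.2f}")
```

Using the median rather than the mean for the baseline keeps a single unusual day in the pre-event window from distorting the comparison.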
Each of these articles’ time series pageview data from March 1 through April 17 can be correlated with each other. For example, the correlation between pageviews for “Ball (bearing)” and “Brigham and Women’s Hospital” is 0.99, which strongly suggests that traffic to the two articles rose and fell together. Conversely, the correlation between “Ball (bearing)” and “USA Today” is only 0.13, suggesting the viewing activity for the two articles is generally unrelated. These correlations can be computed for every pair of articles to establish the relationship between their pageview activity. Thresholding these correlations at the 0.5 level, the resulting relationships can be represented as a correlation network. Here is the network below:
This image (click to embiggen) also tells a variety of stories despite the hairballness of the network. There are two distinct clusters of nodes. The bluish cluster corresponds to articles highly correlated with each other as they deal with topics pertaining to the bombing itself. These are the infrequently trafficked articles that suddenly attracted attention together because of the bombings. The greenish cluster on the lower right contains articles that are linked from the bombing article and aren’t tightly correlated with the bombing topics but are correlated with each other. These articles are more frequently trafficked, are less closely related to the events themselves, and pertain to major social institutions like newspapers, government agencies, and financial markets. Their clustering together suggests that, although only loosely related to the bombing itself, they nevertheless remain closely related to each other over time. Thus, this network suggests at least two distinct patterns of ongoing Wikipedia use: abrupt information seeking about topics that are suddenly in the news versus ongoing information seeking about institutions that are regularly in the news.
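For readers who want to try building a correlation network like the one described above, here is a minimal sketch. It rests on several assumptions that the post does not spell out: it uses the Pearson correlation, it keeps an edge when the correlation is at least 0.5 (rather than strictly greater, or thresholding on the absolute value), and the series names and counts are toy values rather than real pageview data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 if either sequence is constant, where the correlation
    is undefined.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_network(series_by_article, threshold=0.5):
    """Map each article pair (a, b) to its correlation when it is >= threshold."""
    edges = {}
    for a, b in combinations(sorted(series_by_article), 2):
        r = pearson(series_by_article[a], series_by_article[b])
        if r >= threshold:
            edges[(a, b)] = r
    return edges


if __name__ == "__main__":
    # Toy daily series: two that spike on the same day and one steady one.
    series_by_article = {
        "spiky_1": [1, 1, 1, 50, 20],
        "spiky_2": [2, 2, 2, 90, 40],
        "steady": [100, 105, 98, 102, 101],
    }
    for (a, b), r in correlation_network(series_by_article).items():
        print(f"{a} -- {b}: r = {r:.2f}")
```

The resulting edge list can be handed to any graph layout tool. In this toy input only the two spiky series end up connected, which mirrors how rarely visited articles that burst together form their own cluster.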
As always, this is simply a first cut of the analysis and I’m working on some other analyses that look at the pageview data at an hourly level of resolution and expand the corpus of articles from simply the articles linked from the bombing article to all other English Wikipedia articles. So stay tuned for more.