2012 Aurora shootings

Data, Dissertation, Wikipedia

The “2012 Aurora shooting” article on Wikipedia is an example of a breaking news article in which many editors intensively and jointly edit a single article. As of approximately 1pm EDT on July 21, 290 unique editors have made 1,281 changes to the article in a period of less than 36 hours.

The chart below shows how rapidly the article has grown, as well as interesting changes in the concentration of work done per editor (times are recorded in UTC). The blue dots are the length of the article in bytes. We see rapid growth of the article at about 19:00 UTC which is later reverted, a rapid deceleration of article growth at about 1:00 UTC on 7/21, a momentary expansion of the article at about 6:00 UTC, and a sudden contraction at about 10:00 UTC. The average number of edits per editor (green) also shows interesting damped sinusoidal behavior: the earliest part of the article involves many contributions from few editors, followed by a sudden decline as many new editors join the collaboration, a rise again as some of these new contributors make many revisions, and then a stabilization around 4 edits per editor.
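As a rough illustration of the green series, here is a minimal sketch of a running edits-per-editor ratio. It is not the code behind the chart; the function name and the toy list of usernames are made up for the example, and it assumes the revision history has already been reduced to a chronological list of editor names.

```python
def edits_per_editor_over_time(users):
    """Running ratio of total edits to unique editors after each revision.

    `users` is a chronological list of editor names, one per revision
    (an assumed input shape, not the actual CSV layout).
    """
    seen = set()
    ratios = []
    for edit_count, user in enumerate(users, start=1):
        seen.add(user)
        ratios.append(edit_count / len(seen))
    return ratios


# Hypothetical toy history: one editor starts alone, then others join.
toy_users = ["A", "A", "A", "B", "C", "B"]
running = edits_per_editor_over_time(toy_users)

# Overall figure from the totals quoted in the post.
overall = round(1281 / 290, 2)
```

In the toy history the ratio climbs while a single editor works alone and drops as soon as new editors arrive, which is the same mechanism behind the early spike and decline described above. The overall ratio from the quoted totals is consistent with the series settling around 4 edits per editor.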

Of course, the actual work editors are doing on the article is very uneven and follows classic long-tail behavior. Most editors make only a single contribution, but a handful of editors are making dozens of contributions. Users O’Dea and Sandstein have each made more than 65 contributions over the last 36 hours.
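The long-tail tally can be sketched with two counters: one for edits per editor, and one for how many editors made each number of edits. This is only an illustration under the assumption that the history is a list of editor names; the names and data below are hypothetical.

```python
from collections import Counter


def contribution_distribution(users):
    """Return (edits per editor, number of editors per edit count)."""
    per_editor = Counter(users)
    by_count = Counter(per_editor.values())
    return per_editor, by_count


# Hypothetical toy history, one editor name per revision.
toy_users = ["A", "B", "A", "C", "A", "D", "B"]
per_editor, by_count = contribution_distribution(toy_users)

# The most active editor and their number of edits.
top_editor = per_editor.most_common(1)
```

Plotting `by_count` on log-log axes is one common way to check whether the distribution really has a long tail.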

Moreover, the activity on this article is also extremely intense. The chart below plots the distribution of time between edits. Again, the vast majority of edits occur within seconds or minutes of each other, and only twice in the entire 36-hour history of the article have 20 minutes gone by without someone making a change to the article.
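The gaps between edits can be computed by differencing consecutive timestamps. The sketch below assumes the timestamps are available as ISO 8601 strings; the function names, the toy timestamps, and the 20-minute threshold parameter are illustrative, not taken from the actual data file.

```python
from datetime import datetime, timedelta


def inter_edit_gaps(timestamps):
    """Time elapsed between each pair of consecutive edits."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def count_long_gaps(gaps, threshold=timedelta(minutes=20)):
    """Number of gaps at least as long as `threshold`."""
    return sum(1 for gap in gaps if gap >= threshold)


# Hypothetical toy timestamps (UTC), not the real revision history.
toy_times = [
    datetime.fromisoformat(stamp)
    for stamp in (
        "2012-07-20T11:00:00",
        "2012-07-20T11:00:45",
        "2012-07-20T11:03:00",
        "2012-07-20T11:30:00",
    )
]
gaps = inter_edit_gaps(toy_times)
long_gaps = count_long_gaps(gaps)
```

A histogram of `gap.total_seconds()` over the real history would give the kind of distribution shown in the chart.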

Finally, I used a method to mine the log of revisions made to the article to create a network of users modifying each others’ work (see paper here). For example, if user B makes a change to the article which was previously the version from user A, a directed link would be created from user B towards user A: user B modified user A’s version. I can also encode a variety of other information into this network. I color the nodes such that bluer nodes are users who started editing early in the article’s history and redder nodes are users who joined the collaboration later. Larger nodes are editors who have more connections to other editors. Darker links are larger changes made to the article. Larger links are more interactions between editors. Visualization was done in Gephi.
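To make the edge rule concrete, here is a minimal sketch of one way such a network could be assembled from a revision log. It is not the custom code used for the figure. The input shape (a chronological list of user and article-size pairs), the attribute names, the use of absolute byte difference as the size of a change, and the choice to skip an editor's consecutive edits to their own version are all assumptions made for the example.

```python
from collections import defaultdict


def build_revision_network(revisions):
    """Build directed edges from each editor to the editor they revised.

    `revisions` is a chronological list of (user, article_size_in_bytes).
    Returns (edges, first_seen), where edges maps (source, target) to
    interaction counts and summed byte changes, and first_seen maps each
    user to the index of their first revision (usable for node color).
    """
    first_seen = {}
    for index, (user, _size) in enumerate(revisions):
        first_seen.setdefault(user, index)

    edges = defaultdict(lambda: {"interactions": 0, "bytes_changed": 0})
    for (prev_user, prev_size), (user, size) in zip(revisions, revisions[1:]):
        if user == prev_user:
            continue  # assumption: editing your own version adds no edge
        edge = edges[(user, prev_user)]
        edge["interactions"] += 1  # could drive link thickness
        edge["bytes_changed"] += abs(size - prev_size)  # could drive link darkness
    return dict(edges), first_seen


# Hypothetical toy log: (user, article size in bytes after the edit).
toy_log = [("A", 100), ("B", 150), ("A", 140), ("B", 400), ("C", 390)]
edges, first_seen = build_revision_network(toy_log)
```

In the toy log, B revises A twice, so the edge from B to A accumulates two interactions. Byte difference is a crude proxy for the size of a change, since an edit that replaces text of equal length registers as zero.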

 

Eyeballing the diagram (which is no way to do real analysis) suggests that the most prolific editors joined the collaboration early but were not the first contributors either (they are at approximately 2 o’clock). The dense core of the network consists of several editors who appear to be working closely together, modifying each others’ work. O’Dea is at the center. Most of the editors who join later (greens, oranges, and reds) are on the periphery of the network, suggesting they make relatively minor contributions, sometimes with other less central editors, but their work is subsequently revised by the central editors. These high-activity editors are maintaining their involvement over time by interacting with many different types of editors who joined earlier and later (as indicated by being connected to nodes of different colors). However, the largest changes prolific users make are still small relative to the changes other users are making (lighter links compared to the black links elsewhere).

The data for this analysis is based off revision history (CSV here) and was translated into a network format (GEXF here) using custom code I promise I’ll post when it’s not embarrassingly hacky.

Update via Taha Yasseri:

Within 48 hours of the event, 30 Wikipedia language editions had coverage of the event. 10 of these Wikipedias had articles within 6 hours of the article on the English Wikipedia. Interestingly, smaller languages like Latvian (3) and Danish (7) appeared before other major languages like Polish (8), French (10), and German (13).

7 thoughts on “2012 Aurora shootings”

  1. I am the contributor O’Dea whose activity you analyzed.

    I fell asleep lying on the floor watching television at about 00:30 and woke at 04:00 on 20 July 2012. A big news story was running on the television about a shopping mall location (Aurora Mall, new name Town Center of Aurora) where I went on a date in a coffee shop a few years ago. The Century 16 cinema is adjacent to the mall. I followed the story for a few minutes and then made my first edit to the Town Center of Aurora article at Wikipedia at 04:17 (11:17 UTC).

    Shortly afterwards, I found the 2012 Aurora shooting article dedicated to the incident and added some details as they came from the television and when I could find online citations for my additions. Some hours later I went back to sleep.

  2. Interesting work. Normally I’d expect editing rates on an article like this to be closely related to its protection status. But looking at the logs http://en.wikipedia.org/w/index.php?title=Special:Log&page=2012+Aurora+shooting

    It went straight to full protection at 11:26 on the 20th and then that was notched down to semiprotection at 18:50. There was almost certainly a shift to the talkpage during that time. Full protection meant that only one of our 700 or so admins could edit the article, and semiprotection excludes IP editors and very new accounts.

    Over 192,000 views of the page so far http://stats.grok.se/en/latest/2012_Aurora_shooting

    and 4,738 views of the talkpage http://stats.grok.se/en/201207/talk:2012_Aurora_shooting

    But I think there may be some complications from the page moves

    WSC

  3. Hi,
    very interesting work and great weblog! In my opinion knowledge production on Wikipedia is one of the most exciting topics in current internet research.
    The only thing I do not really understand is the Network analysis. As I understand it an edge is established between every user and the user who edited the article before him/her?
    So, my question is which kind of relation is this and what can it tell us? is it really a kind of collaboration? or may it be a small edit war between two users? Is it possible to differentiate between these two types of connection? Would you say it is important to differentiate between these types?
    In this case users are not totally free to decide with whom to interact since every user who contributes to the article automatically establishes a connection, but maybe he or she isn’t even conscious of this.
    If it is just about the question: who are the most productive users or who has made how many edits and when it would not be necessary to do a network analysis.

  4. @WereSpielChequers There’s a project I’d love to do that tracks and compares changes in page views across the site as a result of admin actions, front paging, news events, etc.

    @supersambo Edges are only formed with the user immediately “in front of them”. Certainly this would capture users engaged in an edit war — but collaborations are messy and it would be important to see this as well as the context of which other users’ revisions they were modifying. You’re somewhat right that many users may not be free or conscious of the other party they’re interacting with, but users also decide when to jump in and make changes in response to other changes made. Other researchers like Aaron Halfaker and Jeff Rzeszotarski have tracked the authorship of Wikipedia content at the level of words or sentences, but this is an NP-hard problem and I don’t quite have the coding skills to hack something like that together yet. But the network analysis does let us see whether active users are interacting with each other or other particular kinds of users and what happens when they stop contributing.

  5. Nice stuff, Brian. Very interesting to see O’Dea’s particular circumstances, including the trigger of the national event connecting to something in their personal lives (a date).

    I wonder if you could refine the network analysis such that you surface pairings that seem to rise above some random baseline. As the discussion with @supersambo showed, there’s some baseline editing of other people’s edits just because we’re working with an existing artifact (not that that isn’t an interesting type of collaboration, even if unconscious).

    I wonder how one could see particular responsiveness to the edits of others? Clearly that’s complicated by the likelihood that editors are working on particular parts of the article, so you might get above random chance associations just because two happen to work on a particular part of the article. So that complicates any baseline that assumes a random editing position (and then looks for deviations from that randomness). So one would probably have to build a baseline model that somehow anchored an editor to the areas they initially chose to edit.

    Associations in time might be a better indicator of responsiveness, establish some baseline rhythm and then see if edits by a particular editor seem to drive their partner above that baseline. Yummy, seems computationally expensive :)

  6. Brian,

    I think normalizing by traffic volume (at an hourly resolution) would give some interesting insights, particularly about the pre-protection phase. WSC has a point about a possible experiment regarding the effects of (semi)protection on edit rates (further thoughts are here).

    I imagine you know that we publish raw hourly page view data from all Wikimedia projects.

  7. @James I’ve also been racking my brain to think how to put a null model on this. Filtering on users who make more than 1 edit, users who make non-successive edits, etc. might reveal the backbone of “collaborators” rather than “contributors”. Ideas like structural closure don’t quite make sense here, so I wouldn’t really try to tackle it with a p*/ERGM approach either. This paper has a statistical method for extracting the backbone of a network based on variability of edge weights (http://www.pnas.org/content/106/16/6483.abstract). It would be a great mixed methods study to use their method to see which types of contributions and interactions are preserved.

    @Dario One of these days I’ll get around to setting up a streaming database to store the hourly traffic volume. It’d be great if WMF had an API to query the hourly traffic data on a per-article basis rather than downloading thousands of 80 MB files to get a single line out! :)
