Co-authorship patterns around Pope Francis

A little late in coming, but here’s a pretty picture based on a conference submission I’m preparing.

  1. Take the revision history of all 607 unique editors who contributed to the article on Pope Francis after 1 Jan 2013.
  2. Get all the other 22,225 articles they revised since the beginning of the year.
  3. From this two-mode network, project a one-mode article-article network where one article is linked to another article if they share an editor in common.
  4. Filter out all the edges where there is only a single editor in common, leaving articles that have been edited by two or more editors in common, and remove the resulting isolates.
  5. Identify the largest connected component consisting of 2,671 articles and 3,144 edges.
  6. Visualize! Nodes are sized based on degree and colored based on modularity class. Data (including GraphML files for both the complete graph and LCC, a larger PNG, and a SVG) available here.
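Steps 3 to 5 can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the code behind the figure: the function names, the toy editor-to-article mapping, and the `min_shared` parameter are all hypothetical, and a graph library would do the same job with less typing.

```python
from collections import Counter, defaultdict
from itertools import combinations


def project_articles(editor_articles, min_shared=2):
    """Project editor->articles data onto article pairs.

    Returns {(article_a, article_b): number of shared editors}, keeping only
    pairs with at least `min_shared` editors in common (step 4).
    """
    weights = Counter()
    for articles in editor_articles.values():
        # Every pair of articles touched by the same editor gains one unit of weight.
        for a, b in combinations(sorted(set(articles)), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_shared}


def largest_component(edges):
    """Return the node set of the largest connected component (step 5).

    Isolates never appear here because only nodes with a surviving edge are seen.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in neighbors[node] - component:
                component.add(nxt)
                stack.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy data: editor -> articles revised (not the real data set).
toy = {
    "ed1": ["Pope Francis", "List of popes", "Papal conclave, 2013"],
    "ed2": ["Pope Francis", "List of popes"],
    "ed3": ["Pope Francis", "Papal conclave, 2013", "Pope Benedict XVI"],
    "ed4": ["Papal conclave, 2013", "Pope Benedict XVI"],
    "ed5": ["Article X", "Article Y"],
}
edges = project_articles(toy)
lcc = largest_component(edges)
```

In the toy data, the pair linked only by "ed5" has a single editor in common, so it is dropped by the filter and never reaches the component search.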


There’s a lot going on there and much more to see by looking around the full image, but I’ll give a few highlights.

The articles with the strongest tie (most editors in common)? A lot of ties between Pope Francis and other papal and Catholic-related articles round out the top 10, as one would expect, but there are some interesting outliers as well: Pier Luigi Bersani and Italian general election, 2013, with 42 editors in common, actually take first, and 2013 Malmö FF season and 2012–13 Svenska Cupen come in 4th. That is to say, these random articles shared at least 2 editors with the Pope Francis article but were themselves the subject of intense co-editing.

(u'Pier Luigi Bersani', u'Italian general election, 2013', {'weight': 42}),
 (u'List of popes', u'Pope Francis', {'weight': 37}),
 (u'Papal conclave, 2013', u'Pope Francis', {'weight': 31}),
 (u'2013 Malm\xf6 FF season', u'2012\u201313 Svenska Cupen', {'weight': 26}),
 (u'Pope Benedict XVI', u'Pope Francis', {'weight': 24}),
 (u'Papal conclave, 2013', u'Pope Benedict XVI', {'weight': 22}),
 (u'Pope Benedict XVI', u'Resignation of Pope Benedict XVI', {'weight': 22}),
 (u'Papal conclave, 2013', u'Resignation of Pope Benedict XVI', {'weight': 20}),
 (u'South American dreadnought race', u'Argentine\u2013Chilean naval arms race', {'weight': 18}),
 (u'Timeline of Vietnamese history', u'First Chinese domination of Vietnam', {'weight': 18})

Of course, a lot of co-authorship was around other Catholic topics: the papal conclave, Pope Benedict XVI and his resignation, and other cardinal electors:


There is a lot of co-authorship around other topics that were also in the news:


Other current-events topics that are not peripheral to these coauthorship patterns include updates to Swedish football club rosters as well as editing of articles about members of the Baathist regime in Syria. Strangely, these two disparate topics are clustered together (both by modularity and by layout), suggesting they draw from similar communities of editors.

football and syria

If you want to know more, hopefully our paper will be accepted and I can share it 🙂

Sandy Hook School massacre

If you follow me on Twitter, you’re probably already well-acquainted with my views on what should happen in the wake of the shooting spree that massacred 20 children and 6 educators at a suburban elementary school in Newtown, Connecticut. This post, however, will build on my previous analysis of the Wikipedia article about the Aurora shootings as well as my dissertation examining Wikipedia’s coverage of breaking news events to compare the evolution of the article for the Sandy Hook Elementary School shooting to other Wikipedia articles about recent mass shootings.

In particular, this post compares the behavior of editors during the first 48 hours of each article’s history. The fact that there are 43 English Wikipedia articles about shooting sprees in the United States since 2007 should lend some weight to this much ballyhooed “national conversation” we are supposedly going to have, but I chose to compare just six of these articles to the Sandy Hook shooting article, based on their recency and severity, as well as an international example.

Wikipedia articles certainly do not break the news of the events themselves, but the first edits to these articles happen within two to three hours of the event itself unfolding. Once created, however, these articles attract many editors and changes and grow extremely rapidly.

Figure 1: Number of changes made over time.

The Virginia Tech article, far and away, attracted more revisions than the other shootings in the same span of time and ultimately enough revisions in the first 48 hours (5,025) to put it within striking distance of the top 1000 most-edited articles in all of Wikipedia’s history. Conversely, the Oak Creek and Binghamton shootings, despite having 21 fatalities between them, attracted substantially less attention from Wikipedians and the news media in general, likely because these massacres had fewer victims and the victims were predominantly non-white.

A similar pattern of VT as an exemplary case, shootings involving immigrants and minorities attracting less attention, and the other shootings having largely similar behavior is also found in the number of unique users editing an article over time:

Figure 2: Number of unique users over time.

These editors and the revisions they make cause articles to rapidly increase in size. Keep in mind, the average Wikipedia article’s length (albeit highly skewed by many short articles about things like minor towns, bands, and species) is around 3,000 bytes and articles above 50,000 bytes can raise concerns about length. Despite the constant back-and-forth of users adding and copyediting content, the Newtown article reached 50kB within 24 hours of its creation. However, in the absence of substantive information about the event, much of this early content is often related to national and international reactions and expressions of support. As more background and context come to light, this list of reactions is typically removed, which shows up as a sudden contraction of article size: in Utoya around 22 hours, and in Newtown and Virginia Tech around 36 hours. As before, the articles about the shootings at Oak Creek and Binghamton are significantly shorter.

Figure 3: Article size over time.

However, not every editor does the same amount of work. The Gini coefficient captures the concentration of effort (in this case, number of revisions made) across all editors contributing to the article. A Gini coefficient of 1 indicates that all the activity is concentrated in a single editor while a coefficient of 0 indicates that every editor does exactly the same amount of work.
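For the curious, here is a minimal sketch of one common way to compute a Gini coefficient from a list of revision authors. The function name and the toy author list are hypothetical, and this is not necessarily the exact estimator used for the figure. One caveat worth stating: with this finite-sample formula, a single editor doing all the work among n editors scores (n - 1)/n, which only approaches 1 as n grows.

```python
from collections import Counter


def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i), with x sorted ascending, i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical revision log: one entry per revision, naming its author.
authors = ["a", "b", "c", "a", "a", "a"]
revisions_per_editor = Counter(authors)          # a: 4, b: 1, c: 1
concentration = gini(revisions_per_editor.values())
```

Recomputing this on the revisions made up to each point in time gives a curve of concentration over the life of an article.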

Figure 4: Gini coefficient of editors’ revision counts over time.

Across all the articles, the edits over the first few hours are evenly distributed: editors make a single contribution and others immediately jump in to make single contributions as well. However, around hour 3 or 4, one or more dedicated editors show up and begin to take a vested interest in the article, which manifests as a rapid centralization of the article. This centralization increases slightly over time across all articles, suggesting these dedicated editors continue to edit after other editors move on.

Another way to capture the intensity of activity on these articles is to examine the amount of time elapsed between consecutive edits. Intensely edited articles may have only seconds between successive revisions while less intensely edited articles can go minutes or hours. This data is highly noisy and bursty, so the plot below is smoothed over a rolling average of about 3 hours.
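A rough sketch of the waiting-time calculation, assuming revision timestamps are available as `datetime` objects. The trailing three-hour window is my guess at one reasonable smoothing scheme rather than a description of the exact one used for the plot, and the function names and toy timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def waiting_times(timestamps):
    """Return [(time_of_edit, seconds_since_previous_edit), ...]."""
    ordered = sorted(timestamps)
    return [
        (later, (later - earlier).total_seconds())
        for earlier, later in zip(ordered, ordered[1:])
    ]


def rolling_mean(gaps, window=timedelta(hours=3)):
    """Trailing mean of the gaps whose edit time falls within `window` of each edit."""
    smoothed = []
    start = 0
    running = 0.0
    for end, (when, gap) in enumerate(gaps):
        running += gap
        # Drop gaps that have fallen out of the trailing window.
        while gaps[start][0] <= when - window:
            running -= gaps[start][1]
            start += 1
        smoothed.append((when, running / (end - start + 1)))
    return smoothed


# Hypothetical timestamps: three quick edits, then one after a long lull.
t0 = datetime(2012, 12, 14, 15, 0, 0)
edits = [t0, t0 + timedelta(seconds=60), t0 + timedelta(seconds=180), t0 + timedelta(hours=4)]
gaps = waiting_times(edits)
smooth = rolling_mean(gaps)
```

Plotting the smoothed values on a log-scaled y-axis keeps the second-scale bursts and the hour-scale lulls visible on the same chart.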

Figure 5: Waiting times between edits (y-axis is log-scaled).

What’s remarkable is the sustained level of intensity over a two-day period. The Virginia Tech article was still being edited several times every minute even 36 hours after the event while other articles were seeing updates every five minutes more than a day after the event. This means that even at 3 am, all these articles are still being updated every few minutes by someone somewhere. There’s a general upward trend, reflecting intense activity immediately after the article is created followed by increasing time lags as the article stabilizes, but there’s also a diurnal cycle with edits slowing between 18 and 24 hours after the event before quickening again. This slowing and quickening is seen around 20 hours as well as around 44 hours, suggesting information being released and incorporated in cycles as the investigation proceeds.

Finally, who is doing the work across all these articles? The structural patterns of users contributing to articles also reveal interesting patterns. It appears that much of the editing is done by users who have never contributed to the other articles examined here, but there are a few editors who contributed to each of these articles within 4 hours of their creation.

Figure 6: Collaboration network of articles (red) and the editors who contribute to them (grey) within the first four hours of their existence. Editors who’ve made more revisions to an article have thicker and darker lines.

Users like BabbaQ (edits to Sandy Hook), Ser Amantio di Nicolao (edits to Sandy Hook), and Art LaPella (edits to Sandy Hook) were among the first responders to edit several of these articles, including Sandy Hook. However, their revisions are relatively minor copyedits and reference formatting, reflecting the prolific work they do patrolling recent changes. Much of the substantive content of the article is from editors who have edited none of the other articles about shootings examined here and likely no other articles about other shootings. In all likelihood, readers of these breaking news articles are mostly consuming the work of editors who have never previously worked on this kind of event. In other words, some of the earliest and most widely read information about breaking news events is written by people with fewer journalistic qualifications than Medill freshmen.

What does the collaboration network look like after 48 hours?

Figure 7: Collaboration network after 48 hours.

3,262 unique users edited one or more of these seven articles, 222 edited two or more of these articles, 60 edited three or more, and a single user, WWGB, edited all seven within the first 48 hours of their creation. These editors are at the center of Figure 7, where they connect to many of the articles on the periphery. The stars surrounding each of the articles are the editors who contributed to that article and that article alone (in this corpus). WWGB is an editor who appears to specialize not only in editing articles about current events, but also in participating in a community of editors engaged in the newswork on Wikipedia. These editors are not the first to respond (as above), but their work involves documenting administrative pages enumerating current events and mediating discussions across disparate current events articles. The ability for these collaborations to unfold as smoothly as they do appears to rest on the ability for Wikipedia editors with newswork experience to either supplant or complement the work done by amateurs who first arrive on the page.
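Counts like "edited two or more" fall out of a simple tally of how many distinct articles each editor touched. The sketch below assumes a mapping from article title to the list of usernames in its revision history; the function name and the toy usernames are placeholders, not real editors or real data.

```python
from collections import Counter


def editors_by_breadth(article_editors):
    """Return {k: number of editors who edited at least k of the articles}."""
    breadth = Counter()
    for editors in article_editors.values():
        # set() so repeat revisions to one article count once per editor.
        for editor in set(editors):
            breadth[editor] += 1
    n_articles = len(article_editors)
    return {
        k: sum(1 for b in breadth.values() if b >= k)
        for k in range(1, n_articles + 1)
    }


# Hypothetical three-article corpus with placeholder usernames.
toy_corpus = {
    "Article 1": ["editor_a", "editor_b", "editor_c"],
    "Article 2": ["editor_a", "editor_b", "editor_d", "editor_d"],
    "Article 3": ["editor_a", "editor_e"],
}
breadth_counts = editors_by_breadth(toy_corpus)
```

Here five editors touched at least one article, two touched at least two, and one touched all three.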

Of course, this just scratches the surface of the types of analyses that could be done on this data. One might look at changes in the structure and pageview activity of each article’s local hyperlink neighborhood to see what related articles are attracting attention, examine the content of the article for changes in sentiment, the patterns of introducing and removing controversial content and unsubstantiated rumors, or broaden the analysis to the other shooting articles. Needless to say, one hopes the cases for future analyses become increasingly scarce.

The IPython Notebook and GEXF network files used in this analysis can be found here.

Edit: As always, Taha Yasseri is on the ball with his analysis of Wikipedia coverage of the events.

Disclosure: I edited the Sandy Hook article twice after publishing this post.

2012 Aurora shootings

The “2012 Aurora shooting” article on Wikipedia is an example of a breaking news article which has many editors intensively and jointly editing a single article. As of approximately 1pm EDT on July 21, 290 unique editors have made 1,281 changes to the article in a period of less than 36 hours.

The chart below shows how the size of the article has rapidly grown as well as interesting changes in the concentration of work done per editor (time is recorded in UTC). The blue dots are the length of the article in bytes. We see that there is a rapid growth of the article at about 19:00 UTC which is later reverted, a rapid deceleration of article growth at about 1:00 UTC on 7/21, a momentary expansion of the article at about 6:00 UTC, and a sudden contraction at about 10:00 UTC. The average number of edits per editor (green) also shows interesting damped sinusoidal behavior, with the earliest part of the article involving many contributions from few editors, a sudden decline as many new editors join the collaboration, a rise again as some of these new contributors make many revisions, and then a stabilization around 4 edits per editor.
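The green series is easy to reproduce in principle: after each revision, divide the number of revisions so far by the number of distinct editors seen so far. A minimal sketch, assuming a chronological list of revision authors (the function name and toy list are hypothetical):

```python
def edits_per_editor(authors):
    """Running ratio of revisions made to unique editors seen, one value per revision."""
    seen = set()
    series = []
    for count, author in enumerate(authors, start=1):
        seen.add(author)
        series.append(count / len(seen))
    return series


# Hypothetical log: one early editor, then newcomers, then a repeat contributor.
ratio = edits_per_editor(["a", "a", "a", "b", "c", "c"])
```

The toy series rises while one editor works alone, drops when newcomers arrive, and climbs again once a newcomer makes a second edit, which is the same shape described above in miniature.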

Of course, the actual work editors are doing on the article is very uneven and follows classic long-tail behavior. Most editors make only a single contribution, but a handful of editors are making dozens of contributions. Users O’Dea and Sandstein each have made more than 65 contributions over the last 36 hours.

Moreover, the activity on this article is also extremely intense. The chart below plots the distribution of time between edits. Again, the vast majority of edits occur within seconds or minutes of each other and only twice in the entire 36 hour history of the article have 20 minutes gone by without someone making a change to the article.

Finally, I used a method to mine the log of revisions made to the article to create a network of users modifying each others’ work (see paper here). For example, if user B makes a change to the article which was previously the version from user A, a directed link would be created from user B towards user A: user B modified user A’s version. I can also encode a variety of other information into this network. I color the nodes such that bluer nodes are users who started editing early in the article’s history and redder nodes are users who joined the collaboration later. Larger nodes are editors who have more connections to other editors. Darker links are larger changes made to the article. Larger links are more interactions between editors. Visualization was done in Gephi.
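To make the construction concrete, here is a simplified sketch of the "B modified A's version" rule. It is not the method from the paper: it assumes each revision is a `(user, article_size_in_bytes)` pair in chronological order, uses the absolute change in bytes as a stand-in for the size of a change, and skips consecutive revisions by the same user. Those choices, the function name, and the toy log are all my own assumptions for illustration.

```python
from collections import Counter


def revision_network(revisions):
    """Build directed edges (modifier -> previous author) from a revision log.

    Returns two dicts keyed by (modifier, previous_author):
    the number of interactions, and the largest byte change on that edge.
    """
    interactions = Counter()
    largest_change = {}
    for (prev_user, prev_size), (user, size) in zip(revisions, revisions[1:]):
        if user == prev_user:
            continue  # assumption: an editor revising their own version is not a tie
        edge = (user, prev_user)
        interactions[edge] += 1
        change = abs(size - prev_size)
        largest_change[edge] = max(largest_change.get(edge, 0), change)
    return dict(interactions), largest_change


# Hypothetical revision log: (user, article size in bytes after the revision).
log = [("A", 100), ("B", 150), ("B", 160), ("A", 120), ("B", 400)]
ties, changes = revision_network(log)
```

In a visualization, the interaction count would map to link width and the largest change to link darkness.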


Eyeballing the diagram (which is no way to do real analysis) suggests that the most prolific editors joined the collaboration early but are not the first contributors either (they are at approximately 2 o’clock). The dense core of the network consists of several editors who appear to be working closely together modifying each others’ work. O’Dea is at the center. Most of the editors who join later (greens, oranges, and reds) are on the periphery of the network, suggesting they make relatively minor contributions, sometimes with other less central editors, but their work is subsequently revised by the central editors. These high-activity editors are maintaining their involvement over time by interacting with many different types of editors who joined earlier and later (as indicated by being connected to nodes of different colors). However, the largest changes prolific users make are still small relative to the changes other users are making (lighter links compared to the black links elsewhere).

The data for this analysis is based off revision history (CSV here) and was translated into a network format (GEXF here) using custom code I promise I’ll post when it’s not embarrassingly hacky.

Update via Taha Yasseri:

Within 48 hours of the event, 30 Wikipedia language editions had coverage of the event. 10 of these Wikipedias had articles within 6 hours of the article on the English Wikipedia. Interestingly, smaller languages like Latvian (3) and Danish (7) appeared before other major languages like Polish (8), French (10), and German (13).