Category Archives: Wikipedia

Peripherality, mental health, and Hollywood

I promised to do a bigger teardown of Wikipedia’s coverage of current events like Robin Williams’ death and the protests in Ferguson, Missouri this week, but I wanted to share a quick result based on some tool-development work I’m doing with the Social Media Research Foundation‘s Marc Smith. We’re developing the next version of WikiImporter to allow NodeXL users to import the many types of networks found in MediaWikis [see our paper].

On Wednesday, we scraped the 1.5-step ego network of the articles that the Robin Williams article currently connects to and then checked whether these articles also link to each other. For example, his article links to the Wikipedia articles for “Genie (Aladdin)” as well as “Aladdin (1992 Disney film)”, reflecting one of his most celebrated movie roles. These articles in turn link to each other because they are closely related.

However, other articles are linked from Williams’s article but do not link to each other. The articles “Afghanistan” (where he performed with the USO for troops stationed there) and “Al Pacino” (with whom he co-starred in the 2002 movie Insomnia) are both linked from his article, but they do not link to each other: Al Pacino’s article never mentions Afghanistan and Afghanistan’s article never mentions Al Pacino. In other words, the extent to which Wikipedia articles link to each other provides a coarse measure of how closely related two topics are.
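(For the technically inclined, here’s a minimal sketch of how such a 1.5-step ego network might be pulled using the standard MediaWiki API’s action=query/prop=links interface. The helper function is hypothetical and this is illustrative, not the actual WikiImporter code.)

import requests

API = "https://en.wikipedia.org/w/api.php"

def get_links(title):
    """Return the set of mainspace article titles linked from `title`."""
    links = set()
    params = {"action": "query", "prop": "links", "titles": title,
              "plnamespace": 0, "pllimit": "max", "format": "json"}
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["pages"].values():
            links.update(l["title"] for l in page.get("links", []))
        if "continue" not in data:
            return links
        params.update(data["continue"])  # follow API continuation

seed = "Robin Williams"
alters = get_links(seed)
# 1.5-step ego network: check which of the seed's alters link to each other
# (one API pass per alter, so this is slow but simple)
edges = [(a, b) for a in alters for b in get_links(a) if b in alters]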

The links between the 276 articles that compose Williams’s hyperlinked article neighborhood show a lot of variability in whether they link to each other. Some groups, such as those around movies and actors, are densely linked, while other articles, such as those about the cities he lived in, are relatively isolated from the other linked articles. These individual nodes can be partitioned into groups using a number of different bottom-up “community detection” algorithms. A group is roughly defined as having more ties inside the group than outside of it. We can visualize the resulting graph by breaking the communities apart into sub-hairballs to reveal the extent to which these sub-communities link to each other.
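(The partitioning might be sketched like this with networkx’s implementation of the Clauset-Newman-Moore greedy modularity algorithm, one of the bottom-up community detection algorithms mentioned above; `edges` is the hypothetical list from the previous sketch.)

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from(edges)  # the article-article links gathered above

# Greedy modularity maximization: groups with more ties inside than outside
communities = greedy_modularity_communities(G)
for i, group in enumerate(communities):
    print(i, len(group), sorted(group)[:3])  # peek at each community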

[Figure: the Robin Williams article’s hyperlink neighborhood, partitioned into communities]

The communities reveal clusters of related topics about various roles, celebrity media coverage, and biographical details about places he’s lived and hobbies he enjoyed. But buried inside the primary community surrounding the “Robin Williams” article are articles like “cocaine dependence“, “depression (mood)“, and “suicide“. While these articles are linked among themselves, reflecting their similarity to each other, they are scarcely linked to any other topics in the network.

To me, this reveals something profound about the way we collectively think about celebrities and mental health. Among all 276 articles and 1,399 connections in this hyperlink network about prominent entertainers, performances in movies and television shows, and related topics, there are only 4 links to cocaine dependence, 5 links to depression, and 13 to suicide. In a very real way, our knowledge about mental health issues is nearly isolated from the entire world of celebrity surrounding Robin Williams. These problems are so peripheral, they are effectively invisible to the ways we talk about dozens of actors and their accomplishments.

In an alternative world in which mental health issues and celebrity weren’t treated as secrets to be hidden, I suspect issues of substance abuse, depression, and other mental health issues would move in from the periphery and become more central as these topics connect to other actors’ biographies as well as being prominently featured in movies themselves.

Does Wikipedia editing activity forecast Oscar wins?

The Academy Awards just concluded and much will be said about Ellen DeGeneres’ most-retweeted tweet (my coauthors and I have posted an analysis here showing that these “shared media” or “livetweeting” events disproportionately award attention to already-elite users on Twitter). I thought I’d use the time to try to debug some code I’m using to retrieve editing activity information from Wikipedia.

A naive but simple theory I wanted to test was whether editing activity could reliably forecast Oscar wins. Academy Awards are selected from approximately 6,000 ballots and the process is known for intensive lobbying campaigns to sway voters as well as tapping into the zeitgeist about larger social and cultural issues.

I assume that some of this lobbying and zeitgeist activity would manifest in the aggregate in edits to the English Wikipedia articles about the nominees. In particular, I measure two quantities: (1) the number of changes (revisions) made to the article and (2) the number of new editors making revisions. The hypothesis is simply that the articles about nominees with the most revisions and the most new editors should win. I look specifically at the time between the announcement of the nominees in early January and March 1 (an arbitrary cutoff).
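(A rough sketch of how these two quantities might be pulled from the standard MediaWiki revisions API; the helper name, example title, and dates are my own, and I count distinct editors active in the window as a simple proxy for new editors.)

import requests

API = "https://en.wikipedia.org/w/api.php"

def revisions_between(title, start, end):
    """Yield (user, timestamp) for each revision to `title` in [start, end]."""
    params = {"action": "query", "prop": "revisions", "titles": title,
              "rvprop": "user|timestamp", "rvlimit": "max", "rvdir": "newer",
              "rvstart": start, "rvend": end, "format": "json"}
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["pages"].values():
            for rev in page.get("revisions", []):
                yield rev.get("user"), rev["timestamp"]
        if "continue" not in data:
            return
        params.update(data["continue"])  # follow API continuation

revs = list(revisions_between("12 Years a Slave (film)",
                              "2014-01-16T00:00:00Z", "2014-03-01T00:00:00Z"))
n_revisions = len(revs)
n_editors = len({user for user, ts in revs})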

I’ve only run the analysis on the nominees for Best Picture, Best Director, Best Actor, Best Actress, and Best Supporting Actress (the Best Supporting Actor nominees were throwing some unusual errors, but I’ll update). The results below show that Wikipedia editing activity forecast the wins in Best Actor, Best Actress, and Best Supporting Actress, but did not do so for Best Picture or Best Director. This is certainly better than chance and I look forward to expanding the analysis to other categories and prior years.

Best Picture

“The Wolf of Wall Street” showed remarkably strong growth in the number of new edits and editors after January 1. However, “12 Years a Slave”, which ranked 5th by the end, actually won the award. A big miss.

[Figures: new editors and new edits over time for the Best Picture nominees]

Best Director

The Wikipedia activity here showed strong growth for Steve McQueen (“12 Years a Slave”), but Alfonso Cuarón (“Gravity”) took the award despite coming in 4th in both metrics here. Another big miss.

[Figures: new editors and new edits over time for the Best Director nominees]

Best Actor

The Wikipedia activity for new edits and new editors are highly correlated because new editors necessarily show up as new edits. However, we see an interesting and very close race here between Chiwetel Ejiofor (“12 Years a Slave”) and Matthew McConaughey (“Dallas Buyers Club”) for edits, with McConaughey holding a stronger lead among new editors. This suggests older editors were responsible for pushing Ejiofor higher (and he was leading early on), but McConaughey took the lead and ultimately won. Wikipedia won this one.

[Figures: new editors and new edits over time for the Best Actor nominees]

Best Actress

Poor Judi Dench; she appeared to not even be in the running in either metric. Wikipedia activity forecast a Cate Blanchett (“Blue Jasmine”) win, although this appeared to be close among several candidates if the construct is to be believed. Wikipedia won this one.

[Figures: new editors and new edits over time for the Best Actress nominees]

Best Supporting Actress

Lupita Nyong’o (“12 Years a Slave”) accumulated a huge lead over her fellow nominees by Wikipedia activity and won the award.

[Figures: new edits and new editors over time for the Best Supporting Actress nominees]

Other Categories and Future Work

I wasn’t able to run the analysis for Supporting Actor because the Wikipedia API seemed to poop out on Bradley Cooper queries, but it may be a deeper bug in my code too. This analysis can certainly be extended to the “non-marquee” nominee categories as well, but I didn’t feel like typing that much.

I will extend and expand this analysis for both other categories as well as prior years’ awards to see if there are any discernible patterns for forecasting. There may be considerable variance between categories in the reliability of this simple heuristic — Director and Picture may be more politicized than the rest, if I wanted to defend my simplistic model. This type of approach might also be used to compare different awards shows to see if some diverge more than others from aggregate Wikipedia preferences. The hypothesis here is a simple descriptive heuristic and more extensive statistical models that incorporate features such as revenue, critics’ scores, and nominees’ award histories (“momentum”) may produce more reliable results as well.

Conclusion

Wikipedia editing activity over the two months leading up to the 2014 Academy Awards accurately forecast the winners of the Best Actor, Best Actress, and Best Supporting Actress categories but missed the winners of the Best Picture and Best Director categories. These results suggest that differences in editing behavior may, in some cases, reflect collective attention to and aggregate preferences for some nominees over others. Because Wikipedia is a major clearinghouse for individuals who both seek and shape popular perceptions, these behavioral traces may have significant implications for forecasting other types of popular preference aggregation such as elections.

Harvard CRCS talk

Today, I gave a talk in the lunch seminar series at Harvard’s Center for Research on Computation and Society (CRCS) about my dissertation-related work on socio-technical trajectories. Slides are here and the talk was recorded — I’ll post a link to that as soon as it’s available since I need to take some notes on some excellent ideas audience members had about how to extend this research!

EDIT: Link to the video of the presentation.

Boston Marathon bombing

The Boston Marathon bombing occurred less than a mile from where I work and only 300 feet away from my first apartment after graduating from college. Fortunately, all my kith and kin are safe and well. The proximity and severity of this event have motivated me to expand a bit upon the prior analyses I’ve done of other current events on Wikipedia by examining a new type of data: pageviews. The Wikimedia Foundation makes data available about the number of times an article was requested every hour, going back to late 2007. For most purposes, these data can be aggregated at the daily level and accessed via another service made by User:Henrik. It is important to keep in mind that pageviews are requests, not necessarily unique viewers, although the two are obviously highly correlated.
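(For anyone playing along at home, here’s a minimal sketch of pulling one month of daily pageviews from that service. The JSON endpoint format is from memory and the service may change or disappear, so treat this as illustrative.)

import requests

def daily_views(article, yyyymm, lang="en"):
    """Daily pageview counts for one month from User:Henrik's stats service."""
    url = "http://stats.grok.se/json/%s/%s/%s" % (lang, yyyymm, article)
    return requests.get(url).json()["daily_views"]  # {"YYYY-MM-DD": count, ...}

views = daily_views("Osama_bin_Laden", "201105")
print(max(views.items(), key=lambda kv: kv[1]))  # the May 2 spike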

A characteristic feature of the pageviews for a Wikipedia article about a current news event is a peak followed by a decay. Here is an example from May 2011 about Osama bin Laden. On May 1, there were 7,557 views of the article; after the announcement of his death late on May 1, the article was viewed more than 4.5 million times on May 2; and by May 31, the pageviews had dropped back to 23,795. (NOTE: The dates in the chart appear to be off by a day since the announcement happened late EDT on May 1, when it was already May 2 in UTC.) The vast majority of the 7,557 views on May 1 occurred without any knowledge of his death and are similar to the pageview activity over the entire month of April (between ~6,000 and ~11,000). The magnitude of the “burst” of pageview activity on May 2 is obviously indicative of a major event that drove many people to seek information about bin Laden in a narrow period of time.

[Figure: daily pageviews for the “Osama bin Laden” article, May 2011]

Similar patterns of pageview bursts are also found on articles related to bin Laden, such as “Abbottabad” or “United States Naval Special Warfare Development Group”, which are clearly related to the events of that day. Other articles such as “Saudi royal family” also exhibited characteristic bursts of activity around May 2, while articles such as “Bill Clinton” had no characteristic burst. This suggests that some pages were more closely related to the events of that day than others because they received similar bursts of intense attention. In other words, the size of an article’s pageview burst on a given day reflects users suddenly seeking information about a current news event.

Turning our attention away from this Osama bin Laden example and back to the Boston Marathon bombings, the bursts of pageview activity on a set of articles could reveal information about the event itself. Using the “Boston Marathon bombing” article as a seed, I extracted the 140 other articles the bombing article linked to. Of course, the text of this article is highly unstable and some of these links are likely to come and go. Nevertheless, I will use this list of 140 other articles to examine which received the largest bursts of activity. To quantify the magnitude of the pageview bursts across these articles, I simply took the median number of pageviews for all the articles over the 6 week period from March 1 through April 14 as a baseline. Then I took the maximum number of pageviews on either April 15 or April 16 (the most recent dates available). The ratio of these pageviews (maximum during the days following the event divided by the median over the days preceding the event) gives us some idea of which articles saw the greatest increases in pageview activity.

  1. Ground stop: 329.0
  2. Boylston Street: 268.73
  3. Google Person Finder: 237.22
  4. Patriots’ Day: 201.32
  5. Copley Square: 171.53
  6. Controlled explosion: 168.25
  7. The Lenox Hotel: 116.0
  8. Pressure cooker: 83.98
  9. Massachusetts Emergency Management Agency: 83.5
  10. Boston Police Special Operations Unit: 78.43
  11. BB (ammunition): 59.41

This list excludes a number of articles like “Edward F. Davis“ and “Pressure cooker bomb” that did not exist before April 15. However, the size of the bursts of pageview activity on Wikipedia articles (linked from the bombing article itself) conveys a surprising amount of information about the more salient details of the location, timing, cause, and effects of this story.
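(The burst ratio above is simple to compute; here’s a sketch assuming a hypothetical `views` dict mapping each article title to a pandas Series of daily pageviews with a DatetimeIndex, gathered as in the earlier sketch.)

import pandas as pd

def burst_ratio(series):
    baseline = series["2013-03-01":"2013-04-14"].median()  # pre-event baseline
    peak = series["2013-04-15":"2013-04-16"].max()         # post-event peak
    return peak / baseline

ratios = {title: burst_ratio(s) for title, s in views.items()}
for title, ratio in sorted(ratios.items(), key=lambda kv: -kv[1])[:10]:
    print(title, round(ratio, 2))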

Each of these articles’ pageview time series from March 1 through April 17 can be correlated with the others. For example, the correlation between pageviews for “Ball (bearing)” and “Brigham and Women’s Hospital” is 0.99, which strongly suggests the two articles are viewed together. Conversely, the correlation between “Ball (bearing)” and “USA Today” is only 0.13, suggesting the viewing activity for these two articles is generally unrelated. These correlations can be computed for every pair of articles to establish the relationship between their pageview activity. Thresholding the correlations at the 0.5 level, the resulting relationships can be represented as a correlation network (a rough code sketch and the resulting network are below):
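(A sketch of that construction, assuming a hypothetical pandas DataFrame `df` with one column of daily pageviews per article over March 1 through April 17.)

import networkx as nx

corr = df.corr()  # pairwise Pearson correlations between the pageview series

G = nx.Graph()
G.add_nodes_from(corr.columns)
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.at[a, b] > 0.5:  # threshold the correlations at 0.5
            G.add_edge(a, b, weight=corr.at[a, b])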

[Figure: correlation network of pageview time series for articles linked from the bombing article]

This image (click to embiggen) also tells a variety of stories despite the hairballness of the network. There are two distinct clusters of nodes. The bluish cluster corresponds to articles highly correlated with each other because they deal with topics pertaining to the bombing itself; these are the infrequently trafficked articles that suddenly attracted attention together because of the bombings. The greenish cluster on the lower right contains articles that are linked from the bombing article but aren’t tightly correlated with the bombing topics, only with each other. These articles are more frequently trafficked, less closely related to the events themselves, and pertain to major social institutions like newspapers, government agencies, and financial markets. Their clustering suggests that, while only loosely related to the bombing itself, they nevertheless remain closely related to each other over time. Thus, this network suggests at least two distinct patterns of ongoing Wikipedia use: abrupt information seeking about topics that are suddenly in the news versus ongoing information seeking about institutions that are regularly in the news.

As always, this is simply a first cut of the analysis and I’m working on some other analyses that look at the pageview data at an hourly level of resolution and expand the corpus of articles from simply the articles linked from the bombing article to all other English Wikipedia articles. So stay tuned for more.

Co-authorship patterns around Pope Francis

A little late in coming, but here’s a pretty picture based on a conference submission I’m preparing.

  1. Take the revision history of all 607 unique editors who contributed to the article on Pope Francis after 1 Jan 2013.
  2. Get all the other 22,225 articles they revised since the beginning of the year.
  3. From this two-mode network, project a one-mode article-article network where one article is linked to another article if they share an editor in common.
  4. Filter out all the edges where there is only a single editor in common, leaving articles that have been edited by two or more editors in common, and remove the resulting isolates (steps 3-5 are sketched in code after this list).
  5. Identify the largest connected component consisting of 2,671 articles and 3,144 edges.
  6. Visualize! Nodes are sized based on degree and colored based on modularity class. Data (including GraphML files for both the complete graph and LCC, a larger PNG, and a SVG) available here.
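Steps 3 through 5 might be sketched like this in networkx, assuming a hypothetical `revisions` list of (editor, article) pairs from step 2 (and assuming editor names don’t collide with article titles):

import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
for editor, article in revisions:
    B.add_node(editor, bipartite=0)
    B.add_node(article, bipartite=1)
    B.add_edge(editor, article)

articles = {n for n, d in B.nodes(data=True) if d["bipartite"] == 1}
# Step 3: one-mode projection; edge weight = number of shared editors
P = bipartite.weighted_projected_graph(B, articles)
# Step 4: keep edges with two or more editors in common, drop isolates
P.remove_edges_from([(u, v) for u, v, w in P.edges(data="weight") if w < 2])
P.remove_nodes_from(list(nx.isolates(P)))
# Step 5: largest connected component
lcc = P.subgraph(max(nx.connected_components(P), key=len))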

[Figure: largest connected component of the article co-authorship network around Pope Francis]

There’s a lot going on there and much more to see by looking around the full image, but I’ll give a few highlights.

The articles with the strongest ties (most editors in common)? As one would expect, ties between Pope Francis and other papal and Catholic-related articles round out the top 10, but there are some interesting outliers as well: “Pier Luigi Bersani” and “Italian general election, 2013”, with 42 editors in common, actually take first place, and “2013 Malmö FF season” and “2012–13 Svenska Cupen” come in 4th. This is to say these seemingly random articles shared at least 2 editors with the Pope Francis article but were themselves the subject of intense co-editing.

(u'Pier Luigi Bersani', u'Italian general election, 2013', {'weight': 42}),
 (u'List of popes', u'Pope Francis', {'weight': 37}),
 (u'Papal conclave, 2013', u'Pope Francis', {'weight': 31}),
 (u'2013 Malm\xf6 FF season', u'2012\u201313 Svenska Cupen', {'weight': 26}),
 (u'Pope Benedict XVI', u'Pope Francis', {'weight': 24}),
 (u'Papal conclave, 2013', u'Pope Benedict XVI', {'weight': 22}),
 (u'Pope Benedict XVI', u'Resignation of Pope Benedict XVI', {'weight': 22}),
 (u'Papal conclave, 2013', u'Resignation of Pope Benedict XVI', {'weight': 20}),
 (u'South American dreadnought race', u'Argentine\u2013Chilean naval arms race', {'weight': 18}),
 (u'Timeline of Vietnamese history', u'First Chinese domination of Vietnam', {'weight': 18})

Of course, a lot of co-authorship was around other Catholic topics: the papal conclave, Pope Benedict XVI and his resignation, and other cardinal electors:

[Figure: cluster of papal and Catholic-related articles]

There is a lot of co-authorship around other topics that were also in the news:

[Figure: clusters of other topics in the news]

Other current events topics that are by no means peripheral to these coauthorship patterns include updates to Swedish football club rosters as well as editing of articles about members of the Baathist regime in Syria. Strangely, these two disparate topics are clustered together (both by modularity and by layout), suggesting they draw from similar communities of editors.

[Figure: cluster of Swedish football and Syrian regime articles]

If you want to know more, hopefully our paper will be accepted and I can share it :)

Sandy Hook School massacre

If you follow me on Twitter, you’re probably already well-acquainted with my views on what should happen in the wake of the shooting spree that massacred 20 children and 6 educators at a suburban elementary school in Newtown, Connecticut. This post, however, will build on my previous analysis of the Wikipedia article about the Aurora shootings as well as my dissertation examining Wikipedia’s coverage of breaking news events to compare the evolution of the article for the Sandy Hook Elementary School shooting to other Wikipedia articles about recent mass shootings.

In particular, this post compares the behavior of editors during the first 48 hours of each article’s history. The fact that there are 43 English Wikipedia articles about shooting sprees in the United States since 2007 should lend some weight to this much-ballyhooed “national conversation” we are supposedly going to have, but I chose to compare just six of these articles to the Sandy Hook shooting article based on their recency and severity, plus an international example.

Wikipedia articles certainly do not break the news of the events themselves, but the first edits to these articles happen within two to three hours of the event unfolding. Once created, however, these articles attract many editors and changes and grow extremely rapidly.

Figure 1: Number of changes made over time.

The Virginia Tech article, by far and away, attracted more revisions than the other shootings’ articles in the same span of time, ultimately attracting enough revisions in the first 48 hours (5,025) to put it within striking distance of the top 1,000 most-edited articles in all of Wikipedia’s history. Conversely, the Oak Creek and Binghamton shootings, despite having 21 fatalities between them, attracted substantially less attention from Wikipedians and the news media in general, likely because these massacres had fewer victims and the victims were predominantly non-white.

A similar pattern, with VT as an exemplary case, shootings involving immigrants and minorities attracting less attention, and the other shootings exhibiting largely similar behavior, is also found in the number of unique users editing an article over time:

Figure 2: Number of unique users over time.

These editors and the revisions they make cause articles to rapidly increase in size. Keep in mind, the average Wikipedia article’s length (albeit highly skewed by many short articles about things like minor towns, bands, and species) is around 3,000 bytes, and articles above 50,000 bytes can raise concerns about length. Despite the constant back-and-forth of users adding and copyediting content, the Newtown article reached 50kB within 24 hours of its creation. However, in the absence of substantive information about the event, much of this early content is often related to national and international reactions and expressions of support. As more background and context come to light, this list of reactions is typically removed, which can be seen in the sudden contraction of article size in Utøya around 22 hours, and in Newtown and Virginia Tech around 36 hours. As before, the articles about the shootings at Oak Creek and Binghamton are significantly shorter.

Figure 3: Article size over time.

However, not every editor does the same amount of work. The Gini coefficient captures the concentration of effort (in this case, the number of revisions made) across all editors contributing to the article. A Gini coefficient of 1 indicates that all the activity is concentrated in a single editor, while a coefficient of 0 indicates that every editor does exactly the same amount of work.
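(The Gini coefficient itself is straightforward to compute from a list of per-editor revision counts; this is the standard discrete formula, not my analysis code.)

def gini(counts):
    """0 = every editor does the same work; near 1 = one editor does it all."""
    xs = sorted(counts)
    n = len(xs)
    total = float(sum(xs))
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

print(gini([1, 1, 1, 1]))   # 0.0: perfectly even
print(gini([0, 0, 0, 12]))  # 0.75: maximal for n=4, i.e. (n-1)/n, approaching 1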

Figure 4: Gini coefficient of editors’ revision counts over time.

Across all the articles, the edits over the first few hours are evenly distributed: editors make a single contribution and others immediately jump in to also make single contributions. However, around hour 3 or 4, one or more dedicated editors show up and begin to take a vested interest in the article, which manifests in the rapid centralization of activity. This centralization increases slightly over time across all articles, suggesting these dedicated editors continue to edit after other editors move on.

Another way to capture the intensity of activity on these articles is to examine the amount of time elapsed between consecutive edits. Intensely edited articles may have only seconds between successive revisions while less intensely edited articles can go minutes or hours. These data are highly noisy and bursty, so the plot below is smoothed over a rolling average of about 3 hours.
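(The smoothing might be done like this with pandas, assuming a hypothetical sorted DatetimeIndex `timestamps` of one article’s revision times.)

import pandas as pd

deltas = pd.Series(timestamps, index=timestamps).diff().dt.total_seconds().dropna()
smoothed = deltas.rolling("3H").mean()  # rolling average over a ~3 hour window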

Figure 5: Waiting times between edits (y-axis is log-scaled).

What’s remarkable is the sustained level of intensity over a two-day period. The Virginia Tech article was still being edited several times every minute even 36 hours after the event, while other articles were seeing updates every five minutes more than a day after the event. This means that even at 3 am, all these articles were still being updated every few minutes by someone somewhere. There’s a general trend upward, reflecting initially intense activity immediately after the article is created followed by increasing time lags as the article stabilizes, but there’s also a diurnal cycle with edits slowing between 18 and 24 hours after the event before quickening again. This slowing and quickening is seen around 20 hours as well as around 44 hours, suggesting information is released and incorporated in cycles as the investigation proceeds.

Finally, who is doing the work across all these articles? The structural patterns of users contributing to articles also reveal interesting patterns. It appears that much of the editing is done by users who have never contributed to the other articles examined here, but there are a few editors who contributed to each of these articles within 4 hours of their creation.

Figure 6: Collaboration network of articles (red) and the editors who contribute to them (grey) within the first four hours of their existence. Editors who’ve made more revisions to an article have thicker and darker lines.

Users like BabbaQ (edits to Sandy Hook), Ser Amantio di Nicolao (edits to Sandy Hook), and Art LaPella (edits to Sandy Hook) were among the first responders to edit several of these articles, including Sandy Hook. However, their revisions are relatively minor copyedits and reference formatting, reflecting the prolific work they do patrolling recent changes. Much of the substantive content of the article is from editors who have edited none of the other articles about shootings examined here, and likely no other articles about other shootings. In all likelihood, readers of these breaking news articles are mostly consuming the work of editors who have never previously worked on this kind of event. In other words, some of the earliest and most widely read information about breaking news events is written by people with fewer journalistic qualifications than Medill freshmen.

What does the collaboration network look like after 48 hours?

Figure 7: Collaboration network after 48 hours.

3,262 unique users edited one or more of these seven articles, 222 edited two or more, 60 edited three or more, and a single user, WWGB, edited all seven within the first 48 hours of their creation. These editors are at the center of Figure 7, where they connect to many of the articles on the periphery. The stars surrounding each article are the editors who contributed to that article and that article alone (in this corpus). WWGB is an editor who appears to specialize not only in editing articles about current events, but in participating in a community of editors engaged in newswork on Wikipedia. These editors are not the first to respond (as above), but their work involves documenting administrative pages enumerating current events and mediating discussions across disparate current events articles. The ability of these collaborations to unfold as smoothly as they do appears to rest on the ability of Wikipedia editors with newswork experience to either supplant or complement the work done by amateurs who first arrive on the page.

Of course, this just scratches the surface of the types of analyses that could be done on these data. One might look at changes in the structure and pageview activity of each article’s local hyperlink neighborhood to see which related articles are attracting attention, examine the content of the article for changes in sentiment or the patterns of introducing and removing controversial content and unsubstantiated rumors, or broaden the analysis to the other shooting articles. Needless to say, one hopes the cases for future analyses become increasingly scarce.

The IPython Notebook and GEXF network files used in this analysis can be found here.

Edit: As always, Taha Yasseri is on the ball with his analysis of Wikipedia coverage of the events.

Disclosure: I edited the Sandy Hook article twice after publishing this post.

2012 Aurora shootings

The “2012 Aurora shooting” article on Wikipedia is an example of a breaking news article with many editors intensively and jointly editing a single article. As of approximately 1pm EDT on July 21, 290 unique editors have made 1,281 changes to the article in a period of less than 36 hours.

The chart below shows how the size of the article has rapidly grown as well as interesting changes in the concentration of work done per editor (time is recorded in UTC). The blue dots are the length of the article in bytes. We see a rapid growth of the article at about 19:00 UTC which is later reverted, a rapid deceleration of article growth at about 1:00 UTC on 7/21, a momentary expansion of the article at about 6:00 UTC, and a sudden contraction at about 10:00 UTC. The average number of edits per editor (green) also shows interesting damped sinusoidal behavior, with the earliest part of the article involving many contributions from few editors, a sudden decline as many new editors join the collaboration, a rise again as some of these new contributors make many revisions, and then a stabilization around 4 edits per editor.

Of course, the actual work editors are doing on the article is very uneven and follows classic long-tail behavior. Most editors make only a single contribution, but a handful of editors are making dozens of contributions. Users O’Dea and Sandstein each have made more than 65 contributions over the last 36 hours.

Moreover, the activity on this article is also extremely intense. The chart below plots the distribution of time between edits. Again, the vast majority of edits occur within seconds or minutes of each other and only twice in the entire 36 hour history of the article have 20 minutes gone by without someone making a change to the article.

Finally, I used a method to mine the log of revisions made to the article to create a network of users modifying each other’s work (see paper here). For example, if user B makes a change to the article whose previous version was from user A, a directed link is created from user B toward user A: user B modified user A’s version. I can also encode a variety of other information into this network. I color the nodes such that bluer nodes are users who started editing early in the article’s history and redder nodes are users who joined the collaboration later. Larger nodes are editors who have more connections to other editors. Darker links are larger changes made to the article. Larger links are more interactions between editors. Visualization was done in Gephi.
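(A minimal sketch of that construction, assuming a hypothetical revisions.csv sorted oldest-first with an editor column; this is not the actual code mentioned below.)

import csv
import networkx as nx

G = nx.DiGraph()
prev_editor = None
with open("revisions.csv") as f:
    for row in csv.DictReader(f):
        editor = row["editor"]
        if prev_editor is not None and editor != prev_editor:
            # user B revised the version left by user A: directed link B -> A
            if G.has_edge(editor, prev_editor):
                G[editor][prev_editor]["weight"] += 1
            else:
                G.add_edge(editor, prev_editor, weight=1)
        prev_editor = editor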

[Figure: editor interaction network for the “2012 Aurora shooting” article]

Eyeballing the diagram (which is no way to do real analysis) suggests that the most prolific editors joined the collaboration early but were not the first contributors either (they are at approximately 2 o’clock). The dense core of the network consists of several editors who appear to be working closely together, modifying each other’s work; O’Dea is at the center. Most of the editors who joined later (greens, oranges, and reds) are on the periphery of the network, suggesting they make relatively minor contributions, sometimes with other less central editors, but their work is subsequently revised by the central editors. These high-activity editors maintain their involvement over time by interacting with many different types of editors who joined earlier and later (as indicated by being connected to nodes of different colors). However, the largest changes prolific users make are still small relative to the changes other users are making (lighter links compared to the black links elsewhere).

The data for this analysis is based off revision history (CSV here) and was translated into a network format (GEXF here) using custom code I promise I’ll post when it’s not embarrassingly hacky.

Update via Taha Yasseri:

Within 48 hours of the event, 30 Wikipedia language editions had coverage of the event. 10 of these Wikipedias had articles within 6 hours of the article on the English Wikipedia. Interestingly, smaller languages like Latvian (3) and Danish (7) appeared before other major languages like Polish (8), French (10), and German (13).