A network analysis of Bob Ross’s paintings

While it makes me look like I’m stalking him, Walt Hickey is doing some of the more fun data analysis on the new FiveThirtyEight. His recent April 13 article examined features from the entire canon of Bob Ross paintings, and he happily made this data available within days. The data is stored as a rectangular matrix with episodes (pieces) as rows and the specific features as columns. A piece like “Mountain-By-The-Sea” from Season 9, Episode 12 is coded as containing 10 features: “beach”, “bushes”, “clouds”, “cumulus”, “deciduous”, “mountain”, “ocean”, “snowy mountain”, “tree”, and “waves”.

To a network scientist like me, this is a common way of storing relationships. We typically imagine social networks as people connected to other people via friendship links, but we can also create many other kinds of networks, like people connected to employers or the ingredients contained within recipes. These latter two networks are called “bipartite networks” because they involve two distinct types of nodes (people and employers, ingredients and recipes). We can create a bipartite network of paintings from Hickey’s data where a painting is connected to its features and these features are in turn shared by other paintings. In the figure below, we can see that some features like trees are highly central (bigger text means more paintings contain that feature). Conversely, features like “beach”, “ocean”, or “waves” are more peripheral on the upper left.

Bipartite network of features and paintings
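Building that bipartite network from Hickey’s data takes only a few lines. Here is a minimal sketch with pandas and networkx, assuming the CSV is laid out as described above, with EPISODE and TITLE columns followed by 0/1 feature columns (the file name is my guess at the published data, not something to take on faith):

```python
import pandas as pd
import networkx as nx

# Assumed file name and layout: one row per episode, EPISODE and TITLE columns
# followed by 0/1 columns for each feature.
df = pd.read_csv("elements-by-episode.csv")
feature_cols = [c for c in df.columns if c not in ("EPISODE", "TITLE")]

# Build the bipartite graph: paintings on one side, features on the other.
B = nx.Graph()
for _, row in df.iterrows():
    B.add_node(row["TITLE"], bipartite="painting")
    for feature in feature_cols:
        if row[feature] == 1:
            B.add_node(feature, bipartite="feature")
            B.add_edge(row["TITLE"], feature)

# A feature's degree = how many paintings contain it (text size in the figure above).
feature_counts = {f: B.degree(f) for f in feature_cols if f in B}
```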

For example, while “Mountain-By-The-Sea” is described by the 10 features mentioned above, other paintings share some of these same features. There are 26 other paintings, such as “Surf’s Up” and “High Tide”, that also contain the “beach” feature, and these beach-related paintings have still other features that they share with yet more paintings. Through a shared feature like “bushes”, “Mountain-By-The-Sea” is indirectly connected to “Mountain in an Oval”.

Performing a mathematical operation to convert the bipartite network into a more traditional network, we can connect paintings directly to each other if they share a feature in common. In addition to having “bushes” in common, “Mountain-By-The-Sea” and “Mountain in an Oval” also share “clouds”, “cumulus”, “deciduous”, “mountain”, and “tree”. The operation assigns the relationship between “Mountain-By-The-Sea” and “Mountain in an Oval” a weight of 6 because they share 6 features.
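The “mathematical operation” is a one-mode projection, which is just a matrix product: if M is the painting-by-feature 0/1 matrix, then M multiplied by its transpose counts shared features for every pair of paintings. A minimal sketch, continuing from the data frame in the earlier sketch:

```python
import numpy as np

# Assumes df and feature_cols from the earlier sketch.
M = df[feature_cols].to_numpy()   # paintings x features, 0/1 incidence matrix
W = M @ M.T                       # paintings x paintings; W[i, j] = features shared by i and j
np.fill_diagonal(W, 0)            # a painting trivially "shares" everything with itself
titles = df["TITLE"].tolist()     # row/column labels for W
```

(networkx’s bipartite.weighted_projected_graph does the same thing if you would rather stay in graph objects.)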

While Ross was a prolific painter, we know well that some features like trees occur more often than others. As a result, almost every painting is connected to every other painting by at least one feature, which isn’t terribly interesting to visualize as it would be a big “hairball.” Instead, we can take advantage of the fact that paintings have different levels of similarity to each other. We’ll use the SBV backbone algorithm developed by some of my network science colleagues at Northeastern University to extract only those relationships between paintings that are particularly interesting and then visualize the resulting network.
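If I have the acronym right, the SBV backbone is the disparity filter of Serrano, Boguñá, and Vespignani: an edge survives only if its weight is a surprisingly large share of at least one endpoint’s total weight. My actual analysis is in the GitHub repo linked below; here is a minimal sketch of the core test (the function name and alpha threshold are my own illustrative choices):

```python
import networkx as nx

def disparity_backbone(G, alpha=0.05):
    """Keep an edge if it is statistically surprising for at least one endpoint."""
    backbone = nx.Graph()
    for node in G:
        k = G.degree(node)
        if k <= 1:
            continue  # a lone edge can't be "surprising" relative to siblings
        strength = sum(d["weight"] for _, _, d in G.edges(node, data=True))
        for _, neighbor, d in G.edges(node, data=True):
            p = d["weight"] / strength
            if (1 - p) ** (k - 1) < alpha:  # disparity-filter p-value for this edge
                backbone.add_edge(node, neighbor, weight=d["weight"])
    return backbone

# e.g., G = nx.from_numpy_array(W) using the projection above, then:
# backbone = disparity_backbone(G, alpha=0.05)
```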

Using an excellent network visualization tool called Gephi, I laid out the network and colored it below. Again, a painting (represented as a label below) is connected to another painting if the two share features in common. The SBV algorithm removed links that do exist in the full network, but we can interpret the links that remain as the “most important” ones. I then used Gephi’s modularity function to discover “subcommunities” of paintings within this network that are highly similar to each other. These communities are represented by the six colors below.

  • Red: These are ocean scenes, which are not unusual in Ross’s repertoire — there are many paintings in here — but they share few features with other paintings in his oeuvre. Note that “Mountain-By-The-Sea” is one of the three paintings that connect this group back to the rest.
  • Yellow: These are the “waterfall” paintings.
  • Green: These are the “mountain” paintings.
  • Purple: These are the “snow” paintings.
  • Light blue: These are the “homes” paintings.
  • Dark blue: Something of a catchall.

Projected painting-to-painting network

We can also compute the “entropy” of all the features Ross used in a given season. Entropy is a way of measuring how disordered a system is; in our case, how the appearances of features are spread across a season. If Ross painted only a single feature over and over, the entropy would be 0. If he used many different features in roughly comparable amounts, the entropy would be high. Basically, entropy is another way of measuring how experimental or diverse Ross’s work was in a given season by how many different features he used and how evenly he used them.
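Concretely, the per-season entropy calculation looks like the following sketch, assuming the data frame from the earlier sketches and that the EPISODE column encodes the season in its first three characters (e.g. “S09”):

```python
import numpy as np

def shannon_entropy(counts):
    """Entropy (in bits) of a vector of feature counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Assumes df and feature_cols from the earlier sketches, and that EPISODE
# looks like "S09E12" so its first three characters give the season.
df["season"] = df["EPISODE"].str[:3]
season_entropy = (
    df.groupby("season")[feature_cols]
      .sum()                        # how many times each feature appears per season
      .apply(shannon_entropy, axis=1)
)
```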

The first two seasons have very low entropy, suggesting a very conservative approach. Seasons 3 and 4 see a drastic departure in style and much higher entropy (more experimentation) before the entropy falls again in Seasons 5 through 7 to a more conservative set of features. Season 8 shows another surge in experimentation, with different features being used in different amounts during the season. Season 13 shows some retrenchment back to more conservative habits, followed again by a more experimental season. But in general, there’s a pattern of low entropy in the early years, a middle period of high levels of experimentation, followed in the later years by a consistent but moderate level of differentiation.

Entropy over time

The full code for my analysis of Hickey’s data is on GitHub.

The Need for Openness in Data Journalism

Do films that pass the Bechdel Test make more money for their producers? I’ve replicated Walt Hickey’s recent article in FiveThirtyEight to find out. My results confirm his own in part, but also find notable differences that point to the need for clarification at a minimum. While I am far from the first to make this argument, this case is illustrative of a larger need for journalism and other data-driven enterprises to borrow from hard-won scientific practices of sharing data and code as well as supporting the review and revision of findings. This admittedly lengthy post is a critique of not only this particular case but also an attempt to work through what open data journalism could look like.

The Angle: Data Journalism should emulate the openness of science

New data-driven journalism ventures such as FiveThirtyEight have faced criticism from many quarters, particularly around the naïveté of assuming credentialed experts can be bowled over by quantitative analysis as easily as the terrifyingly innumerate pundits who infest our political media [1,2,3,4]. While I find these critiques persuasive, I depart from them here to instead argue that I have found this “new” brand of data journalism disappointing foremost because it wants to perform science without abiding by scientific norms.

The questions of demarcating what is or is not science are fraught, so let’s instead label my gripe a “failure to be open.” By openness, I don’t mean users commenting on articles or publishing whistleblowers’ documents. I mean “openness” more in the sense of “open source software” where the code is made freely available to everyone to inspect, copy, modify, and redistribute. But the principles of open-source software trace their roots more directly back to norms in the scientific community that Robert Merton identified and that came to be known as the “CUDOS” norms. It’s worth reviewing two of these norms because Punk-ass Data Journalism is very much on the lawn of Old Man Science, and therein lie possibilities for exciting adventures.

The first and last elements of Merton’s “CUDOS” norms merit special attention for our discussion of openness. Communalism is the norm that scientific results are shared and become part of a commons that others can build upon — this is the bit about “standing upon the shoulders of giants.” Skepticism is the norm that claims must be subject to organized scrutiny by the community — which typically manifests as peer review. Both of these norms strongly motivated the philosophy of the open source movement, and while they are practiced imperfectly in my experience within the social and information sciences (see my colleagues’ recent work on the “Parable of Google Flu”), I nevertheless think data journalists should strive to make them their own practice as well.

  1. Data journalists should be open in making their data and analysis available to all comers. This flies in the face of traditions and professional anxieties surrounding autonomy, scooping, and the protection of sources. But how can claims be evaluated as true unless they can be inspected? If I ask a data journalist for her data or code, is she bound by the same norms as a scientist to share it? Where and how should journalists share and document this code and data?
  2. Data journalists should be open in soliciting and publishing feedback. Sure, journalists are used to clearing their story with an editor, but have they solicited an expert’s evaluation of their claims? How willing are they to publish critiques of, commentary on, or revisions to their findings? If not, what are the venues for these discussions? How should a reporter or editor manage such a system?

The Guardian’s DataBlog and ProPublica have each been doing exemplary work in posting their datasets, code, and other tools for several years. Other organizations like the Sunlight Foundation develop outstanding tools to aid reporters and activists, the Knight Foundation has been funding exciting projects around journalism innovation for years, and the Data Journalism Handbook reviews other excellent cases as well. My former colleague at Medill, Professor Richard Gordon, reminded me that ideas around “computer-assisted reporting” have been in circulation in the outer orbits of journalism for decades. For example, Philip Meyer has been (what we would now call) evangelizing since the 1970s for “precision journalism” in which journalists adopt the tools and methods of the social and behavioral sciences as well as their norms of sharing data and replicating research. Actually, if you stopped reading now and promised to read his 2011 Hedy Lamarr Lecture, I wouldn’t even be mad.

The remainder of this post is an attempt to demonstrate some ideas of what an “open collaboration” model for data journalism might look like. To that end, this article tries to do many things for many audiences, which admittedly makes it hard for any single person to read. Let me try to sketch some of these out now and send you off on the right path.

  • First, I use an article Walt Hickey of FiveThirtyEight published on the relationship between the financial performance of films and the extent to which they grant their female characters substantive roles as a case to illustrate some pitfalls in both the practice and interpretation of statistical data. This is a story about having good questions, ambiguous models, wrong inferences, and exciting opportunities for investigation going forward. If you don’t care for code or statistics, you can start reading at “The Hook” below and stop after “The Clip” below.
  • Second, for those readers who are willing to pay what one might call the “Iron Price of Data Journalism”, I go “soup to nuts” and attempt to replicate Hickey’s findings. I document all the steps I took to crawl and analyze this data to illustrate the need for better documentation of analyses and methods. This level of documentation may be excessive or it may yet be insufficient for others to replicate my own findings. But providing this code and data may expose flaws in my technical style (almost certainly), shortcomings in my interpretations (likely), and errors in my data and modeling (hopefully not). I actively invite this feedback via email, tweets, comments, or pull requests and hope to learn from it. I wish new data journalism enterprises adopted the same openness and tentativeness in their empirical claims. You should start reading at “Start Your Kernels…”
  • Third, I want to experiment with styles for analyzing and narrating findings that make both available in the same document. The hope is that motivated users can find the detail and skimmers can learn something new or relevant while being confident they can come back and dive in deeper if they wish. Does it make sense to have the story up front and the analysis “below the fold” or to mix narrative with analysis? How much background should I presume or provide about different analytical techniques? How much time do I need to spend on tweaking a visualization? Are there better libraries or platforms for serving the needs of mixed audiences? This is a meta point as we’re in it now, but it’ll crop up in the conclusion.
  • Fourth, I want to experiment with technologies for supporting collaboration in data journalism by adopting best practices from open collaborations in free software, Wikipedia, and elsewhere. For example, this blog post is not written in a traditional content-management system like WordPress, but in an interactive “notebook” whose code you can download and execute to verify that it works. Furthermore, I’m also “hosting” this data on GitHub so that others can easily access the writeup, code, and data, see how it’s changed over time (and has it ever…), and suggest changes that I should incorporate. These can be frustrating tools with demoralizing learning curves, but they are incredibly powerful once mastered. Moreover, there are amazing resources and communities that exist to support newcomers, and new tools are being released to flatten these learning curves. If data journalists joined data scientists and data analysts in sharing their work, it would contribute to an incredible knowledge commons of examples and cases that lowers the bar for others who want to learn. This is also a meta point since it exists outside of this story, but I’ll come back to it in the conclusion.

In this outro to a very unusual introduction, I want to thank Professor Gordon from above, Professor Deen Freelon, Nathan Matias, and Alex Leavitt for their invaluable feedback on earlier drafts of this… post? article? piece? notebook?

The Hook: The Bechdel Test article in FiveThirtyEight

Walt Hickey published an article on April 1 on FiveThirtyEight, titled The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women. The article examines the relationship between movies’ finances and their portrayals of women using a well-known heuristic called the Bechdel test. The test has three simple requirements: a movie passes the Bechdel test if there are (1) two women in it, (2) who talk to each other, (3) about something besides a man.

Let me say at the outset, I like this article: it identifies a troubling problem, asks important questions, identifies appropriate data, and brings in relevant voices to speak to these issues. I should also include the disclaimer that I am not an expert in the area of empirical film studies like Dean Keith Simonton or Nick Redfern. I’ve invested a good amount of time in criticizing the methods and findings of this article, but to Hickey’s credit, I also haven’t come across any scholarship that has attempted to quantify this relationship before: this is new knowledge about the world. Crucially, it speaks to empirical scholarship that has exposed how films with award-winning female roles are significantly less likely to win awards themselves [5], older women are less likely to win awards [6], actresses’ earnings peak 17 years earlier than actors’ earnings [7], and male and female critics rate films differently [8]. I have qualms about the methods, and others may be justified in complaining that it overlooks related scholarship like the work I cited above, but this article is in the best traditions of journalism that focuses our attention on problems we should address as a society.

Hickey’s article makes two central claims:

  1. We found that the median budget of movies that passed the test…was substantially lower than the median budget of all films in the sample.
  2. We found evidence that films that feature meaningful interactions between women may in fact have a better return on investment, overall, than films that don’t.

I call Claim 1 the “Budgets Differ” finding and Claim 2 the “Earnings Differ” finding. As summarized here, the claims are relatively straightforward to test: is there an effect of Bechdel scores on budgets and earnings once we control for other explanatory variables?
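As a sketch of what such a test looks like in code (the file, data frame, and column names here are hypothetical placeholders, not Hickey’s actual variables or data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder file and columns: 'roi', a 0/1 'passes_bechdel' indicator, and 'budget'.
movies = pd.read_csv("movies.csv")

# Does passing the Bechdel test predict return on investment, controlling for budget?
model = smf.ols("roi ~ passes_bechdel + budget", data=movies).fit()
print(model.summary())   # the coefficient on passes_bechdel is the quantity at issue
```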

But before I even get to running the numbers, I want to examine the claims Hickey made in the article. His interpretations of the return-on-investment results are particularly problematic readings of basic statistics. Hickey reports the following findings from his models (emphasis added).

We did a statistical analysis of films to test two claims: first, that films that pass the Bechdel test — featuring women in stronger roles — see a lower return on investment, and second, that they see lower gross profits. We found no evidence to support either claim.

On the first test, we ran a regression to find out if passing the Bechdel test corresponded to lower return on investment. Controlling for the movie’s budget, which has a negative and significant relationship to a film’s return on investment, passing the Bechdel test had no effect on the film’s return on investment. In other words, adding women to a film’s cast didn’t hurt its investors’ returns, contrary to what Hollywood investors seem to believe.

The total median gross return on investment for a film that passed the Bechdel test was $2.68 for each dollar spent. The total median gross return on investment for films that failed was only $2.45 for each dollar spent.

…On the second test, we ran a regression to find out if passing the Bechdel test corresponded to having lower gross profits — domestic and international. Also controlling for the movie’s budget, which has a positive and significant relationship to a film’s gross profits, once again passing the Bechdel test did not have any effect on a film’s gross profits.

Both models (whatever their faults, and there are some, as we will explore in the next section) apparently produce an estimate that the Bechdel test has no effect on a film’s financial performance. That is to say, the statistical test could not determine with greater than 95% confidence that the correlation between these two variables was greater or less than 0. Because we cannot confidently rule out the possibility of there being zero effect, we cannot make any claims about its direction.

Hickey argues that passing the test “didn’t hurt its investors’ returns”, which is to say there was no significant negative relationship, but neither was there a significant positive relationship: the model provides no evidence of a positive correlation between Bechdel scores and financial performance. However, Hickey switches gears and, in the conclusion, writes:

…our data demonstrates that films containing meaningful interactions between women do better at the box office than movies that don’t

I don’t know what analysis supports this interpretation. The analysis Hickey just performed, again taking the findings at face value, concluded that “passing the Bechdel test did not have any effect on a film’s gross profits,” not that “passing the Bechdel test increased the film’s profits.” While Bayesians will cavil about frequentist assumptions — as they are wont to do — and the absence of evidence is not evidence of absence, the “Earnings Differ” finding is not empirically supported under any appropriate interpretation of the analysis. The appropriate conclusion from Hickey’s analysis is “there is no relationship between the Bechdel test and financial performance,” which he makes… then ignores.

What to make of this analysis? In the next section, I summarize the findings of my own analysis of the same data. In the subsequent sections, I attempt to replicate the findings of this article, and in so doing, highlight the perils of reporting statistical findings without adhering to scientific norms.

The Clip: Look in here for what to tweet

I tried to retrieve and re-analyze the data that Hickey described in his article, but came to some conclusions that were the same, others that were very different, and still others that I hope are new.

I was able to replicate some of his findings but not others, because specific decisions that had to be made about the data or modeling dramatically change the results of the statistical models. However, the article provides no specifics, so we’re left to wonder when and where these findings hold, which points to the need for openness in sharing data and code. Specifically, while Hickey found that women’s representation in movies had no significant relationship with revenue, I found a positive and significant relationship.

But the questions and hypotheses Hickey posed about systematic biases in Hollywood were also the right ones. With a reanalysis using different methods as well as adding in new data, I found that statistically significant differences in popular ratings also exist. These differences persist in the face of other potential explanations, such as differences arising from genres, MPAA ratings, time, and other effects.

In the image below, we see that movies with non-trivial women’s roles get 24% lower budgets, make 55% more revenue, get better reviews from critics, and face harsher criticism from IMDB users; a short sketch after the list below shows how such percentages fall out of a log-scale model. Bars that are faded out mean my models are less confident that a finding is non-random, while darker bars mean my models are more confident that it is a significant finding.

Movies passing the Bechdel test (the red bars):

  • …receive budgets that are 24% smaller
  • …make 55% more revenue
  • …are awarded 1.8 more Metacritic points by professional reviewers
  • …are awarded 0.12 fewer stars by IMDB’s amateur reviewers
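Where do percentages like 24% and 55% come from? When a regression’s dollar outcomes are modeled on a log scale (a standard way to handle skewed budget and revenue figures), a coefficient b translates into a percentage change of exp(b) − 1. A minimal sketch with illustrative values rather than my fitted coefficients:

```python
import numpy as np

def pct_effect(coef):
    """Percent change in the un-logged outcome implied by a log-scale coefficient."""
    return (np.exp(coef) - 1) * 100

# Illustrative values only, chosen to land near the percentages above.
pct_effect(-0.27)   # about -24%: budgets roughly 24% smaller
pct_effect(0.44)    # about +55%: revenue roughly 55% higher
```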

 

Takeaway

Read the entire replication here.

 

Conclusions

The four main findings from this analysis of the effects of women’s roles in movies are summarized in the chart above. They point to a paradox in which movies that pass an embarrassingly low bar for female character development make more money and are rated more highly by critics, but have to deal with lower budgets and more critical community responses. Is this definitive evidence of active discrimination in the film industry and culture? No, but it suggests systemic prejudices are leading producers to irrationally ignore significant evidence that “feminist” films make them more money and earn higher praise.

The data that I used here were scraped from public-facing websites, but those who are more familiar with how these data are generated may have reasons to think they are inaccurate. Similarly, the models I used here are simple Stats 101 ordinary least squares regression models with some minor changes to account for categorical variables and skewed data. There are no Bayesian models, no cross-validation or bootstrapping, and no exotic machine learning methods here. But in making the data available (or at least the process for replicating how I obtained my own data), others are welcome to perform and share the results of such analyses — and this is ultimately my goal in asking data journalism to adopt the norms of open collaboration. When other people take their methodological hammers or other datasets to the finding and still can’t break it, we have greater confidence that the finding is “real”.
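For readers who want a concrete picture of what “Stats 101 OLS with minor changes” looks like in code, here is a minimal sketch: the skewed dollar amounts are log-transformed and the categorical controls are dummy-coded by the formula interface. The file and variable names are placeholders, not the exact specification, which lives in the replication notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data; substitute the scraped dataset described in the replication.
movies = pd.read_csv("movies.csv")

# Log the skewed dollar amounts; C() dummy-codes the categorical controls.
model = smf.ols(
    "np.log(revenue) ~ passes_bechdel + np.log(budget) + C(genre) + C(mpaa_rating) + C(year)",
    data=movies,
).fit()
print(model.summary())
```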

But the length and technical complexity of this post also raise the question of who the audience for this kind of work is. Journalistic norms emphasize quick summaries turned around rapidly, with opaque discussions of methods and analysis, and definitive claims. Scientific norms emphasize more deliberative and transparent processes that prize abstruse discussions and tentative claims about their “truth”. I am certainly not saying that Hickey should have included the output of regression models in his article — 99% of people won’t care to see that. But in the absence of soliciting peer reviews of this research, how are we as analysts, scientists, and journalists to evaluate the validity of the claims unless the code and data are made available for others to inspect? Even this is a higher bar than many scientific publications hold their authors to (and I’m certainly guilty of not doing more to make my own code and data available), but it should be the standard, especially for a genre of research like data journalism where the claims reach such large audiences.

However, there are exciting technologies for supporting this kind of open documentation and collaboration. I used an “open notebook” technology called IPython Notebook to write this post in such a way that the text, code, and figures I generated are all stitched together into one file. You’re likely reading this post on a website that lets you view any such notebook on the web, where other developers and researchers share code about how to do all manner of data analysis. Unfortunately, the Notebook was not intended as a word processing or blogging tool, so the lack of features such as more dynamic layout options or spell-checking will frustrate many journalists (apologies for the typos!). However, there are tools for customizing the CSS so that it plays well (see here and here). The code and data are hosted on GitHub, which is traditionally used for software collaboration, but its features for others to discuss problems in my analysis (the issue tracker) or propose changes to my code (pull requests) promote critique, deliberation, and improvement. I have no idea how these will work in the context of a journalistic project, and to be honest, I’ve never used them before, but I’d love to try and see what breaks.

Realistically, practices only change if there are incentives to do so. Academic scientists aren’t awarded tenure on the basis of writing well-trafficked blogs or high-quality Wikipedia articles; they are promoted for publishing rigorous research in competitive, peer-reviewed outlets. Likewise, journalists aren’t promoted for providing meticulously documented supplemental material or replicating other analyses instead of contributing to coverage of a major news event. Amidst contemporary anxieties about information overload as well as the weaponization of fear, uncertainty, and doubt tactics, data-driven journalism could serve a crucial role in empirically grounding our discussions of policies, economic trends, and social changes. But unless its new leaders set and enforce standards that emulate the scientific community’s norms, this data-driven journalism risks falling into traps that can undermine the public’s and the scientific community’s trust.

This suggests several models going forward:

  • Open data. Data-driven journalists could share their code and data on open source repositories like GitHub for others to inspect, replicate, and extend. But as any data librarian will rush to tell you, there are non-trivial standards for ensuring that data are documented, complete, well-formatted, and non-proprietary.
  • Open collaboration. Journalists could collaborate with scientists and analysts to pose questions that they jointly analyze and then write up as articles or features as well as submitting for academic peer review. But peer review takes time and publishing results in advance of this review, even working with credentialed experts, doesn’t imply their reliability.
  • Open deliberation. Organizations that practice data-driven journalism (to the extent this is different from other flavors of journalism) should invite and provide empirical critiques of their analyses and findings. Making well-documented data available or finding the right experts to collaborate with are extremely time-intensive, but if you’re going to publish original empirical research, you should accept and respond to legitimate critiques.
  • Data ombudsmen. Data-driven news organizations might consider appointing independent advocates to represent public interests and promote scientific norms of communalism, skepticism, and empirical rigor. Such a position would serve as a check against authors making sloppy claims, using improper methods, analyzing proprietary data, or acting for their personal benefit.

I have very much enjoyed thinking through many of these larger issues and confronting the challenges of the critiques I’ve raised. I look forward to your feedback, and I very much hope this drives conversations about what kind of science data-driven journalism hopes to become.

 

Checklist for Reviewing (And Thus Writing) A Research Paper

The instructions given to the program committee members for GROUP 2014 are unusually detailed and well-organized. I think they provide a great jumping-off point for scholars to reflect on how they review as well as write papers. Kudos to Program Co-Chairs David McDonald and Pernille Bjørn for writing these up — reviewers and authors everywhere should take notice and step up their game!

1: Briefly SUMMARIZE the main points of the submission:

2: Describe the IMPORTANCE OF THE WORK:
-Is it new, original and/or innovative?
-Do they make a case in the paper itself for its originality (relationship to previous work)?
-Is it a useful, relevant and/or significant problem?
-Do they sufficiently motivate the problem?
-Is the work done on the problem useful, relevant and/or significant?
-Did you learn anything?
-Is this submission appropriate for the GROUP conference?

3: Describe the CLARITY OF GOALS, METHODS AND CREDIBILITY OF RESULTS:
-Are the goal(s) clear?
-Do they clearly describe what was done and/or how it was studied?
-Are the method(s) and/or analysis used to achieve the goal appropriate and used correctly?
-Do they provide sufficient data and/or well-supported arguments?
-Do the results and discussion follow from the method and/or argumentation? Are they believable?
-Does the paper cover all the important issues at the appropriate level of detail?
-Do they cite relevant work?
-Is there sufficient detail so that another researcher can replicate (more or less) the work?

4: Describe the QUALITY OF THE WRITING, addressing the following:
-Is the writing clear and concise?
-Is the paper appropriately focused?
-Is the paper well-organized?
-Do they provide the right level of detail?
-Will the paper be understandable to the GROUP audience, including international readers?
-Are figures clear?
-Any figures needed/not needed?

5: Provide any OTHER COMMENTS you believe would be useful to the author (including pointers to missing relevant work).

6: SUMMARIZE your ASSESSMENT of the paper, pointing out which aspects described above you weighted most heavily in your rating.

HGSE Guest Lecture

Many thanks to Professor Karen Brennan for inviting me to speak to her class “Teacher Learning and Technology” about “Data“. I’ve posted the very brief slides I used here. Certainly nothing new under the sun, but hopefully above the median level of discourse about “big data.”

The ensuing discussion with the students was excellent and we covered a broad range of topics such as surveillance and privacy, mixed methods research, reducing the learning curve for analysis tools, and the importance of formulating research questions.

 

Does Wikipedia editing activity forecast Oscar wins?

The Academy Awards just concluded and much will be said about Ellen DeGeneres’s most-retweeted tweet (my coauthors and I have posted an analysis here that shows these “shared media” or “livetweeting” events disproportionately award attention to already-elite users on Twitter). I thought I’d use the time to try to debug some code I’m using to retrieve editing activity information from Wikipedia.

A naive but simple theory I wanted to test was whether editing activity could reliably forecast Oscar wins. Academy Awards are selected from approximately 6,000 ballots and the process is known for intensive lobbying campaigns to sway voters as well as tapping into the zeitgeist about larger social and cultural issues.

I assume that some of this lobbying and zeitgeist activity would manifest in the aggregate as edits to the English Wikipedia articles about the nominees. In particular, I measure two quantities: (1) the number of changes (revisions) made to each article and (2) the number of new editors making revisions. The hypothesis is simply that the articles about nominees with the most revisions and the most new editors should win. I look specifically at the time between the announcement of the nominees in early January and March 1 (an arbitrary cutoff).
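My actual collection code is messier (and, as noted below, buggy in places), but the core of the idea is a query against the MediaWiki revisions API. In this sketch the article title and date window are placeholders, and counting distinct editors within the window is a simplification of the “new editors” measure:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def revisions(title, start, end):
    """Fetch up to 500 revisions of an article between two ISO timestamps."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user",
        "rvdir": "newer",        # oldest first, so rvstart is the earlier date
        "rvstart": start,
        "rvend": end,
        "rvlimit": "max",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("revisions", [])

# Example article and window (placeholders); real runs need to page through the
# API's "continue" responses for heavily edited articles.
revs = revisions("12 Years a Slave (film)", "2014-01-01T00:00:00Z", "2014-03-01T00:00:00Z")
n_edits = len(revs)
n_editors = len({r["user"] for r in revs if "user" in r})   # distinct editors in the window
```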

I’ve only run the analysis on the nominees for Best Picture, Best Director, Best Actor, Best Actress, and Best Supporting Actress (the Best Supporting Actor nominees were throwing some unusual errors, but I’ll update). The results below show that Wikipedia editing activity forecast the wins in Best Actor, Best Actress, and Best Supporting Actress, but did not do so for Best Picture or Best Director. This is certainly better than chance and I look forward to expanding the analysis to other categories and prior years.

Best Picture

“The Wolf of Wall Street” showed remarkably strong growth in the number of new edits and editors after January 1. However, “12 Years a Slave”, which ranked 5th by the end, actually won the award. A big miss.

New editors and new edits for the Best Picture nominees

Best Director

The Wikipedia activity here showed strong growth for Steve McQueen (“12 Years a Slave”), but Alfonso Cuarón (“Gravity”) took the award despite coming in 4th on both metrics here. Another big miss.

New editors and new edits for the Best Director nominees

Best Actor

The counts of new edits and new editors are highly correlated because new editors necessarily show up as new edits. However, we see an interesting and very close race here between Chiwetel Ejiofor (“12 Years a Slave”) and Matthew McConaughey (“Dallas Buyers Club”) on edits, with McConaughey holding a stronger lead among new editors. This suggests established editors were responsible for pushing Ejiofor higher (and he was leading early on), but McConaughey took the lead and ultimately won. Wikipedia won this one.

New editors and new edits for the Best Actor nominees

Best Actress

Poor Judi Dench: she appeared not even to be in the running on either metric. Wikipedia activity forecast a Cate Blanchett (“Blue Jasmine”) win, although this appeared to be close among several candidates, if the construct is to be believed. Wikipedia won this one.

New editors and new edits for the Best Actress nominees

Best Supporting Actress

Lupita Nyong’o (“12 Years a Slave”) accumulated a huge lead over the other nominees in Wikipedia activity and won the award.

New edits and new editors for the Best Supporting Actress nominees

Other Categories and Future Work

I wasn’t able to run the analysis for Supporting Actor because the Wikipedia API seemed to poop out on Bradley Cooper queries, but it may be a deeper bug in my code too. This analysis can certainly be extended to the “non-marquee” nominee categories as well, but I didn’t feel like typing that much.

I will extend and expand this analysis for both other categories as well as prior years’ awards to see if there are any discernible patterns for forecasting. There may be considerable variance between categories in the reliability of this simple heuristic — Director and Picture may be more politicized than the rest, if I wanted to defend my simplistic model. This type of approach might also be used to compare different awards shows to see if some diverge more than others from aggregate Wikipedia preferences. The hypothesis here is a simple descriptive heuristic and more extensive statistical models that incorporate features such as revenue, critics’ scores, and nominees’ award histories (“momentum”) may produce more reliable results as well.

Conclusion

Wikipedia editing activity over the two months leading up to the 2014 Academy Awards accurately forecast the winners of the Best Actor, Best Actress, and Best Supporting Actress categories but significantly missed the winners of the Best Picture and Best Director categories. These results suggest that differences in editing behavior may, in some cases, reflect collective attention to and aggregate preferences for some nominees over others. Because Wikipedia is a major clearinghouse for individuals who both seek and shape popular perceptions, these behavioral traces may have significant implications for forecasting other types of popular preference aggregation, such as elections.

Introduction to Zotero

I will be giving a talk on February 25 for members of the Northeastern University Lab for Texts, Maps, and Networks on how to use the excellent and open-source reference management tool, Zotero. This is a great tool for managing all the citations you need to keep track of for projects, organizing all those papers you should be reading, syncing changes to the cloud, and creating resources for sharing references with collaborators. And it’s much less evil than EndNote or Mendeley :)

Details on the talk time and location are here, but if you’re too busy to follow the link, the talk will be from 12:00 – 1:30pm in 400B Holmes Hall (map). I’m not sure if there will be a video recording, but if there is, I’ll post the link when it becomes available.

The slides I’ll be using are available here; they’re based on an existing tutorial and the excellent documentation.

Harvard CRCS talk

Today, I gave a talk in the lunch seminar series at Harvard’s Center for Research on Computation and Society (CRCS) about my dissertation-related work on socio-technical trajectories. Slides are here and the talk was recorded — I’ll post a link to that as soon as it’s available since I need to take some notes on some excellent ideas audience members had about how to extend this research!

EDIT: Link to the video of the presentation.

Site hacked and rebuilding now

Bad: Thanks to a number of people who pointed out that my site got hacked and had links to unsavory topics popping up in Google.

Worse: I was overzealous in removing the previous install and inadvertently deleted the directories containing the data and papers I had previously made available.

Please bear with me over the next few days and weeks as I try to go through the process of trying to track all these files down and get them properly linked back to their posts and such. Please contact me directly if there’s some data or code you need in the interim. My apologies in advance for any inconveniences this may cause!

Stalk and Snipe These Computational Social Scientists

There are a number of academic job searches at great schools this year for positions around computational social science. While my two-body constraints will keep me rooted in Boston through 2016 and thus keep me from applying, I’ve had several folks reach out to ask for advice about candidates they should seek out. Because interdisciplinarity has a weakness in failing to provide a focal point around which worthy candidates can be lauded by peers, I thought this might be a great opportunity to enumerate those peers whose work and thinking around computational social science and its variants (social computing, network science, etc.) I really respect — and who thus should be included in hiring committees’ binders full of men or women!

This list is drawn from conversations with colleagues, interactions with people at various conferences and workshops, and people I’ve gotten to know through other channels like Twitter. As such, this is not an exhaustive search, and it is necessarily biased by friendship, shared interests, and shared memberships, but I’ve made some minimal effort to eschew obvious conflicts of interest arising from prior collaborations and shared affiliations (which disqualifies a whole bunch of awesome people who have worked alongside me with Darren, Nosh, and David!). In their defense, the people below did not ask about, nor were they informed of, being on such a list, so it should certainly not be interpreted as self-promotion or as unhappiness with their current positions. And with such an ad hoc methodology, it’s certainly incomplete, and I apologize in advance for omitting people.

I’ve broken the list up into two categories based on where they are in their careers. The first is “Stalk” for folks who are too young to be on the job market yet, but are up-and-coming rockstars who I’m confident will be in high demand once they are on the market. Keep an eye on them and, if you’re able to, pick them up as interns or visiting students while you can, before they migrate to the next category. The second category is for folks who are in the vicinity of the academic job market and should be directly targeted for recruitment, or “Sniped.” There’s obviously a third category of junior people who already have jobs in academia or industry who could be stolen, but I wouldn’t presume to speak to their interest in such a conversation!

I present the inaugural job market Stalk and Snipe list in alphabetical order.

Stalk

Snipe

How not to run a (academic) hackathon

I think the academy and academic conferences would do well to incorporate more of the hackathon ethos and norms into their stodgy and plodding culture. So I like hackathons, I really do. But I wrote much of the following response to an email discussion about how to make a hackathon successful, and I would also like to get other folks’ thoughts since my experience with hackathons is both limited to a few cases and biased towards cases that are more academically focused. This is my read on the cultural norms and best practices, and it is almost certainly internally inconsistent. However, with those disclaimers, here are some of my thoughts on why I’ve seen hackathons falter. This is mostly a list of chipping away everything that doesn’t look like a successful hackathon, rather than an affirmative list of how to make a hackathon successful.

1. Thematically underconstrained. While a strawman, “Let’s get together and scrape some Twitter data!” is the surest way to have a hackathon crash and burn. There needs to be some combination of the following: a documented dictionary, cleaned data, or a clear method, and always some defined site, question, or outcome. Taking each of those in turn, a hackathon might do any one of the following: go explore a canonical dataset like GSS, ACS, or AddHealth for geospatial data; take messy Reddit data and have folks clean and code it up using some extant framework; allow people to learn some SNA concepts using Wikipedia data. The point is there are clear boundaries about where the hackathon starts and stops to prevent people from being paralyzed at the outset by questions of scope, but people are free to pursue a variety of agendas, methods, and questions within those constraints.

2. Mis-distribution of technical expertise. This can arise either from not having enough people with the requisite skills or from permitting technical people to self-select into working with other technical people. The result is that the people who know how to scrape, process, and build are either overworked or not pushed out of their comfort zone. Hackathons absolutely demand a willingness to learn new methods and techniques — people who reject out of hand the possibility of learning to program, design, or do basic statistical tests obviously should not be invited. However, it’s also unrealistic to think that people are going to pick up Python scripting in a few hours, so there should be repertoires of activities that allow non-technical folks to clean data, code data, search for related work and documentation, do UX and wireframing, etc.

3. Regression to the academy. Like members of any other profession, academics want to revert to comfortable paradigms. However, a hackathon should not become a lecture with technical folks teaching others to do basic scripting, a troubleshooting session around a single person’s bug, or an airing of concerns and quibbles over a method or approach. The overriding focus should be on getting something done (even poorly) rather than talking abstractly about how things might be done better. The format should be radically participatory and resist a focus on deliverables, authority, or schedule.

4. Over-ambition. To the extent the evocativeness of the portmanteau needs unpacking: hackathons (hack + marathon) are not meant to be quick or lightweight. A hackathon is not something that happens in a 90-minute session, it should not be filler between other activities, and it should not have to compete with alternatives that require less commitment. The hackathon should carry no expectations of “success”, and people shouldn’t be expected to continue the work outside of it. There’s both a floor (~3 hours) and a ceiling (~10 hours) on how long people should be able to work in a day, and these hours should be clearly communicated.

5. Lack of respect for participants and their time. As an organizer, a hackathon is not an opportunity to get a bunch of people together to do your coding work for free. You do not let a bunch of folks sign up under the auspices of one theme, then switch it at the last minute to another. You do not forget to have enough space, power, beverages and food, or other support for the people who are volunteering their time and expertise. You do not fail to have the authority to protect the space from ideologues, influence peddlers, or people with toxic personalities by intervening or ejecting people if necessary.

6. Unclear rules. These are probably best settled as a group, but there are basic procedural questions to answer. What are the norms about people tweeting, taking photos, or emailing colleagues about what they’re doing in the hackathon? Will people be able to continue to use this data or code afterwards, and under what conditions? If individuals have non-public access to data or other resources, how obligated are they to contribute that to the hackathon? Who is responsible for cleaning up afterwards? Does anyone need to be compensated for space, food, etc.? Are people allowed to work on or access each other’s machines? What kinds of services will we use for collaboration, and how can we accommodate others who can’t or won’t use those services (e.g., gDocs, Dropbox, etc.)?

What are others’ thoughts? Have I missed something basic here about the culture and ethos of hackathon culture? Does anyone have good affirmative or positive advice on how to make a hackathon successful?