With the 2014 Facebook Experiment behind us and the 2016 U.S. general election ahead of us, discussions about ethics and data science will remain very much in the public eye. Every new revelation and allegation will predictably bring a cycle of unconstructive outrage and over-baked hot takes, which is all fine and good for collecting pageviews if you need to pay the bills. But these cycles prevent us from stepping back and articulating what data science should be and should do. If data science is to become a profession with values, ethics, and boundaries rather than just an occupation, I think data scientists should look to journalism for lessons to adopt into their emerging professional identity.
It’s that time of year when everyone writes either a year-in-review article or a predictions-for-next-year article. The Wikimedia Foundation offered one of their own that showed the remarkable capacity of Wikipedia to support collaboration around current events.
On Monday, October 27, Andy Baio posted an analysis of 72 hours of tweets with the #Gamergate hashtag. With the very best of intentions, he also shared the underlying data, containing over 300,000 tweets saved as a CSV file. There are several technical and potential ethical problems with that, which I’ll get to later, but in a fit of “rules are for thee, not for me,” I grabbed this very valuable data while I could, knowing that it wouldn’t be up for long.