With the 2014 Facebook Experiment behind us and the 2016 U.S. general election ahead of us, discussions about ethics and data science will remain very much in the public eye. Every new revelation and allegation will predictably bring a cycle of unconstructive outrage and over-baked hot takes, which is all fine and good for collecting pageviews if you need to pay the bills. But these cycles prevent us from being able to step back and articulate what data science should be and should do. If data science is to become a profession with values, ethics, and boundaries rather than an occupation, I think data scientists should look to journalism for lessons to adopt into their emerging professional identity.

In every cycle of how data science is made out to be essential, sexy, expensive, over-hyped, unaccountable, chauvinist, etc., we are making judgments of this profession based on our assumptions about what other professions should be and should do: police are essential, actors are sexy, consultants are useless, nurses are trustworthy, politicians are slimy. By definition, these tropes are unfair to many members in each class, but they nevertheless structure our own interactions with these professions by helping us to make sense of their responsibilities to us and others under the best and worst cases. When a new profession enters the scene, we make sense of it by blending in the most similar pre-existing tropes. In the case of data science, I would argue much of its professional identity in popular culture has unfortunately been blended from the deviant tropes of the mad scientist and the black hat.

The mad scientist is an amoral, aloof, or naive conjurer whose preoccupation with success without regard for prior failure has entirely predictable effects (Haynes 2003). In this trope, data scientists become mad scientists because their overzealous commitment to accurate predictions ignores humans’ free will and creativity and predictably leads to totalitarian quantification and automation. The black hat is the anonymous and autonomous master of arcane technologies with no allegiances except to other mysterious hackers (Jordan & Taylor 1998). In this trope, data scientists become black hats because they manipulate unwitting users into complex statistical tests whose results are legible only to other data scientists and incomprehensible to everyone else. Both of these tropes are actively framed and cultivated, not by grassroots alone, but by intellectuals, activists, and other elites advancing their own interests like earning professional credit or advancing their preferred policy agendas. But by placing data scientists’ alleged behavior outside the range of responsibilities and moralities of what professions should be and should do, these tropes justify efforts by non-data scientists to control what data scientists’ work is and how it can be done.

Which is fine in a liberal democratic society: we regulate who can practice medicine or how planes are flown for many good reasons, and being good with numbers isn’t a Get-Out-Of-Social-Responsibility-Free card. The problem is that popular society and elite conversations approach data science as a profession through these massively distorted tropes, which suggest remedies wholly at odds with how actual data scientists do their actual work. If data scientists are mad scientists, just require more oversight to prevent their work from getting out of hand. If data scientists are black hats, just make it harder for them to access the data they crave. Because inaccurate tropes lead to unsuccessful remedies, we need better tropes at a minimum. But we should also look to other professions and how they’ve developed professional identities to influence what data science should be and should do, so as to strike a better balance among the complex and competing interests in the practice of data science.

I suggest data scientists should look to journalism as we develop a professional identity. Of course journalism isn’t going to replace data science’s close affinities to engineering and academic cultures, but journalists do have decades of experience developing norms and practices to balance their considerable privileges against an enormous amount of scrutiny and risk. These include ombudspeople, a culture of public criticism, and ethical guidelines that are developed and enforced at the level of professional societies as well as corporate management. How these manifest in journalists’ professional identities can in turn inform how professional data scientists might develop (and enforce) norms about dealing with sensitive information, navigating conflicting values, developing a shared culture, and persisting through change. This argument draws from Michael Schudson’s (1989) sociology of organizations and occupational ideology that explores the tensions between “journalists’ professed autonomy and decision-making power and…[constraints] by organizational and occupational routines.” Asking a similar question of data scientists, how do they balance the autonomy they have to analyze data and influence decisions against the limitations imposed by managers, methods, and regulations?

1. Both are formally sanctioned to gather and analyze sensitive data. Journalists have explicit legal rights to publish and are granted privileged access to elites while being hemmed in by news cycles and editorial judgments of what’s newsworthy. Data scientists are similarly granted privileged autonomy and access to private data while also being constrained by development cycles and strategic priorities of what’s production-worthy. In both cases, access to and control over sensitive data enable journalists and data scientists to make sense of complex and discrepant information, communicate their findings, and ideally enable others to act upon them. But this mapping is not perfect: journalists write for a public audience using privileged data about others while data scientists analyze privileged data about users for management (and sometimes academics or the users themselves). But it’s also the case that victims of journalistic breaches of this trust generally have far fewer legal remedies than victims of data breaches.

2. Both face complex and conflicting incentives. Schudson calls the political economy of the news the “consonance between profit-seeking industry and system-maintaining news.” Journalists are trained to be simultaneously autonomous and accountable to editors but under pressure to deliver eyeballs and not piss off advertisers. Data scientists face similarly complex incentives as they are trained and hired to be simultaneously autonomous analysts making sense of complex information but accountable to engineering and management to deliver products and recommendations without pissing off managers and advertisers. But there has not yet been a prominent whistleblowing case wherein a data scientist performed and disclosed analyses involving privileged information to serve the public interest like we have seen in business, engineering, or even hacking. Of course being in the public eye creates prestige, attracts scrutiny, and makes enemies. Moreover, publicity and ambition can lead to overreach, which alienates supporters or validates criticisms. Many newsrooms manage these conflicting values in part by erecting a firewall between the editorial and business sides of the organization so editorial judgments are not clouded by business decisions while subjecting stories to editorial judgment and fact-checking. Whether or how data teams could make themselves similarly insulated from business interests and open to fact-checking is unclear because they are so integral to strategy and operations on the business side with no similar incentives to publish.

3. Both are challenged to develop a shared culture. This is what Schudson calls the “culturological approach” that examines broader cultural forces around values, beliefs, and symbols that animate journalists and their audiences. In (American) journalism, values like egalitarianism, centrism, and accountability explain why some stories have popular resonance, drive political change, win awards, etc. Both journalists and data scientists perform rituals about objectivity to insulate themselves from criticism despite carrying considerable ideological baggage about what is or isn’t worthy of investigation. For instance, journalists internalize news values that prioritize some stories over others based on the values of their editors, colleagues, and audiences. Data scientists also internalize values about what makes a good analysis based on its compatibility with their academic training, efficiency of solving a problem, or originality of a finding. For example, a blog post about activity tracker users during an earthquake becomes popular because of its timeliness (hours after the event), cleanliness (a classic natural experiment), novelty (few other companies could do it), and relevance (the results are intuitive). These “conspicuous analyses” are often as much a performance for other data scientists for the purposes of prestige and recruitment as they are for general interest publicity. In both cases, the absence of any strong barriers to entry also suggests the need for shared values to define and regulate a professional identity as ways of marking who is “in” or “out”. Questioning professionals’ commitment to these occupational values becomes an expedient way of manipulating their inquiry as well: watching journalists contort themselves in response to politicians’ accusations about their lack of neutrality has become a sport of late. Data scientists would do well to reflect on how their professional values could also be manipulated.

4. Both face disruptive changes. Mark Feldstein examined the history of American journalism and proposed a basic 2×2 framework where high/low demand for social change and high/low supply of media content interact to create different styles of investigative journalism. Early 20C “muckraking” and Vietnam-era journalists navigated eras of major social turmoil and rapid technological change that disrupted traditional fact-gathering and publishing. But those modes cannot sustain themselves when audience interests shift and organizational practices stabilize. Early 21C data scientists face a similar environment of profound social change and technological competition, but they should remember that “this too shall pass”. What is taken for granted now in terms of prestige, skill, and practices could rapidly change: if the widespread engineering-adjacent model shifts to an external consulting model or managers switch to data engineers without pricey PhDs in experimental particle physics who can do 90% of the data science job for 10% of the cost, there may not be chairs for everyone when the music stops. Professional identity and values can provide some guidance through these disruptions.

In addition to developing alternatives to the deviant tropes I outlined above, alternative professional identities like journalism also invite speculation about how the professionalization of data science could unfold in the decades to come. My history of journalism may be a Whiggish one that fails to account for very real differences between the fields, but maybe these alignments between sensemaking workers of different eras point to where we might see data science develop going forward around new professional models of autonomy, ethics, legal rights, and division of labor. Journalism certainly is not the only model for professional identity data scientists might look to: consider accountants and financial advisors, inspectors and product testers, or librarians and archivists. Each has developed a distinctive professional identity to deal with sensitive information, manage conflicting incentives, mark the boundaries of the profession, and sustain the profession during profound change. But this also raises practical questions of who will do the reforming and where it will come from in the first place. Even if prominent data scientists were to be proactive in calling for their colleagues to imagine alternative norms and practices drawn from other professions, when do we start doing it? I’d welcome thoughts on strategies and precedents about how to professionalize data science, but this seems like the topic of a whole new blog post!

I’ll leave you with a particularly evocative framework (below): Hanitzsch’s seven dimensions of journalism culture. An assumption undergirding my argument is that a single monolithic “journalism culture” exists and it should be transplanted into the single monolithic “data science culture”. Neither obviously exists in practice. The extent to which ad-impression maximizers and social do-gooders are both “data scientists” despite the huge differences in their ability to intervene, proximity to power, orientation to market concerns, reflexivity, use of methods, ethics, and goals will clearly vary. Despite the diversity in both journalism’s and data science’s cultures, I suspect there will be substantial consensus about what data scientists should be and should do as we look to other professions for their hard-won lessons.

Acknowledgements

Thanks to Nathan Matias, Nick Diakopoulos, Deen Freelon, and others for their feedback on earlier versions of this post.

Bibliography

Feldstein, M. (2006). A Muckraking Model: Investigative Reporting Cycles in American History. The Harvard International Journal of Press/Politics, 11(2), 105–120. doi:10.1177/1081180X06286780

Hanitzsch, T. (2007). Deconstructing Journalism Culture: Toward a Universal Theory. Communication Theory, 17(4), 367–385. doi:10.1111/j.1468-2885.2007.00303.x

Haynes, R. (2003). From alchemy to artificial intelligence: Stereotypes of the scientist in western literature. Public Understanding of Science, 12(3), 243–253. doi:10.1177/0963662503123003

Jordan, T., & Taylor, P. (1998, November). A sociology of hackers. The Sociological Review, 46(4), 757–780. doi:10.1111/1467-954X.00139

Schudson, M. (1989). The sociology of news production. Media, Culture & Society, 11(3), 263–282. doi:10.1177/016344389011003002