Big data is big news these days, whether it’s consumer and user data from Google, Amazon, and Walmart, or the government’s big-data grab of phone and email records from the companies we trust, like Google and Verizon. That grab is the latest invasion of civil liberties by the US government, and the Canadian one as it turns out, in a war on terror that threatens to take citizens’ data points hostage.
There is much to be concerned about with big data, from profiling to privacy issues. When it comes to where I work, in the space of scholarly communication, I can see that my responsibilities go beyond those of a concerned citizen (or at least a resident alien abroad). I want to make sure that big data also serves the public good that can come of research and scholarship. Easier said than done, of course, but in ways that caught me off guard.
The difficulties were brought home to me in a recent New York Times story that describes how 70 medical, research and advocacy organizations in 41 countries came to an agreement that genetic data needed to be organized in ways that better serve medical understanding, while respecting what’s left of our privacy and our right of self-possession.
Now that millions of people are likely to have their genomes sequenced in the coming year, we are on the brink of big-data capacities that will greatly advance the ability to understand the health effects of genetic variations and mutations.
The shocker for me, and the point of the article, is that at this stage “there are no agreed-upon standards for representing genetic data or sharing them, experts say.” Or as Francis Collins, director of the National Institutes of Health, put it: “We need standard formats so we don’t have to spend two years figuring out how to merge data together.”
Forgive me, but this lack of standards had slipped past me. Here we have “the largest single undertaking in the history of biological science,” as a 2011 report by Simon Tripp and Martin Gruber of the Battelle Memorial Institute puts it, with “a $3.8 billion investment [that] drove $796 billion in economic impact, created 310,000 jobs and launched the genome revolution.” In all of that investment, employment, and impact, did no body or organization of authority insist that the real value of all of this sequencing could best be realized if the data were collected and shared in a standard, easy-to-use format?
One suspects that among the impediments have been the proprietary hopes of some researchers to patent what they have. Yet some of that resistance to data-sharing may have been dispelled in June, at least in the U.S., by the unanimous Supreme Court ruling against patenting naturally occurring genetic sequences. Only a month earlier, Obama’s White House Office of Science and Technology Policy launched Project Open Data, intended to “accelerate the adoption of open data practices,” which goes along with the Office’s February directive requiring government agencies that award research funds to develop open access policies. Therein lies the hope for the open and public side of big data. (Not in Canada, perhaps, judging by a recent Library and Archives Canada licensing deal that my friend Lon Dubinsky forwarded, which will result in subscription-based access to a portion of the country’s heritage.)
Yet I now realize that open is not nearly good enough when it comes to big data, as the recent agreement to create genetic data standards makes clear. It takes the additional hard work of arriving at a standard for aggregating, indexing, and curating data that will facilitate use and analysis. It seems vitally important, then, for scholarly societies and other bodies to work with that new breed of information scientist known as data curators to identify particular classes of data in any given field with a potential big-data payoff. Then they need to identify policies for privacy and ethical use, data formats and conventions, citation and credit structures, and sustainable archiving and access terms.
Arriving at agreement on such standards, and compliance with them, will not be easy. But I do want to stress how much the quality, efficacy, and value of research can be advanced by this process. Such standards can advance research not only in health, but in education, environment, and energy, in child poverty and civil rights. I will be looking to do my bit and to encourage other researchers to embrace what can be open and big about data that is gathered to advance a public good. Big data has been largely about other things up to now. It can surely be otherwise.