AAAS is here – connecting government to the scientific community. As a part of its efforts to introduce fully open government, the White House is reaching out to the scientific community for a conversation around America’s national scientific and technological priorities. To enable the White House’s dialogue with scientists, AAAS launched Expert Labs, under the direction of blogger and tech guru Anil Dash. Expert Labs is building online tools that allow government agencies to ask questions of the scientific community and then sort and rank the answers they receive. On April 12, 2010, AAAS asked scientists everywhere to submit their ideas to the Obama administration and at the same time launched the first of Expert Labs tools, Think Tank, to help policy makers collect the subsequent responses. The result was thousands of responses to the White House’s request, many of which are already under consideration by the Office of Science and Technology Policy. As a AAAS member, your dues support our efforts to help government base policy on direct feedback from the scientific community. If you are not already a member, join us. Together we can make a difference.

To learn more, visit aaas.org/plusyou/expertlabs

2011 Data Collections Booklet

EDITORIAL
Making Data Maximally Available (p. 5) – Brooks Hanson, Andrew Sugden, Bruce Alberts

NEWS
Rescue of Old Data Offers Lesson for Particle Physicists (p. 6) – Andrew Curry
Is There an Astronomer in the House? (p. 8) – Sarah Reed
May the Best Analyst Win (p. 10) – Jennifer Carpenter

PERSPECTIVES
Climate Data Challenges in the 21st Century (p. 12) – Jonathan T. Overpeck, Gerald A. Meehl, Sandrine Bony, David R. Easterling
Challenges and Opportunities of Open Data in Ecology (p. 15) – O. J. Reichman, Matthew B. Jones, Mark P. Schildhauer
Changing the Equation on Scientific Data Visualization (p. 17) – Peter Fox and James Hendler
Challenges and Opportunities in Mining Neuroscience Data (p. 20) – Huda Akil, Maryann E. Martone, David C. Van Essen
The Disappearing Third Dimension (p. 24) – Timothy Rowe and Lawrence R. Frank
Advancing Global Health Research Through Digital Technology and Sharing Data (p. 26) – Trudie Lang
More Is Less: Signal Processing and the Data Deluge (p. 29) – Richard G. Baraniuk
Ensuring the Data-Rich Future of the Social Sciences (p. 31) – Gary King
Metaknowledge (p. 33) – James A. Evans and Jacob G. Foster
Access to Stem Cells and Data: Persons, Property Rights, and Scientific Progress (p. 37) – Debra J. H. Mathews, Gregory D. Graff, Krishanu Saha, David E. Winickoff
On the Future of Genomic Data (p. 40) – Scott D. Kahn

AAAS is here – promoting universal science literacy.

In 1985, AAAS founded Project 2061 with the goal of helping all Americans become literate in science, mathematics, and technology. With its landmark publications Science for All Americans and Benchmarks for Science Literacy, Project 2061 set out recommendations for what all students should know and be able to do in science, mathematics, and technology by the time they graduate from high school. Today, many of the state standards in the United States have drawn their content from Project 2061.

Every day Project 2061 staff use their expertise as teachers, researchers, and scientists to evaluate textbooks and assessments, create conceptual strand maps for educators, produce groundbreaking research and innovative books, CD-ROMs, and professional development workshops for educators, all in the service of achieving our goal of universal science literacy.

As a AAAS member, your dues help support Project 2061 as it works to improve science education. If you are not yet a AAAS member, join us. Together we can make a difference.

To learn more, visit aaas.org/plusyou/project2061


Brooks Hanson is Deputy Editor for physical sciences at Science. Andrew Sugden is Deputy Editor for biological sciences and International Managing Editor at Science. Bruce Alberts is Editor-in-Chief of Science.

EDITORIAL

Making Data Maximally Available

Science is driven by data. New technologies have vastly increased the ease of data collection and consequently the amount of data collected, while also enabling data to be independently mined and reanalyzed by others. And society now relies on scientific data of diverse kinds; for example, in responding to disease outbreaks, managing resources, responding to climate change, and improving transportation. It is now obvious that making data widely available is an essential element of scientific research. The scientific community strives to meet its basic responsibilities toward transparency, standardization, and data archiving. Yet, as pointed out in a special section of this issue (pp. 692–729), scientists are struggling with the huge amount, complexity, and variety of the data that are now being produced.

Recognizing the long shelf-life of data and their varied applications, and the close relation of data to the integrity of reported results, publishers, including Science, have increasingly assumed more responsibility for ensuring that data are archived and available after publication. Thus, Science and other journals have strengthened their policies regarding data and, as publishing moved online, added supporting online material (SOM) to expand data presentation and availability. But it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.

Science's policy for some time has been that "all data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science" (see www.sciencemag.org/site/feature/contribinfo/). Besides prohibiting references to data in unpublished papers (including those described as "in press"), we have encouraged authors to comply in one of two ways: either by depositing data in public databases that are reliably supported and likely to be maintained or, when such a database is not available, by including their data in the SOM. However, online supplements have too often become unwieldy, and journals are not equipped to curate huge data sets. For very large databases without a plausible home, we have therefore required authors to enter into an archiving agreement, in which the author commits to archive the data on an institutional Web site, with a copy of the data held at Science. But such agreements are only a stopgap solution; more support for permanent, community-maintained archives is badly needed.

To address the growing complexity of data and analyses, Science is extending our data access requirement listed above to include computer codes involved in the creation or analysis of data. To provide credit and reveal data sources more clearly, we will ask authors to produce a single list that combines references from the main paper and the SOM (this complete list will be available in the online version of the paper). And to improve the SOM, we will provide a template to constrain its content to methods and data descriptions, as an aid to reviewers and readers. We will also ask authors to provide a specific statement regarding the availability and curation of data as part of their acknowledgements, requesting that reviewers consider this a responsibility of the authors. We recognize that exceptions may be needed to these general requirements; for example, to preserve the privacy of individuals, or in some cases when data or materials are obtained from third parties, and/or for security reasons. But we expect these exceptions to be rare.

As gatekeepers to publication, journals clearly have an important part to play in making data publicly and permanently available. But the most important steps for improving the way that science is practiced and conveyed must come from the wider scientific community. Scientists play critical roles in the leadership of journals and societies, as reviewers for papers and grants, and as authors themselves. We must all accept that science is data and that data are science, and thus provide for, and justify the need for the support of, much-improved data curation.

– Brooks Hanson, Andrew Sugden, Bruce Alberts

10.1126/science.1203354


NEWS

Rescue of Old Data Offers Lesson for Particle Physicists
Old data tends to get forgotten as physicists move on to new and better machines. The tale of the JADE experiment suggests that they should be more careful

In the mid-1990s, Siegfried Bethke decided to take another look at an experiment he participated in around 2 decades earlier as a young particle physicist at DESY, Germany's high-energy physics lab near Hamburg. Called JADE, it was one of five experiments at DESY's PETRA collider, which smashed positrons and electrons into each other. Looking at the strength of the force that binds quarks and gluons into protons and neutrons, JADE finished in 1986 when DESY closed down PETRA to build a more powerful collider. In the decades since, new theoretical insights had come along, and Bethke hoped the old data from JADE—taken at lower collision energies—would yield fresh information.

What the physicist found was a disaster. Since JADE shut down and the experiment's funding ended, the data had been scattered across the globe, stored haphazardly on old tapes, or lost entirely. The fate of the JADE data is, however, typical for the field: Accustomed to working in large collaborations and moving swiftly on to bigger, better machines, particle physicists have no standard format for sharing or storing information. "There's funding to build, collect, analyze, and publish data, but not to preserve data," says Salvatore Mele, a physicist and data preservation expert at the CERN particle physics lab near Geneva, Switzerland.

This tendency has prompted some in the field to call for better care to be taken of data after an experiment has finished. For a very small fraction of the experiment's budget, they argue, data could be preserved in a form usable by later generations of physicists. To promote this strategy, researchers from a half-dozen major labs around the world, including CERN, formed a working group in 2009 called Data Preservation in High Energy Physics (DPHEP). One of the group's aims is to create the new post of "data archivist," someone within each experimental team who will ensure that information is properly managed.

Physics archaeology
For the founders of DPHEP, Bethke's struggles with the JADE data are both an inspiration and a cautionary tale. It took Bethke,

now the head of the Max Planck Institute for Physics in Munich, Germany, nearly 2 years—and a lot of luck—to reconstruct the data. Originally stored on magnetic tapes and cartridges from old-style mainframes, most of it had been saved by a sentimental colleague who copied the few gigabytes of data to new storage media every few years. Other data files turned up at the University of Tokyo. A stack of 9-track magnetic storage tapes was stashed in a Heidelberg physics lab. One critical set of calibration numbers survived only as ASCII text printed on reams of green printer paper found when a DESY building was being cleaned out. Bethke’s secretary spent 4 weeks reentering the numbers by hand. Even then, much of the data couldn’t be read. Software routines written in arcane IBM assembler codes such as SHELTRAN and MORTRAN, tweaked for ’70s-era computers for which memory was at a premium, and stored on long-deactivated personal accounts, were lost forever. A graduate student spent a year recreating code used to run the numbers. The recovery work was motivated by more than Bethke’s nostalgia. In the years since JADE ended, new theories about what physicists call the strong coupling strength had emerged. These predict phenomena that can best be seen at lower energies than today’s colliders are able to replicate. By reanalyzing the old data, Bethke’s team squeezed more than a dozen high-impact scientific publications out of the resurrected JADE data. Some of the data helped confirm quantum chromodynamics, the theory governing the interior of atomic nuclei, and was cited by the committee that awarded the 2004 Nobel Prize in physics to David Gross, David Politzer, and Frank Wilczek. “It was like physics archaeology,” Bethke says today. “It took a lot of work. It shouldn’t be like that. If this was properly planned before the end of the experiment, it could have all been saved.” The usefulness of JADE’s old data may not be an isolated occurrence. “Big installations are more high-energy, but they don’t replace data taken at lower energy levels,” says Cristinel Diaconu, a particle physicist at DESY. “The reality is a lot of experiments done in the past are unique; they’re not going to be repeated at that energy.” If anything, the need to better preserve


Back on track. Particle-track data from the 1980s-era JADE experiment after restoration by Siegfried Bethke’s team in 1999.


particle physics data has grown more urgent in the past few years as CERN's Large Hadron Collider (LHC) captured the world's attention and a handful of other high-profile projects—BaBar at the SLAC National Accelerator Laboratory, Japan's KEK collider, and the latest DESY experiments—wrap up work and prepare to disband. "In the past, experiments were smaller and more frequent. Now we build very big devices that cost a lot of money and person power over a number of years," says Diaconu. "Each experiment is one application, built specifically for the task." The LHC alone represents nearly a half-century of work, with 20 years invested in design and construction and 20 years of scheduled operation. There will never be another experiment like it.

The issue, experts say, isn't data degradation. "The problem starts when the experiment is over, and the data used by one group of people is only understood by those people," Diaconu says. "When they go off and do other things, the data is orphaned; it has no parents anymore." The orphan metaphor only goes so far: After a certain point, orphaned data can't be adopted by later researchers who weren't part of the original team. Even given the raw data, only someone intimately involved in the original experiment can make sense of it. "The analysis is so complex that to understand the data you have to be there with it, working on the experiment," says SLAC database manager Travis Brooks. "There's a whole spectrum of things you need to keep around if you want petabytes [10^15 bytes] of data to be useful."

That spectrum includes everything from internal notes that explain the ins and outs of analyses, to subprograms designed to massage numbers for specific experiments. And then there's the fuzzy-sounding "metainfo," the hacks and undocumented software tweaks made by a team in the midst of a project and then quickly forgotten. Making it worse, particle physicists don't usually share their data outside their collaborations the way most peer-reviewed scientists do. "We don't publish the data, because it's something like a petabyte—you can't just attach the raw data in a ZIP file," Brooks says. As a result, there's been no incentive to

find a standard format for the raw information that would be readable to outsiders.

A data librarian
To give shuttered experiments a future, the DPHEP working group is looking for ways to keep data in working order long after the original collaboration has disbanded.

Down but not out. Siegfried Bethke works on the JADE detector in 1984 (above). A display screen (top) announces the end of the experiment.

Typically, software that can make sense of the data is custom-made to run on servers that are optimized for the experiment and shut down when funding runs out. And the constant churn of technology can make software and storage media obsolete within a matter of years. "The data can't be read if the software can't be run," Brooks says.

One option is to "virtualize" the software, creating a digital layer that simulates the computers the experiment was originally run on. With regular updates and maintenance, software designed to run on the UNIX machines of today could be rerun on the computers of the future the same way people nostalgically play old Atari games on new PCs, for example.

To capture and preserve the less tangible aspects of a particle physics experiment, the working group has suggested the job of data archivist. The archivist would be in charge of baby-sitting the data and standardizing the software used to read it, helping to justify huge investments in the big machines of physics by making data usable by future researchers or useful as a teaching tool. The idea has been endorsed by the International Committee for Future Accelerators, an advisory group that helps coordinate international physics experiments. DPHEP is also pushing data preservation among funding agencies, arguing that the physics experiments of the future should be designed with a data-preservation component to help justify their cost.

Diaconu admits that the idea has a way to go before it captures the minds of young physicists focused on publishing new data. "Some people say, 'Can you imagine how boring, to sit and look at old data for 20 years?'" he says. "But look at a librarian. Part of their job is taking care of books and making sure you can access them."

A data archivist would be a mix of librarian, IT expert, and physicist, with the computing skills to keep porting data to new formats but savvy enough about the physics to be able to cross-check old results on new computer systems. The DPHEP group estimates that archivists—and the computing and storage resources they'd need to keep data current long after an experiment ended—would cost 1% of a collider's total budget. That can be a hefty financial commitment: It would amount to $90 million for CERN. But keeping data in a usable form would provide a return on the investment in the form of later analyses, the group argues. Says Diaconu: "Data collection may stop, but it's not true that's the end of the experiment." –ANDREW CURRY

Andrew Curry is a freelance writer based in Berlin.


NEWS

Is There an Astronomer in the House?
With biomedical researchers analyzing stars and astronomers tackling cancer, two unlikely collaborations creatively solve data problems

In 2004, Alyssa Goodman had a problem. An astronomer at Harvard University, she and her colleagues were working on a project called COMPLETE, a survey of star-forming regions; now they had to analyze massive amounts of data that were tricky to visualize in only two dimensions. Goodman wanted a three-dimensional view of the regions, but the tools available to astronomers weren't up to the task. So she went in search of the answer elsewhere.

Goodman presented her problem at a workshop called Visualization Research Challenges, held at the headquarters of the U.S. National Institutes of Health (NIH) in Bethesda, Maryland. In the audience was Michael Halle, a computer scientist at Brigham and Women's Hospital in Boston, who recognized that the technology Goodman needed already existed in medicine. His department had previously developed a piece of visualization software, called 3D Slicer, for use with medical scans such as MRIs. Halle thought it could handle Goodman's astronomical data set as well.

His hunch was right. And the unusual collaboration that formed between his team and Goodman's still exists today at Harvard in a data-analysis project called Astronomical Medicine. It's not the only odd pairing of astronomers and biomedical researchers motivated by the need to deal with data. At the University of Cambridge's Institute of Astronomy (IoA) in the United Kingdom, Nicholas Walton uses sophisticated computer algorithms to analyze large batches of images, picking out faint, fuzzy objects. When he isn't looking for distant galaxies, nebulae, or star clusters, the astronomer lends his data-handling skills to the hunt for cancer.

From stars to biomarkers
Walton and his colleagues work on a project called PathGrid in which image-analysis software developed for astronomy is being used to automate the study of pathology slides. Pathologists stain tissue samples to identify various biomarkers that indicate a cancer's aggressiveness. Currently, they must inspect each slide personally with a microscope, but PathGrid aims to improve on this time-consuming and subjective endeavor. The key behind the project is the surprising similarity between images of tissue samples and the cosmos: Spotting a cancerous cell buried in normal tissue is like finding a single star in a crowded stellar field. "There's a natural overlap in astronomy and medicine for needing to identify and quantify indistinct objects in large data sets," says oncologist James Brenton of the Cancer Research UK Cambridge Research Institute, who, with Walton, leads the PathGrid project.

When deciding on the best course of treatment for a person with breast cancer,


pathologists look for different biomarkers—specific proteins—in the patient's cancerous tissue. For example, an overexpression of the biomarker human epidermal growth factor receptor 2 (HER2) indicates a more aggressive form of breast cancer with a poorer prognosis. To spot such biomarkers, pathologists use a technique called immunohistochemical (IHC) screening. First they treat the tissue with antibodies that bind to targeted proteins, such as HER2; then a secondary antibody highlights the binding by undergoing a chemical reaction that produces a colored stain.

At present, however, there are only a handful of well-validated biomarkers for cancer, and even fewer that reveal how a patient is likely to respond to a specific treatment, says Brenton. "That's because there's a bottleneck between new biomarker discoveries and being able to put them into clinical practice," he says. "Discoveries are made with relatively small groups of tens to a few hundred patients, but their usefulness needs to be validated on sample sizes of hundreds or several thousands of patients." Initial discovery studies are small-scale because pathologists must manually assess images in the IHC screening process, qualitatively scoring them for the abundance of a particular biomarker as well as the intensity of the staining. What was needed, Brenton thought, was a way of automating this time-consuming task: a computer algorithm that could accurately pick out stained tissue of varying shapes and sizes in cluttered images. That's when Walton came on the scene.


Surprisingly similar. A picture of the center of our galaxy and a slide of stained cancerous tissue show a common need to pick out indistinct objects in both types of images.

"We've been developing algorithms to extract information from large telescope surveys at the IoA for years. The algorithms are robust to various backgrounds, such as stars, galaxies, and gas," says Walton. He and Brenton met at the first Cambridge eScience Centre Scientific Forum held in 2002. The scientists, along with colleagues from their respective fields, talked at length about the possibility of using astronomy algorithms in cancer-screening image analysis. Walton and his IoA colleague Mike Irwin found that transferring those algorithms to medical use was painless. "In a pilot scheme to investigate the feasibility of the project, we found that we had to make virtually no changes at all, just tweaking the odd parameters here and there," says Walton.

PathGrid has performed well in tests, Walton and Brenton say. In a study that checked 270 breast cancer images for a biomarker called estrogen receptor (ER), PathGrid agreed with pathologists' scorings for 88% of the positive slides and 93% of the negative ones. (An ER-positive tumor has a better prognosis than an ER-negative one and is treated by suppressing the production of the hormone estrogen.) A larger test, which looked for the biomarker HER2 in more than 2000 images, yielded even more impressive success rates of 96% and 98%, respectively. Walton says he thinks the results would have been even higher if not for "the subjective manner in which pathologists rate images."

PathGrid is consistent yet speedy, says Walton. To analyze a batch of a few hundred images for one specific biomarker would take PathGrid only a few minutes, he says, compared with about 3 hours for a pathologist. PathGrid is just one of several "virtual pathology" projects now under way, notes Laoighse Mulrane of the University College Dublin School of Biomolecular and Biomedical Science, a cancer researcher who recently reviewed this area. Yet its "novel" use of astronomy-based techniques makes it stand out from the pack, he says: "The collaborative efforts of the groups involved should be applauded." PathGrid is now ready to undergo trials within a hospital environment in the United Kingdom. "Hopefully, if everything goes well, it could be used as routine, automated screening in hospitals within 3 years," says Walton.

Adding another dimension
The partnership between astronomy and medicine works both ways, as Goodman and her colleagues have shown with the Astronomical Medicine project. For the COMPLETE survey, Goodman already had some visualization and analysis tools, however, they were geared toward dealing with 2D images, whereas she also wanted to see how the velocity of gas in star-forming regions changed along the line of sight—essentially treating velocity like a third dimension. "COMPLETE contained very large 'position-position velocity' maps of star-forming regions that had been made to date, and we wanted to see and understand this data all at once," she says.

Analyzing 3D images has been an important part of diagnostic medicine for many years, but until Halle heard Goodman's plea for help, nobody had considered adapting the technique to astronomy. After the NIH workshop, the pair quickly began working together and soon brought Michelle Borkin, then a Harvard undergraduate in astronomy and now a Harvard Ph.D. student in applied physics, on board the project. Borkin was instantly hooked. "The very first time that I saw our astronomical data come to life in 3D Slicer was amazing," says Borkin. "Viewing the data in 3D is far more intuitive to understand than looking at it in 2D. I was instantly able to start making new discoveries that are incredibly difficult to do otherwise, such as spotting elusive jets of gas ejected from newborn stars."

While continuing to research star-forming regions using 3D Slicer, the Harvard team is currently working on projects that will give back to the medical world, developing tools based on algorithms used in astronomy to visualize, for example, coronary arteries.

The ways in which these two interdisciplinary projects have been able to share tools are special cases, cautions Stephen Wong, a biomedical informatics scientist at the Methodist Hospital Research Institute in Houston, Texas. In general, Wong says, using secondhand algorithms is not a good idea: "To be effective, image-processing algorithms and analysis tools have to be customized and specific to the particular problems under investigation."

Working in interdisciplinary research is also demanding on scientists' already hectic work schedules and has to be fitted around their traditional career duties. "I spend about 10% of my time working on PathGrid and the rest on my day job as part of the European Space Agency's Gaia spacecraft science team," says Walton.

Yet for the scientists involved in both projects, taking on this supplementary work is a labor of love. "Usually, you become an expert in just one field," says Goodman, "but I've had the opportunity to learn something completely new in my 40s. I think people should go into interdisciplinary research, not just because the world might learn something, but because you will too."

As for the astronomers on the PathGrid team, they're able to make a boast that few of their stargazing colleagues can match. "It's great to think that something I'm doing is going to have an impact on how a cancer patient is treated and help to improve their chances of survival," says Walton. –SARAH REED

Sarah Reed is a freelance writer and former Science intern.

Stellar views. 3D imagery can give a clearer picture of the inner workings of the human body (top), and astronomers are using related visualization software to study distant star-forming regions (bottom).
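The "position-position-velocity" maps Goodman describes are easiest to picture as a stack of 2D sky images, one per velocity channel, combined into a single 3D array so that velocity can be browsed like a spatial axis. The sketch below only illustrates that data structure; the array names, sizes, and random "data" are hypothetical, and this is not code from the Astronomical Medicine project or 3D Slicer.

```python
# Minimal sketch of a position-position-velocity (PPV) cube (hypothetical data).
import numpy as np

n_x, n_y, n_chan = 64, 64, 40                      # sky pixels and velocity channels (made up)
velocities = np.linspace(-10.0, 10.0, n_chan)      # km/s, hypothetical channel grid

# Pretend each channel map is a 2D image of line intensity at one velocity.
rng = np.random.default_rng(0)
channel_maps = [rng.random((n_x, n_y)) for _ in range(n_chan)]

# Stack the channel maps into a PPV cube: velocity becomes the third axis.
ppv_cube = np.stack(channel_maps, axis=-1)         # shape (n_x, n_y, n_chan)

# Slicing the cube the way a 3D viewer lets you explore it:
channel_slice = ppv_cube[:, :, n_chan // 2]        # one velocity channel (a 2D map)
spectrum = ppv_cube[32, 32, :]                     # line profile at one sky position
moment0 = ppv_cube.sum(axis=2) * (velocities[1] - velocities[0])  # integrated intensity map

print(ppv_cube.shape, channel_slice.shape, spectrum.shape, moment0.shape)
```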


NEWS

May the Best Analyst Win
Exploiting crowdsourcing, a company called Kaggle runs public competitions to analyze the data of scientists, companies, and organizations

Last May, Jure Žbontar, a 25-year-old computer scientist at the University of Ljubljana in Slovenia, was among the 125 million people around the world paying close attention to the televised finale of the annual Eurovision Song Contest. Started in 1956 as a modest battle between bands or singers representing European nations, the contest has become an often-bizarre affair in which some acts seem deliberately bad—France's 2008 entry involved a chorus of women wearing fake beards and a lead singer altering his vocals by sucking helium—and the outcome, determined by a tally of points awarded by each country following telephone voting, has become increasingly politicized.

Žbontar and his friends gather annually and bet on which of the acts will win. But this year he had an edge because he had spent hours analyzing the competition's past voting patterns. That's because he was among the 22 entries in, and the eventual winner of, an online competition to predict the song contest's results. The competition was run by Kaggle, a small Australian start-up company that seeks to exploit the concept of "crowdsourcing" in a novel way.

Kaggle's core idea is to facilitate the analysis of data, whether it belongs to a scientist, a company, or an organization, by allowing outsiders to model it. To do that, the company organizes competitions in which anyone with a passion for data analysis can battle it out. The contests offered so far have ranged widely, encompassing everything from ranking international chess players to


evaluating whether a person will respond to HIV treatments to forecasting if a researcher's grant application will be approved. Despite often modest prizes—Žbontar won just $1000—the competitions have so far attracted more than 3000 statisticians, computer scientists, econometrists, mathematicians, and physicists from approximately 200 universities in 100 countries, Kaggle founder Anthony Goldbloom boasts. And the wisdom of the crowds can sometimes outsmart those offering up their data. In the HIV contest, entrants significantly improved on the efforts of the research team that posed the challenge. Citing Žbontar's success as another example, Goldbloom argues that Kaggle can help bring fresh ideas to data analysis. "This is the beauty of competitions. He won not because he is perhaps the best statistician out there but because his model was the best for that particular problem. … It was a true meritocracy," he says.

Meeting the mismatch
Trained as an econometrician, Goldbloom set up his Melbourne-based company last year to meet a mismatch between people collecting data and those with the skills to analyze it. While writing about business for The Economist, Goldbloom noted that this disconnect afflicted many fields he was covering. He pondered how to attract data analysts, like himself, to solve the problems of others. His solution was to entice them with competitions and cash prizes. This was not a completely novel idea. In 2006, Netflix, an American corporation that

offers on-demand video rental, set up a competition with a prize of $1 million to design software that could better predict which movies customers might like than its own in-house recommendation software, Cinematch. Grappling with a huge data set— millions of movie ratings—thousands of teams made submissions until one claimed the prize in 2009 by showing that its software was 10% better than Cinematch. “The Netflix Prize and other academic data-mining competitions certainly played a part in inspiring Kaggle,” Goldbloom says. The prizes in the 13 Kaggle competitions so far range from $150 to $25,000 and are offered by the individuals or organizations setting up the contests. For example, chess statistician Jeff Sonas and the German company ChessBase, which hosts online games, sponsored a Kaggle challenge to improve on the playerranking system developed many decades ago by Hungarian-born physicist and chess master Arpad Elo. Its top prize was a DVD signed by several world chess champions. Still, Kaggle has shown that it doesn’t take a million-dollar prize to pit data analyst against data analyst. Kaggle’s contests have averaged 95 competitors so far, and the chess challenge drew 258 entries. “When I started running competitions, I found they were more popular and effective than I could have imagined,” Goldbloom says. “And the trend in the number of teams entering seems to be increasing with each new competition.” Statistician Rob Hyndman of Monash University, Clayton, in Australia, recently used Kaggle to lure 57 teams, including some from Chile, Antigua and Barbuda, and Serbia, into improving the prediction of how much money tourists spend in different regions of the world. “The results were amazing. … They quickly beat our best methods,” he says. Hyndman suspects that part of Kaggle’s success is offering feedback to competitors. Kaggle works by releasing online a small part of an overall data set. Competitors can analyze this smaller data set and develop appropriate algorithms or models to judge how the variables influence a final outcome. In the chess challenge, for example, a model could incorporate a player’s age, whether they won their previous game, if they played


Global contest. Kaggle’s competitions draw entries from many countries (arrow thickness reflects number of competitors from a country).


with white or black pieces, and other variables to predict whether a player will win their next game. The Kaggle competitors then use their models to predict outcomes from an additional set of inputs, and Kaggle evaluates those predictions against real outcomes and feeds back a publicly displayed score. In the chess challenge, the results of more than 65,000 matches between 8631 top players were offered as the training data set, and entrants had to predict the winners of nearly 8000 other already-played games. During a competition, which usually lasts 2 months, people or teams can keep submitting new entries but no more than two a day. “Seeing your rivals, and that they are close, spurs you on,” says Hyndman. Kaggle encourages the sponsors of the competition to release the winning algorithm—although they are not always persuaded to do so—and asks the winning team to write a blog post about how they tackled the problem and why they think their particular approach worked well. Goldbloom hopes that this means other entrants get something out of the competition despite not winning. They not only hone analytical skills by taking part, he says, but also are able to learn from other approaches. Predicting potential Although only a handful of its competitions have finished, Kaggle has had promising results so far. Each contest has generated a better model for its data than what was used beforehand. Bioinformaticist William Dampier of Drexel University in Philadelphia, Pennsylvania, organized the competition to predict, from their DNA, how a person with HIV might respond to a cocktail of antiretroviral drugs. This problem had been tackled extensively in academia, where the best models predicted the response of a patient to a set of three drugs with about 70% accuracy. By the end of the 3-month contest, the best entry was predicting a person’s drug response with 78% accuracy. Dampier says even this improvement in accuracy could help doctors further improve their treatment strategies beyond the current “guess the drug and check back later” approach. Dampier considers Kaggle’s approach innovative, noting that it draws in data analyzers with various backgrounds and perspectives who are not shackled by a field’s dogma. Such outsiders, he suspects, are

more likely to see something different and useful in the data set. “The results talk, not your position or your prestige. It is simply how well you can predict the data set,” says Dampier. His point is well illustrated by Žbontar. Despite not tabbing Eurovision’s actual winner, Germany, his overall prediction of the results beat a team from the SAS Institute—a data-mining company—and a team from the Massachusetts Institute of Technology. His submission incorporated both past national voting patterns—Eastern European countries tend to vote for each other, for example—and betting odds for the current contest. Goldbloom also attributes Kaggle’s success to crowdsourcing’s capacity to harness

the collective mind. "Econometrists, physicists, electrical engineers, actuaries, computer scientists, bioinformaticists—they all bring their own pet techniques to the problem," says Goldbloom. And because Kaggle encourages competitors to trade ideas and hints, they can learn from each other.

One sponsor of a Kaggle competition estimates that some entrants may have spent more than 100 hours refining their data analysis. This begs the question: What's the attraction, given the small prizes? Many data analysts, Goldbloom discovered, crave real-world data to develop and refine their techniques. Timothy Johnson, an 18-year-old math undergraduate at the California Institute of Technology in Pasadena, says working with the real data of the chess-ranking competition—he finished 29th—was more challenging, educational, and "fun" than analyzing the fabricated data sets classes offer.

KAGGLE COMPETITIONS | PRIZE | COMPETITORS
Predicting acceptance of grant applications for the University of Melbourne | $5000 | 90
Predicting the "edges" of online social networks | $950 | 106
Improving chess player rating system | $617 | 258
Forecasting the movement of tourists around the globe | $500 | 57
Predicting HIV progression in people taking different combinations of drugs | $500 | 109
Predicting how far each country's football team will progress through the World Cup | $100 | 65
Estimating travel time on one of Australia's main traffic arteries | $10,000 | 205
Forecasting the final rankings of countries in the 2010 Eurovision Song Contest | $1000 | 22

Business solution. Anthony Goldbloom (left) founded Kaggle to run contests to solve data problems.

For Chris Raimondi, a search-engine expert based in Baltimore, Maryland, and winner of the HIV-treatment competition, the Kaggle contest motivated him to hone his skills in a newly learned computer language called R, which he used to encode the winning data model. Raimondi also enjoys the competitive aspect of Kaggle challenges: "It was nice to be able to compare yourself with others; … it became kind of addictive. … I spent more time on this than I should."

What has proved tricky for Kaggle is persuading companies, agencies, and researchers to open up their data. Goldbloom tries to assuage companies' concerns about putting some of their data up on the Web by pointing out that they will get a competitive advantage if the Kaggle contestants produce a better solution to their data problems. So

far, two private companies, one government agency, and three universities are among the groups to have used Kaggle. As for researchers, Goldbloom says most reject his advances with an almost “visceral reaction.” Overcoming such reluctance to expose data may be key to his company’s survival. No one pays to enter a competition, so Kaggle depends on charging a fee to those running a contest—the sum changes from competition to competition. “We aren’t profitable yet, but we have some huge projects coming up and we hope to be profitable by the end of the year,” says Goldbloom. Žbontar hopes Kaggle survives, as he’s looking forward to bettering his prediction model for this year’s Eurovision Song Contest and perhaps prying his friends out of more beer money. In a blog post analyzing his victory this past year, he issued this playful challenge: “I have many ideas for next year, which I will, for the moment at least, keep to myself.” –JENNIFER CARPENTER
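The workflow described in this article (fit a model on a released training subset, predict the held-out outcomes, and submit the predictions for leaderboard scoring) can be sketched in a few lines. Everything below is hypothetical: the file names, column names, and the choice of logistic regression are illustrative assumptions, not the methods any particular entrant used.

```python
# Minimal sketch of a prediction-contest entry on hypothetical chess-game data.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("chess_train.csv")   # already-played games with known outcomes (hypothetical file)
test = pd.read_csv("chess_test.csv")     # games whose winners must be predicted (hypothetical file)

features = ["white_rating", "black_rating", "white_won_previous_game"]  # hypothetical columns
model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["white_wins"])

# Probability that the player with the white pieces wins each held-out game.
test["predicted_white_wins"] = model.predict_proba(test[features])[:, 1]

# Write a submission file for the organizer to score against the real outcomes.
test[["game_id", "predicted_white_wins"]].to_csv("submission.csv", index=False)
```

The daily submission limit and public leaderboard mentioned above are what turn this loop into an iterative refinement process for entrants.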


PERSPECTIVE

Climate Data Challenges in the 21st Century
Jonathan T. Overpeck,1* Gerald A. Meehl,2 Sandrine Bony,3 David R. Easterling4

1Institute of the Environment, 845 North Park Avenue, Suite 532, University of Arizona, Tucson, AZ 85721, USA. 2National Center for Atmospheric Research, Boulder, CO, USA. 3CNRS, Laboratoire de Météorologie Dynamique, Institut Pierre-Simon Laplace, Université Pierre et Marie Curie, Paris, France. 4National Oceanic and Atmospheric Administration (NOAA)/National Climatic Data Center, Asheville, NC, USA. *To whom correspondence should be addressed. E-mail: [email protected]

Climate data are dramatically increasing in volume and complexity, just as the users of these data in the scientific community and the public are rapidly increasing in number. A new paradigm of more open, user-friendly data access is needed to ensure that society can reduce vulnerability to climate variability and change, while at the same time exploiting opportunities that will occur.

Climate variability and change, both natural and anthropogenic, exert considerable influences on human and natural systems. These influences drive the scientific quest for an understanding of how climate behaved in the past and will behave in the future. This understanding is critical for supporting the needs of an ever-broadening spectrum of society's decision-makers as they strive to deal with the influences of Earth's climate at global to local scales. Our understanding of how the climate system functions is built on a foundation of climate data, both observed and simulated (Fig. 1). Although research scientists have been the main users of these data, an increasing number of resource managers (working in fields such as water, public lands, health, and marine resources) need and are seeking access to climate data to inform their decisions, just as a growing range of policy-makers rely on climate data to develop climate change strategies. Quite literally, climate data provide the backbone for billion-dollar decisions. With this gravity comes the responsibility to curate climate data and share it more freely, usefully, and readily than ever before.

The Exploding Volume of Climate Data
Documenting the past behavior of the climate system, as well as detecting changes and their causes, requires the use of data from instrumental, paleoclimatic, satellite, and model-based sources. The earliest instrumental (thermometer and barometer) records stretch back to the mid- to late 1600s, although widespread land- and ship-based observations were not initiated until the early to mid-1800s, mostly in support of weather forecasting and analysis. Changes in observations through time, due to shifts in observing practices, instrumentation, and land use, have made it necessary to develop and apply advanced data-processing algorithms in order to describe the time evolution of climate. Inevitably, there are uncertainties in the observational records that need to be translated into the degree of confidence associated with our understanding of how the climate system behaves.

Fig. 1. Climate data from observations and climate model simulations are critical for understanding the past and predicting the future. Increasingly, the climate data enterprise must serve both scientist and nonscientist equally well in terms of observed (left) and future (right) climate variability and change. Observations, models, research, and understanding are all underpinned by climate data, and all in turn inform uses in society such as those shown surrounding the arrow. The globe at left shows observed annual mean surface temperature anomalies (2006–2010) from the 1951–1980 base period average [NASA data are described in (20)]. Arctic sea-ice extent is the 5-year (2006–2010) June-July-August average, excluding sea-ice concentrations less than 10% [NOAA data are described in (21)]. The globe at right depicts projected surface temperature anomalies for an example five-member annual mean ensemble average from a climate model (CCSM4), 2081–2100 minus the 1986–2005 average, for the future greenhouse gas and aerosol emission scenario RCP8.5 (22). The sea-ice extent from the model is a 5-year (2096–2100) June-July-August ensemble average, excluding sea-ice concentrations less than 10%. The left side of the time series at bottom is annual mean observed globally averaged surface temperatures (20), and the right side depicts future projections for five-member ensemble averages from CCSM4 for three emission scenarios (RCP2.6, RCP4.5, and RCP8.5). The magnitude of future climate change depends on what society decides to do now in terms of emissions reductions. Taking little action produces the greatest warming as reflected by the RCP8.5 trajectory, whereas aggressive reductions as represented by RCP2.6 result in stabilized warming at a much lower level.
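The anomalies plotted in Fig. 1 are simply differences from a fixed base-period average. The sketch below illustrates that calculation on a synthetic gridded temperature series; the array shapes and variable names are hypothetical, and a real analysis would use quality-controlled observational products such as those cited in the caption, plus area weighting by latitude.

```python
# Minimal sketch of computing temperature anomalies relative to a 1951-1980 base period.
import numpy as np

years = np.arange(1880, 2011)                        # one temperature field per year
n_lat, n_lon = 36, 72                                # a coarse 5-degree grid (made up)
rng = np.random.default_rng(0)
temperature = 14.0 + rng.normal(0.0, 0.5, size=(years.size, n_lat, n_lon))  # synthetic data

base = (years >= 1951) & (years <= 1980)             # base-period mask
baseline = temperature[base].mean(axis=0)            # 1951-1980 mean at each grid cell

anomalies = temperature - baseline                   # anomaly field for every year
recent_map = anomalies[(years >= 2006) & (years <= 2010)].mean(axis=0)  # 2006-2010 mean anomaly map

# Simple (unweighted) global-average anomaly time series; real analyses weight by cos(latitude).
global_mean_series = anomalies.mean(axis=(1, 2))
print(recent_map.shape, global_mean_series.shape)
```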

11 FEBRUARY 2011

12

ciated with our understanding of how the climate system behaves. In addition to the already large body of digital instrumental data available in diverse holdings around the globe, a substantial number of critical observations, such as many early temperature observations, are not yet widely available as digital records. It is important to create and maintain central repositories of these data in a manner that firmly defines the origin and nature of the data and also ensures that they are freely available (1, 2). In addition, an increasing array of paleoclimatic proxy records from human and natural archives, such as historical documents, trees, sediments, caves, corals, and ice cores, are being generated. These records are particularly helpful in understanding climate variability before the period of instrumental data,

VOL 331

SCIENCE

SCIENCE

www.sciencemag.org

www.sciencemag.org

2011 Data Collections Booklet SPECIALSECTION over century to millennial time scales, through periods of past abrupt climate change, and during times when climate forcing was substantially different from that of today (3)—all critical for understanding what the climate of the future is likely to be. Some of these records have been centrally archived (4), but many have not or are described only in isolated references. Another key source of climate data is spaceborne instruments. The development of long-term, high-quality climate observations from satellites is more difficult than from surface-based instrumental data, because individual satellites and their instruments have short life spans (typically a few years), over which their orbits and sensitivities can change. These problems require the use of advanced data-processing techniques, and the resulting data are prone to being reprocessed as previously unknown problems are discovered over time. In addition, gaps in the records and systematic errors between satellites (or a lack of overlapping calibration periods) make the increasingly important construction of coherent climate data records more of a challenge (5). A third broad type of data is model-based “reanalyses”: hybrid model-observational data sets created by assimilating observations into a global or regional forecast model for a given time period (such as 1958 to the present). These provide physically consistent and expanded depictions of the observed time-evolving climate system and have become indispensable in climate system research. The future of reanalysis rests in the establishment of dedicated efforts that include frozen model versions and allow reprocessing of all observational data fields as models and input data sets improve. Future reanalysis methods will include more diverse observational data types (such as atmospheric chemistry, biospheric, oceanographic, and cryospheric data) and longer time scales (including paleoclimatic time scales). Finally, there has been an explosion in data from numerical climate model simulations, which have increased greatly in complexity and size. Data from these models are expected to become the largest and the fastest-growing segment of the global archive (Fig. 2). The archiving and sharing of output from climate models, particularly those run with a common experimental framework, began in the mid-1990s, starting with output from the early global coupled atmosphere-ocean general circulation models (AOGCMs) used for making future climate change projections (6). This led to the Coupled Model Intercomparison Project (CMIP), organized by the World Climate Research Program (WCRP), inviting all the modeling groups to make increasingly realistic simulations of 20th-century and possible future 21st-century climates (7–9). Recently, CMIP3 involved 16 international modeling groups from 11 countries, using 23 models and submitting 36 terabytes of model data, all archived by the

atmospheric dynamics and regional precipitation, as well as predicting natural climate variability and how much Earth, and local parts of it, could warm for a given amount of greenhouse gas forcing. New high-resolution active remote sensing observations from satellite instruments (such as CALIPSO lidar or CloudSat radar) are revealing the vertical distribution of clouds for the first time. However, to facilitate the comparison of model outputs with these complex new observations effectively, it has been necessary to develop and distribute new diagnostic tools (referred to as “observation simulators”) visualizing what these satellites would see if they were flying above the simulated atmosphere of a model (11, 12). Year Thanks to these developments, it will Fig. 2. The volume of worldwide climate data is expanding soon be possible to rigorously assess rapidly, creating challenges for both physical archiving and sharing, the realism of cloud simulations in the as well as for ease of access and finding what’s needed, partic- latest generation of models; for the ularly if you are not a climate scientist. The figure shows the price of an additional 6% [160 teraprojected increase in global climate data holdings for climate bytes (TB)] of CMIP5-related climate models, remotely sensed data, and in situ instrumental/proxy data that must be shared. data. Climate change modeling has evolved in just 5 years from running Program for Climate Model Diagnosis and In- a few AOGCM experiments with a single category tercomparison (PCMDI), signaling a “new era in of model, to running many more experiments with climate change research” (10). This activity has a much larger profusion of models of increasing made it possible for anyone to openly access these resolution and complexity. First-generation Earth state-of-the-art climate model outputs for analysis system models (ESMs) are now being run as part of the current CMIP5 exercise (13, 14). ESMs and research. Climate model data have been archived and include at least an interactive carbon cycle coupled accessed, exchanged, and shared primarily within to the traditional AOGCMs, which have atmothe physical climate science research community, sphere, ocean, land, and sea-ice components. Also, although there has been growing interest in the high-resolution climate models (such as those with use of these climate model data by other com- 20-km grid spacing) are run for time slices, past munities of researchers. CMIP was designed to and future, for integrations of a decade or two in provide this broader access to climate model output order to obtain a better quantification of regional for researchers from a wide range of communities. climate change and smaller-scale phenomena such The Intergovernmental Panel on Climate Change as hurricanes [for example, see (15)]. The net result (IPCC) was also able to use CMIP multimodel data is a huge increase in data volume (Fig. 2). Early sets to provide state-of-the-art assessments of what phases of the CMIP project involved less than the models as a group indicate about possible 1 TB of model data, whereas CMIP3 archived future climate change (10). Now climate models 36 TB, and CMIP5 is expected to make availare beginning to be used for much more than able 2.5 petabytes (PB). New capabilities of the climate research. 
Meeting the Needs of a Wide Range of Users
The burgeoning types and volume of climate data alone constitute a major challenge to the climate research community and its funding bodies.


Institutional capacity must exist to produce, format, document, and share all these data, while, at the same time, a much larger community of diverse users clamors to access, understand, and use climate data. These include an ever-increasing range of scientists (ecologists, hydrologists, social scientists, etc.) and decision-makers in society who have real money, livelihoods, and even lives at stake (resource managers, farmers, public health officials, and others). Key users also include those with public responsibilities, as well as their constituents in the general public who must support and understand decisions being made on their behalf. As a result, climate scientists must not only share data among themselves, but they must also meet a growing obligation to facilitate access to data for those outside their community and, in doing so, respond to this broader user community to ensure that the data are as useful as possible.

In addition to the latest IPCC assessment, various ongoing national climate assessment and climate services activities are being initiated that will need to access and use climate data, just as a growing number of climate adaptation and mitigation efforts around the globe will need to be better informed by climate data. These efforts will succeed only if climate data are made readily accessible in forms useful to scientists and nonscientists alike.

The Future of Climate Data: An Emerging Paradigm
Thus, two major challenges for climate science revolve around data: ensuring that the ever-expanding volumes of data (Fig. 2) are easily and freely available to enable new scientific research, and making sure that these data and the results that depend on them are useful to and understandable by a broad interdisciplinary audience. A new paradigm that joins traditional climate research with research on climate adaptation, services, assessment, and applications will require strengthened funding for the development and analysis of climate models, as well as for the broader climate data enterprise. Increased support from the funding agencies is needed to enhance data access, manipulation, and modeling tools; improve climate system understanding; articulate model limitations; and ensure that the observations necessary to underpin it all are made. Otherwise, climate science will suffer, and the climate information needed by society—climate assessment, services, and adaptation capability—will not only fall short of its potential to reduce the vulnerability of human and natural systems to climate variability and change, but will also cause society to miss out on opportunities that will inevitably arise in the face of changing conditions.

At present, about half of the international modeling groups are restricted from sharing digital climate model data beyond the research com-


munity because of governmental interest in the sale of intellectual property for commercial applications; the same holds true for some observational data. Open and free availability of model data, observations, and the software used for processing is crucial to all aspects of the new paradigm. Governments that currently restrict either model output or observed data distribution must be convinced that it is in the best interests of everyone that all climate data be made openly available to all users, including those engaged in research and applications. International agreements must eliminate data restrictions, just as journals and funding agencies should require easy access to all data associated with the papers they publish and the work they fund. The optimal use of climate data requires a more effective interdisciplinary communication of data limitations with regard to, for example, spatial and temporal sampling uncertainties; instrument changes; quality-control procedures; and, in particular, what model-based climate predictions or projections do well and not so well. The first step is to increase the accessibility of observations and high-resolution simulations to a wide range of users, either via a few centralized portals (such as PCMDI) with broader responsibilities, or using a more decentralized approach. A second step is to develop an international depository site for model diagnostic tools (such as satellite simulators) and evaluation metrics that would help users assess the reliability of specific aspects of model simulations (such as sea ice, El Niño–Southern Oscillation, or monsoons, droughts, and other climate extremes). The key is that new data-sharing systems have to be evaluated and improved until all types of interdisciplinary users are able to be effective partners in the use of climate data. An increasingly daunting aspect of having tens and eventually hundreds of petabytes of climate data openly available for analysis (Fig. 2) is how to actually look at and use the data, all the while understanding uncertainties. More resources need to be dedicated to the development of sophisticated software tools for sifting through, accessing, and visualizing the many model versions, experiments, and model fields (temperature, precipitation, etc.), as well as all of the observed data that is online. In parallel, it is becoming increasingly important to understand complex model results through a hierarchy of models, including simple or conceptual models (17). Without this step, it will be extremely difficult to make sense of such huge archived climate data sets and to assess the robustness of the model results and the confidence that may be put in them. Again, fulfilling the needs of all types of interdisciplinary users needs to be the metric of success. Increasingly, climate scientists and other types of scientists who work effectively at the interface between research and applications are working closely together, and even “coproducing” knowl-


edge, with climate stakeholders in society (18, 19). These stakeholders, along with the interdisciplinary science community that supports them, are the users that must drive the climate data enterprise of the future. References and Notes 1. C. K. Folland et al., in Temperature Trends in the Lower Atmosphere: Steps for Understanding and Reconciling Differences, T. R. Karl, S. J. Hassol, C. D. Miller, W. L. Murray, Eds. (U.S. Climate Change Science Program and the Subcommittee on Global Change Research, Washington, DC, 2006), pp. 119–127. 2. D. R. Easterling et al., in Weather and Climate Extremes in a Changing Climate. Regions of Focus: North America, Hawii, Carribbean, and U.S. Pacific Islands, T. R. Karl et al., Eds. (U.S. Climate Change Science Program and the Subcommittee on Global Change Research, Washington, DC, 2008), pp. 117–126. 3. E. Jansen et al., in Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, S. Solomon et al., Eds. (Cambridge Univ. Press, Cambridge, 2007), pp. 433–497. 4. Archived at the World Data Center for Paleoclimatology, www.ngdc.noaa.gov/wdc/usa/paleo.html. 5. National Research Council, Climate Data Records from Environmental Satellites (National Academy Press, Washington, DC, 2004). 6. G. A. Meehl, G. J. Boer, C. Covey, M. Latif, R. J. Stouffer, Eos 78, 445 (1997). 7. G. A. Meehl, G. J. Boer, C. Covey, M. Latif, R. J. Stouffer, Bull. Am. Meteorol. Soc. 81, 313 (2000). 8. G. A. Meehl, C. Covey, B. McAvaney, M. Latif, R. J. Stouffer, Bull. Am. Meteorol. Soc. 86, 89 (2005). 9. C. Covey et al., Global Planet. Change 37, 103 (2003). 10. G. A. Meehl et al., Bull. Am. Meteorol. Soc. 88, 1383 (2007). 11. J. M. Haynes, R. T. Marchand, Z. Luo, A. Bodas-Salcedo, G. L. Stephens, Quickbeam. Bull. Am. Meteorol. Soc. 88, 1723 (2007). 12. H. Chepfer et al., Geophys. Res. Lett. 35, L15704 (2008). 13. K. A. Hibbard, G. A. Meehl, P. Cox, P. Friedlingstein, Eos 88, 217 (2007). 14. K. E. Taylor, R. J. Stouffer, G. A. Meehl, A summary of the CMIP5 Experimental Design (2009); www-pcmdi.llnl.gov/. 15. K. Oouchi et al., J. Meteorol. Soc. Jpn. 84, 259 (2006). 16. D. N. Williams et al., Bull. Am. Meteorol. Soc. 90, 195 (2009). 17. I. M. Held, Bull. Am. Meteorol. Soc. 86, 1609 (2005). 18. M. C. Lemos, B. J. Morehouse, Global Environ. Change Hum. Policy Dimensions 15, 57 (2005). 19. R. S. Pulwarty, C. Simpson, C. R. Nierenberg, in Integrated Regional Assessment of Global Climate Change, C. G. Knight, J. Jäger, Eds. (Cambridge Univ. Press, Cambridge, 2009), pp. 367–393. 20. J. Hansen, R. Ruedy, M. Sato, K. Lo, Rev. Geophys. 48, RG4004 (2010). 21. R. W. Reynolds, N. A. Rayner, T. M. Smith, D. C. Stokes, W. Wang, J. Clim. 15, 1609 (2002). 22. R. H. Moss et al., Nature 463, 747 (2010). 23. The authors thank G. Strand [National Center for Atmospheric Research (NCAR)] and S. Veasey (National Climatic Data Center) for their contributions. NOAA supported this work (J.T.O.) through its Regional Integrated Sciences and Assessments Program. Portions of this study were also supported (G.A.M.) by the Office of Science (B.E.R.), U.S. Department of Energy, Cooperative Agreement no. DE-FC02-97ER62402; and NSF. NCAR (G.A.M.) is sponsored by NSF. 10.1126/science.1197869


PERSPECTIVE

Challenges and Opportunities of Open Data in Ecology
O. J. Reichman,* Matthew B. Jones, Mark P. Schildhauer

Ecology is a synthetic discipline benefiting from open access to data from the earth, life, and social sciences. Technological challenges exist, however, due to the dispersed and heterogeneous nature of these data. Standardization of methods and development of robust metadata can increase data access but are not sufficient. Reproducibility of analyses is also important, and executable workflows are addressing this issue by capturing data provenance. Sociological challenges, including inadequate rewards for sharing data, must also be resolved. The establishment of well-curated, federated data repositories will provide a means to preserve data while promoting attribution and acknowledgement of its use.

National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara, 735 State Street, Suite 300, Santa Barbara, CA 93101, USA.
*To whom correspondence should be addressed. E-mail: [email protected]

Ecology is an integrative, collaborative discipline (1, 2), amplifying the need for open access to data. The field has rapidly matured over the past century from small-scale, short-term observations and experiments conducted by individuals to include large-scale, long-term, multidisciplinary projects that integrate diverse data sets using sophisticated analytical approaches. Ecological investigations often require interactions with adjacent disciplines (e.g., evolution, genomics, geology, oceanography, and climatology) and disparate fields (e.g., epidemiology and economics). This broad scope generates major challenges for finding effective ways to discover, access, integrate, curate, and analyze the range and volume of relevant information.

The recent Deepwater Horizon oil spill in the Gulf of Mexico (3) presents a compelling example of the need for far better data access and preservation in ecology and science in general. Understanding spill impacts requires data for benthic, planktonic, and pelagic organisms, chemistry (for oil and dispersants), toxicology, oceanography, and atmospheric science, among others. It also requires data on economic, policy, and legal decisions that affect spill response and cleanup. Despite a few well-organized research groups that can provide relevant data (e.g., the Florida Coastal Ecosystems Long Term Ecological Research site) (4), most current and historical data germane to the spill are inaccessible or lost. Furthermore, despite numerous studies associated with past calamities, such as the Ixtoc spill in the Gulf of Mexico (5), only a small fraction of the data from these studies is available today. Consequently, our ability to understand both short-term and chronic effects of oil spills is severely limited.

As these examples illustrate, access to data is not only important for basic ecological research but also crucial for addressing the profound environmental concerns we face today and, inevitably, in the future. Unfortunately, only a small fraction of ecological data ever collected is readily discoverable and accessible, much less usable. Based on our own experience building data archives for ecology, we estimate that less than 1% of the ecological data collected is accessible after publication of associated results (6, 7). Rather than providing direct access to data, we share interpretations of distilled data through presentations and publications. To realize advances that are possible through ecological and environmental synthesis, we need to solve the technological and sociological challenges that have limited open access to data. While "open data" will enhance and accelerate scientific advance, there is also a need for "open science"—where not only data but also analyses and methods are preserved, providing better transparency and reproducibility of results.

Fig. 1. Data on ecological and environmental systems are (A) acquired, checked for quality, documented using an acquisition workflow, and then both the raw and derived data products are versioned and deposited in the DataONE federated data archive (red dashed arrows). Researchers discover and access data from the federation and then (B) integrate and process the data in an analysis workflow, resulting in derived data products, visualizations, and scholarly papers that are in turn archived in the data federation (red dashed arrows). Other researchers directly cite any of the versioned data, workflows, and visualizations that are archived in the DataONE federation.

Solving Technology Challenges
Reviews of ecological informatics have described three major technological challenges: data dispersion, heterogeneity, and provenance (8, 9). Ecosystems and habitats vary across the globe, and data are collected at thousands of locations. Although large quantities of data representing relatively few data sets are typically managed by major research projects, institutes, and agencies, most ecological data are difficult to discover and preserve because they are contained in relatively small data sets dispersed among tens of thousands of independent researchers. Data heterogeneity creates challenges due to the breadth of topics studied by ecologists and the varied experimental


protocols used by independent researchers. Data provenance—origins and history—is necessary when, as is typical in ecological research, interesting results emerge after the data go through complex, multistep processes of aggregation, modeling, and analysis.

The dispersed data issue has been partially addressed by large regional and subject-oriented data collections [e.g., Global Biodiversity Information Facility specimen records (10), the Knowledge Network for Biocomplexity (11, 12), the Dryad repository (13), and the National Biological Information Infrastructure Metadata Clearinghouse (14)]. Unfortunately, these and related efforts are still highly fragmented and collectively have not reached the critical mass of holdings to make them comprehensive. However, several initiatives to federate currently independent data networks are under way. For example, the DataONE (Fig. 1) project is enabling federated access to ecological data from all of the initiatives listed above and creating straightforward mechanisms for new data providers to join the federation. Similar efforts, such as the Data Conservancy project and the international Global Earth Observation System of Systems (GEOSS), will build large federations that eventually will be cross-linked and interoperable with one another and DataONE.

The heterogeneity of ecological data must be addressed when developing technological solutions for managing ecological information. Heterogeneous data in ecology arises from its diversity of subdisciplines (e.g., ecosystems/community ecology, marine/freshwater/terrestrial ecology, and plant/animal/microbial ecology). In addition, adjacent disciplines in earth and life science, as well as relevant disciplines in the social sciences and humanities, have their own terminologies, specialized measurements, and experimental designs that generate heterogeneity. One way to reduce complications arising from data heterogeneity involves adoption of common experimental practices and measurement standards. Logistical constraints and research priorities often make such an approach impractical. A more generically applicable approach is the use of structured metadata such as Ecological Metadata Language (EML) and the Biological Data Profile that have proven useful for characterizing heterogeneous data. Formal metadata specifications provide guidance for consistently describing data objects and data types (e.g., methods, units of measurement, and details of experimental design). However, these metadata systems provide topically labeled boxes to be filled in using natural language and therefore are not amenable to automated interpretation by computers. Advances in the use of controlled vocabularies (e.g., ontologies) will provide well-defined terms to fill in these boxes and enable computers to more precisely assist a researcher in locating and processing data of interest.
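As a concrete, though greatly simplified, illustration of the kind of structured documentation that formal specifications such as EML call for, the sketch below records the descriptive fields mentioned above (methods, units of measurement, experimental design) in a machine-readable form. It is not schema-valid EML; the field names and the example data set are hypothetical.

```python
# A deliberately simplified, EML-inspired metadata record (not schema-valid EML).
# Field names and the example data set are hypothetical.
import json

metadata = {
    "title": "Stream invertebrate counts, hypothetical watershed survey",
    "creator": {"name": "J. Researcher", "organization": "Example Field Station"},
    "temporal_coverage": {"begin": "2009-05-01", "end": "2010-09-30"},
    "geographic_coverage": {"west": -119.9, "east": -119.6, "south": 34.3, "north": 34.5},
    "methods": "Kick-net sampling, 3 replicates per reach, monthly",
    "attributes": [
        {"name": "site_id", "definition": "Sampling reach identifier", "unit": None},
        {"name": "date", "definition": "Collection date (ISO 8601)", "unit": None},
        {"name": "count", "definition": "Individuals per sample", "unit": "dimensionless"},
        {"name": "temp_c", "definition": "Stream temperature", "unit": "celsius"},
    ],
}

# Writing the record alongside the data file makes the data set self-describing
# and gives repositories something consistent to index and search.
with open("invertebrate_counts_metadata.json", "w") as fh:
    json.dump(metadata, fh, indent=2)
```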

The Semantic Web in particular is beginning to enhance data interoperability. Linked Open Data methods provide ways to connect together data from distributed sites using standard Web technologies, thereby attaching semantic descriptions to data resources (15). Unlike most Web sites that only provide human-readable Web pages, linked data allows computers to discover and collate data from the Web without human intervention, enabling new types of synthetic data studies at much larger scales. In addition, unified models for representing the semantics of scientific observations and measurements are emerging within a variety of communities, and these can be used in the linked data cloud. These efforts are useful for representing the semantics of ecological observations and for building tools that directly support synthesis through precise data search and automated data integration (16). Although the conceptual basis for these observational data modeling approaches has been demonstrated, substantial implementation must occur before semantic modeling will be available in common ecological data management tools.

Another major challenge is the critical need to track the provenance of derived data objects and scientific results from initial data collection, through quality assurance, analysis, modeling, and ultimately publication (17). Provenance is especially important to support scientific results used in policy and management decisions, where field experiments and techniques may not be fully reproducible due to difficulty of replicating environmental conditions. Computer scientists are making considerable progress in developing ways to capture provenance information. Scripted analysis systems like R, and scientific workflow systems like Kepler and Taverna, can be used to document the data processing and analysis details that led to a given set of results (Fig. 1). Scientific workflow applications can record critical information about the analytical process, including details about the data and how it is transformed, providing a comprehensive record of an analysis and its results. In this way, the data, analytical processes, and results become part of a knowledge base supporting evidence-based science, to better inform decision-making in conservation and resource management (18). In addition, new research shows that provenance traces from different studies can be linked when producing synthetic analyses that reuse existing data (19). The combination of formal systems for tracking provenance, and federated data repositories like DataONE that provide unique identifiers for every data object, will be instrumental in realizing the goal of fully reproducible science in support of understanding global environmental issues such as climate change, species invasions, and epidemics.
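The workflow systems named above capture provenance far more completely than any hand-rolled script can, but even a simple wrapper conveys the idea: record what went in, what was done, and what came out, so a derived product can be traced and regenerated. The file names and parameters below are hypothetical, and the sketch is not a substitute for Kepler, Taverna, or an R-based pipeline.

```python
# A minimal provenance record for one analysis step: input checksums,
# parameters, and outputs are logged so the derived product can be traced.
# File names and parameters are hypothetical.
import hashlib
import json
import sys
from datetime import datetime, timezone

def sha256(path):
    """Checksum a file so the exact input version is on record."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

inputs = ["invertebrate_counts.csv"]
params = {"aggregation": "monthly_mean", "min_replicates": 3}
outputs = ["monthly_means.csv"]

# ... the actual analysis that reads `inputs` and writes `outputs` goes here ...

provenance = {
    "script": sys.argv[0],
    "run_at": datetime.now(timezone.utc).isoformat(),
    "inputs": [{"file": p, "sha256": sha256(p)} for p in inputs],
    "parameters": params,
    "outputs": [{"file": p, "sha256": sha256(p)} for p in outputs],
}

with open("monthly_means.provenance.json", "w") as fh:
    json.dump(provenance, fh, indent=2)
```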

Solving Sociological and Cultural Challenges
Although it is challenging to develop new technological solutions to data sharing in ecology, the

social and cultural barriers may be even more onerous. Technical solutions will emerge that considerably enhance access to ecological data, but overcoming the cultural and sociological barriers to increased data access requires changing human behavior. Some disciplines (e.g., astronomy and oceanography) have a history of sharing data, perhaps because these fields rely on large, shared infrastructure. Other disciplines, such as genomics, also have shared repositories, largely due to the homogeneity of their data. Traditionally, ecologists have had few incentives for sharing information. Research involved gathering and analyzing one’s own data and publishing the distilled results in peer-reviewed journals. In addition, sharing data was not viewed as a valuable scholarly endeavor or as an essential part of doing science. Recent advances in ecological synthesis, however, are rapidly changing these attitudes to data sharing. Researchers might still be disinclined to share their data until they have fully completed analyzing and reporting on their observations and results. The concern is that if data are made openly available in the interim they may be used by other investigators, effectively scooping the data originators. Properly curated data alleviates this concern, as the use of data without permission or attribution would be condemned by colleagues and funding sources. Proper curation requires time and money and is inadequately supported in research funding. Establishment of a reward system should further motivate investigators to share their data. For example, if data sets are publishable and citable (e.g., Ecological Archives and Dryad), they will become more respected and valued as an important part of research and scholarship (20). The most effective means to alter the reward system is to make data sharing an expectation of funding and publications and reward those who meet these expectations. The National Science Foundation in the United States now requires an explicit data management plan in all proposals, which is a step in the right direction. Journals and societies that mandate data publication concurrently with research publications also have proven to be effective (e.g., GenBank). In addition to support for individual researchers to prepare and submit their data to public archives, the community needs to identify sustainable models for federated data archives that persist over decadal time scales. Models such as DataONE involve leveraging institutional contributions in a large federation to protect against uneven funding for individual institutions. Nevertheless, even these initiatives will not work without a sustained commitment from funding agencies that is specifically targeted at institutional data repositories and coordinating organizations. The evolution of GenBank offers evidence that technological advances and cultural metamorphosis generate paradigm shifts in science.


With a tug from software to manage genomic data online and a push from publishers unwilling to continue editing and printing the growing volume of gene sequences, a robust data repository for gene sequences was born. Today, after almost 30 years, registering gene sequences and sharing them broadly is the norm and is recognized as fostering one of the greatest scientific revolutions in the past century.

Ecology is poised for a similar transformation. The pull comes from a need for data in synthesis and cross-cutting analysis that is facilitated by the emergence of community metadata standards and federated data repositories that span adjacent disciplines. The push is coming from funding entities that are requiring open access to data, with a dose of urgency engendered by the chronic and acute environmental degradation occurring globally. Furthermore, the rewards for sharing data are increasing. As noted, it is possible to publish peer-reviewed, citable data sets in repositories while giving credit to the data contributors, and there is evidence that published papers that do make available their data are cited more frequently than those that do not (21).

We have presented some of the major challenges and emerging solutions for dealing with the vast volume and heterogeneity of ecological data. To accelerate the advance of ecological understanding and its application to critical environmental concerns, we must move to the next level of information management by providing

revolutionary new data-management applications, promoting their adoption, and hastening the emergence of communities of practice. Concurrently, we must encourage the growing culture of collaboration and synthesis that has emerged in ecology that is fundamentally altering the scientific method to require comprehensive data sharing, as well as greater reproducibility and transparency of the methods and analyses that support scientific insights.

References and Notes
1. S. Carpenter et al., Bioscience 59, 699 (2009).
2. E. Hackett, J. Parker, D. Conz, D. Rhoten, A. Parker, in Scientific Collaboration on the Internet, G. M. Olson et al., Eds. (MIT Press, Boston, 2008), pp. 277–296.
3. T. J. Crone, M. Tolstoy, Science 330, 634 (2010).
4. Florida Coastal Everglades Data Resources, http://fce.lternet.edu/data/FCE.
5. J. W. Tunnell, Q. R. Dokken, M. E. Kindinger, L. C. Thebeau, "Effects of the Ixtoc I oil spill on intertidal and subtidal infaunal populations along lower Texas coast barrier island beaches," in Proceedings of the 1981 Oil Spill Conference (American Petroleum Institute, Washington, DC, 1981), pp. 467–475.
6. C. J. Savage, A. J. Vickers, C. Mavergames, PLoS ONE 4, e7078 (2009).
7. P. B. Heidorn, Libr. Trends 57, 280 (2008).
8. W. K. Michener, Ecol. Inform. 1, 3 (2006).
9. M. B. Jones, M. Schildhauer, O. J. Reichman, S. Bowers, Annu. Rev. Ecol. Evol. Syst. 37, 519 (2006).
10. V. S. Chavan et al., State-of-the-Network 2010: Discovery and Publishing of the Primary Biodiversity Data Through the GBIF Network (Global Biodiversity Information Facility, Copenhagen, 2010).
11. S. J. Andelman, C. Bowles, M. R. Willig, R. Waide, Bioscience 54, 240 (2004).
12. The Knowledge Network for Biocomplexity, http://knb.ecoinformatics.org.
13. H. White, S. Carrier, A. Thompson, J. Greenberg, R. Scherle, "The Dryad data repository: A Singapore framework metadata architecture in a DSpace environment," in Proceedings of the International Conference on Dublin Core and Metadata Applications, J. Greenberg, W. Klas, Eds. (Dublin Core Metadata Initiative and Universitätsverlag Göttingen, Berlin, 2008), pp. 157–162.
14. I. San Gil, V. Hutchison, G. Palanisamy, M. Frame, J. Libr. Metadata 10, 99 (2010).
15. C. Bizer, T. Heath, K. Idehen, T. Berners-Lee, "Linked data on the Web (LDOW2008)," in Proceedings of the 17th International Conference on World Wide Web (Association for Computing Machinery, New York, 2008), pp. 1265–1266.
16. J. Madin, J. S. Bowers, M. Schildhauer, M. B. Jones, Trends Ecol. Evol. 23, 159 (2008).
17. P. Buneman, S. Khanna, W.-C. Tan, Lect. Notes Comput. Sci. 1974, 87 (2000).
18. W. Sutherland, A. Pullin, P. Dolman, T. Knight, Trends Ecol. Evol. 19, 305 (2004).
19. P. Missier et al., "Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science," presentation at WORKS 2010: 5th Workshop on Workflows in Support of Large-Scale Science, IEEE Computer Society, New Orleans, 14 November 2010.
20. T. Vision, Bioscience 60, 330 (2010).
21. H. A. Piwowar, R. S. Day, D. B. Fridsma, J. Ioannidis, PLoS ONE 2, e308 (2007).
22. Supported by the National Center for Ecological Analysis and Synthesis, a Center funded by NSF (grant EF-0553768), the University of California, Santa Barbara, and the State of California. Additional support for M.B.J. was provided by NSF grant OCI-0830944 and for O.J.R. by NSF grant DEB-0444217.
10.1126/science.1197962

PERSPECTIVE

Changing the Equation on Scientific Data Visualization
Peter Fox and James Hendler*

An essential facet of the data deluge is the need for different types of users to apply visualizations to understand how data analyses and queries relate to each other. Unfortunately, visualization too often becomes an end product of scientific analysis, rather than an exploration tool that scientists can use throughout the research life cycle. However, new database technologies, coupled with emerging Web-based technologies, may hold the key to lowering the cost of visualization generation and allow it to become a more integral part of the scientific process.

Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
*To whom correspondence should be addressed. E-mail: [email protected]

A critical aspect of the data deluge is the need for users, whether they are scientists themselves, funders of science, or the concerned public, to be able to discover the relations among and between the results of data analyses and queries. Unfortunately, the creation of visualizations for complex data remains more of an art form than an easily conducted practice. What's more, especially for big science, the resource cost of creating useful visualizations is increasing: Although it was recently assumed that data-centric science required a rough split between the time to generate, analyze, and publish data (1), today the visualization and analysis component has become a bottleneck, requiring considerably more of the overall effort.



This trend will continue to get worse as new technologies for data generation are decreasing in price at an incredible rate (in terms of cost per data generated), whereas visualization costs are falling much more slowly. As a result of these trends, the extra effort of making our data understandable, something that should be routine, is consuming considerable resources that could be used for many other purposes. A consequence of the major effort for visualization is that it becomes an end product of scientific analysis, rather than an exploration tool allowing scientists to form better hypotheses in the continually more data-intensive scientific process. However, new database technologies and promising Web-based visualization approaches may be vital for reducing the cost of visualization generation and allowing it to become a central piece of the scientific process.

As an anecdotal example, consider the papers in the recently published The Fourth Paradigm, a collection of invited essays about the emerging area of data-intensive science (2). Only one of the more than 30 papers is primarily about visualization needs, but virtually all of the essays include visualizations that show off particular scientific results.

From Presentation
In the computing sciences, visualization has been in the hands of two communities. The first is the


human-computer interaction (HCI) community, which has considered visualization to be an important technology to study in its own right. Many of the tools used by scientists were developed by HCI practitioners as spin-offs of more general visualization technologies. The second is the graphics community, which has often been focused on the hardware for creating high-quality visualizations in science and other communities. The work of these research communities has led to very exciting capabilities including large-scale “immersive environments” (3), high-end threedimensional displays, rendering software kits, and visualization libraries, among others. Visualizations are absolutely critical to our ability to process complex data and to build better intuitions as to what is happening around us. For instance, consider how online display presentations, rather than text-only content, has enriched weather forecasts and allowed anyone to explore data ranging from precipitation to wind fields to temperature. These weather displays generally include the means to provide contextualization of the meteorological data (such as location, time, and key annotations) and to animate it. In addition, as weather data has moved to the Web, interactive visualizations have become more common. For example, through your browser you can localize the reports, look at video images, and link to many other sources. Many other data visualizations on the Web are becoming increasingly more sophisticated and interactive, while at the same time becoming easier to generate thanks to the prevalence of open Application Programming Interfaces and interactive Web-based visualization tool kits. Applications now allow the rapid generation of maps, charts, timelines, graphs, word clouds, search interfaces, RSS feeds, and many others capabilities. Additionally, owing to Web-based linking technologies, the visualizations change as the data changes (because they can make use of direct links to the data), which drastically reduces the effort to keep the visualizations timely and up to date. These “low-end” visualizations are often used in business analytics, open government data systems, and media infographics, but they have generally not been used in the scientific process. Despite the increasing prevalence of these techniques on the Web, we are often looking at tables of numbers, best-fit curves, or other analytic results rather than being able to use visual means when we interact with the complex scientific data in many fields. Many of the visualization tools that are available to scientists do not allow live linking as do these Web-based tools. Once the visualization is created, it is no longer tied to the data, so that it becomes an immutable information product—as the data changes, the visualization is no longer up to date. There are many reasons why this is the case, but two of them really dominate the situation. The first is that collecting scientific data is often


difficult and instrument-specific. As a result, most scientific data is created in a form and organization that facilitates its generation rather than focusing on its eventual use. The second is the scale of scientific data. The need for more and better data, as well as the continuing increases in our ability to design new data collection devices have continually kept scientists at the leading edge of data users. Thus, data collection, with the use of traditional databases, has mostly focused on the efficiency of query-based retrieval of the collected data, rather than on data exploration.

The data-scaling problem is particularly exacerbated by the difficulty in linking data from multiple instruments or sources together. Both the speed and precision of data retrieval have often been dependent on the quality of the data models being used, and modeling integrated data from many sources is generally exponentially harder than developing the model for any single source. The challenge is that many of the major scientific problems facing our world are becoming critically linked to the interdependence and interrelatedness of data from multiple instruments, fields, and sources. Consider, for example, climate modeling or translational biology, which increasingly require a systems-level perspective of our world, in the former, and our interactions with that world, in the latter. Although scientists with a deep understanding of their own data can often make sense of the raw data or that from field-specific analytic tools, getting the same sort of intuitions from data generated by different instruments or in a different part of an interdisciplinary study is much harder. Thus, where the need for exploration via visualization grows in importance for these large-scale interdisciplinary efforts, the technical issues of collection and scaling have traditionally worked against that end.

Fig. 1. Correlation of aerosol optical depth for the same instrument [the Moderate Resolution Imaging Spectroradiometer (MODIS)] operating on different satellites (Terra and Aqua) for the year 2008. The visualization reveals that there is a zero-correlation anomaly centered on the date line with the shape of an orbital character. [Image courtesy of NASA/Goddard Space Flight Center (G. Leptoukh)]

To Exploration
Although many of the largest data sources in the world remain scientific databases, the 20 years of exponential growth in the World Wide Web have brought a number of other players to the massive database table. The backend data technologies that power large Web companies like Google and Facebook and massive online communities such as World of Warcraft are generating huge and diverse data collections. These companies and communities have been pushing new approaches to databases. The wealth of data coming from the interactions of the estimated 1.9 billion users of the Web (4) is requiring these and other hubs to rapidly expand and deploy new capabilities, bring new data resources online quickly, and link together extremely large numbers of diverse and difficult-to-model data sources. In these enterprises, the traditional, model-based approaches to data integration, which are still used in most scientific efforts, have given way to new approaches such as NoSQL (5), "big data" (6), and scalable linked data (7, 8). Driven by the rapid pace of change on the Web, these new data approaches let visualization be integrated into the analytics process, which allows for the rapid understanding of these vast data holdings. The ability to understand, for example, the constantly changing connectivity of the social network underlying Facebook is crucial to the site's ability to remain responsive to its rapidly growing user community. Adding real-time search capabilities to Google has required a deep understanding of the "twitter sphere" and an ability to track it as it changes and grows.
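A small example of the "low-end," Web-linked style of visualization discussed earlier: because the chart is generated directly from a link to the published data rather than from a frozen local copy, rerunning the script keeps the figure current as the source data change. The URL and column names below are hypothetical.

```python
# A chart tied directly to its data source: regenerate it and it reflects
# whatever the service is currently publishing. URL and columns are hypothetical.
import matplotlib
matplotlib.use("Agg")  # render to a file; no display needed
import matplotlib.pyplot as plt
import pandas as pd

DATA_URL = "http://data.example.org/observatory/daily_measurements.csv"  # hypothetical

df = pd.read_csv(DATA_URL, parse_dates=["date"])

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(df["date"], df["measurement"], lw=1)
ax.set_xlabel("Date")
ax.set_ylabel("Measurement")
ax.set_title("Generated directly from the published data link")
fig.tight_layout()
fig.savefig("daily_measurements.png", dpi=150)
```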


The capabilities being seen in the Web domain may hold the key to breaking the scientific visualization bottleneck. These new approaches come with two key capabilities: (i) easy-to-use, low-end tools that will allow scientists to rapidly generate visualizations to explore hypotheses and (ii) scalable tools for creating and curating "high-end" visualizations, tied to new approaches in data collection and archiving, making it possible to develop and maintain existing visualizations at lower cost. However, these tools also create a number of research challenges that the scientific community must tackle.

First, new approaches are needed for determining how best to visualize particular kinds of scientific data. A strong start in this direction can be seen in the "Periodic Table of Visualization Methods" developed by Lengler and Epler (9), which shows a number of visualization techniques organized by the type of data (or processes) they apply to and the complexity of their application. Additionally, discussion is beginning to move from general principles of effective visualization (10) to much more specific advice of use for scientists, such as how best to merge particular kinds of statistics with visualizations (11).

A second challenge is to create and maintain data and information provenance for visualizations; often these are key to understanding and fixing data errors. As an example, consider the data shown in Fig. 1, which depicts Earth observation results from two satellites. It becomes immediately clear from this visualization that something odd is happening in the middle of the presentation, where the displayed data quality is clearly different than that elsewhere in the diagram. Understanding the cause of this error, however, requires knowing which observations came from which satellite, the orbital characteristics, and even how a "day" is defined for the data products, but more importantly, knowing when a combination of these factors leads to the artifact displayed (which turns out to be due to overpass time differences; in this case the correlated values are defined up to 22 hours apart at the date line). When known, these time differences can be accounted for and a new and corrected visualization created (12).

A related and scientifically critical challenge is the need to ensure that fitness for purpose is well explained in systems that can generate a wide variety of visual analysis products. Factors such as data and information quality, bias, and contextual relevance rarely make their way into visual representations, but they must. There are major efforts under way to meet this set of challenges, but research efforts are still required for a scalable and Web-enabled solution.

Another challenge follows from the desirable features that these visualizations are linked to the underlying data and can change dynamically as the data changes, and that these visualizations can be interactive in a Web context. Though powerful for exploring data, these features drive us toward visualizations that are primarily quantitative, as

distinct from ones that may be interpretative or nonrealistic (for instance, a cartoon or a purposefully distorted view). In presenting scientific results, particularly outside of a traditional scientific context, these visualizations can be extremely powerful, but they generally require creative or artistic efforts that are beyond the range of current computational capabilities. Finding ways to couple the fully or semiautomated approaches of data analytics with the more creative, human-based methods so that change in the former can be exploited to help maintain the latter is a clear and emerging challenge for future publication and presentation of scientific results. Finally, once these visualization products become routine, their management becomes a critical part of the scientific process, which means we must develop techniques that can maintain the visualization products throughout their life cycle. If we think that analysis pipelines are opaque, what about visualization pipelines? It is necessary to the future of scientific problem solving that we can find means to open these visualization pipelines to provide suitable data provenance so that the visualizations can be maintained and reused across our ever-widening and increasingly interdisciplinary scope of scientific problems. As modern information technologies increasingly allow scientists to take advantage of rapid visualizations to provide for better understanding of what our data tells us about the problems we are solving, we can stop thinking of visualization

Fig. 2. Projecting the results of network-visualization tools onto a large screen allows a group of scientists to explore the relations among a large number of data elements without the specific need for expensive visualization tools. By using simple visualization techniques such as these, results can be shared early in the research process, rather than waiting to use special-purpose visualization technologies as an end product of scientific analysis.


as a necessary evil at the end of the scientific pipeline and use it as a tool in data comprehension. Similar to businesses that have come to use data analytics as a means to keep pace with a changing world, scientists must explore how better data sharing and visualization technologies will allow them to do the same. This requires scientists to use visualization tools earlier in the process and to document the relations between the data and the visualizations produced. At the top end of the visualization spectrum, and for those that can afford and maintain them, substantial capabilities are being applied to explore and understand large data sets. At the lower end, when data size is not a limiting factor, we see an improving set of capabilities suitable for scientific use based primarily on Web-based visualization services. In between, however, at a scale where increasingly more scientists work, we do not have routine and scalable capabilities for visualizing the data and information sources we need to advance science. What can be done? First, we must work with tool designers to make sure that visualizations are sharable during the entire life span of the scientific process. As one example of this, there are a number of standards for both the graphics and the metadata involved in sharing visualizations, but very few of these are supported in current scientific application tools. Second, there has been little work in the standardization of the workflow and linking technologies needed specifically for high-end scientific


visualizations. Research scientists need to work more closely with their computing colleagues to make sure that these needs are met and that the development of new analytic methods tied to scientific, as opposed to business, analysis is pursued. Finally, we must work together to explore new ways of scaling easy-to-generate visualizations to the data-intensive needs of current scientific pursuits. (Figure 2 shows the use of standard projection techniques as a low-cost alternative to expensive high-end technologies.) Although there are scientific problems that do call for specialized visualizers, many do not. By bringing visualization into everyday use in our laboratories, we can better understand the requirements for the design of new tool kits, and we can learn to share and maintain visualization workflows and products the way we share other scientific systems. A side effect may well be the lowering of barriers (such as costs and accessibility) to more sophisticated visualization of increasingly larger data sets, a crucial functionality for today’s data-intensive scientist.
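One lightweight way to keep a visualization tied to its data, in the spirit of the workflow sharing discussed above, is to save a small sidecar record of the data link, the script, and the parameters that produced each figure, so the image can later be checked and regenerated. The fields, paths, and URL below are illustrative assumptions, not an existing standard.

```python
# Save, next to a figure, a record of how it was produced so the visualization
# stays tied to its data and can be regenerated. Fields and paths are illustrative.
import json
import sys
from datetime import datetime, timezone

figure_file = "aerosol_correlation_map.png"
sidecar = {
    "figure": figure_file,
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "data_source": "http://data.example.org/aerosol/2008_daily.nc",  # hypothetical link
    "script": sys.argv[0],
    "parameters": {"year": 2008, "statistic": "pearson_correlation"},
    "notes": "Caveats about sampling and overpass times belong here, not in the image alone.",
}

with open(figure_file + ".provenance.json", "w") as fh:
    json.dump(sidecar, fh, indent=2)
```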

References and Notes
1. National Center for Atmospheric Research (NCAR), University Corporation for Atmospheric Research (UCAR), Towards a Robust, Agile, and Comprehensive Information Infrastructure for the Geosciences: A Strategic Plan for High-Performance Simulation (NCAR, UCAR, Boulder, CO, 2000); www.ncar.ucar.edu/Director/plan.pdf.
2. T. Hey, S. Tansley, K. Tolle, Eds., The Fourth Paradigm: Data-Intensive Scientific Discovery (Microsoft External Research, Redmond, WA, 2009).
3. We use the term "immersive environments" to include a number of high-end technologies for the visualization of complex, large-scale data. The term "Cave Automatic Virtual Environment (CAVE)" is often used for these environments based on the work of Cruz-Neira et al. (13). A summary of existing projects and technologies can be found at http://en.wikipedia.org/wiki/Cave_Automatic_Virtual_Environment.
4. 2010 Web usage estimate from Internet World Stats, www.internetworldstats.com/.
5. This general term accounts for a number of different research approaches; the Web site http://nosql-database.org/ maintains links to ongoing blogs and discussions on this topic.
6. R. Magoulas, B. Lorica, Introduction to Big Data, Release 2.0, issue 11 (O'Reilly Media, Sebastopol, CA, 2009); http://radar.oreilly.com/2009/03/big-data-technologiesreport.html.
7. C. Bizer, T. Heath, K. Idehen, T. Berners-Lee, "Linked data on the Web (LDOW2008)," in Proceedings of the 17th International Conference on World Wide Web, Beijing, 21 to 25 April 2008.
8. G. Williams, J. Weaver, M. Atre, J. Hendler, J. Web Semant. 8, 365 (2010).
9. R. Lengler, M. J. Epler, A Periodic Table of Visualization Methods, available at www.visual-literacy.org/periodic_table/periodic_table.html.
10. We note that this is also a good example of our general point about the need to integrate visualization into the exploration phase of science; identifying this problem earlier would have enabled substantially higher-quality data collection over the period shown.
11. S. K. Card, J. D. Mackinlay, B. Shneiderman, Readings in Information Visualization: Using Vision to Think (Morgan Kaufmann, San Francisco, 1999).
12. A. Perer, B. Shneiderman, "Integrating statistics and visualization: Case studies of gaining clarity during exploratory data analysis," ACM Conference on Human Factors in Computing Systems (CHI 2008), Florence, Italy, 5 to 10 April 2008.
13. C. Cruz-Neira, D. J. Sandin, T. A. DeFanti, R. V. Kenyon, J. C. Hart, Commun. ACM 35, 64 (1992).
10.1126/science.1197654

PERSPECTIVE

Challenges and Opportunities in Mining Neuroscience Data
Huda Akil,1* Maryann E. Martone,2 David C. Van Essen3

Understanding the brain requires a broad range of approaches and methods from the domains of biology, psychology, chemistry, physics, and mathematics. The fundamental challenge is to decipher the "neural choreography" associated with complex behaviors and functions, including thoughts, memories, actions, and emotions. This demands the acquisition and integration of vast amounts of data of many types, at multiple scales in time and in space. Here we discuss the need for neuroinformatics approaches to accelerate progress, using several illustrative examples. The nascent field of "connectomics" aims to comprehensively describe neuronal connectivity at either a macroscopic level (in long-distance pathways for the entire brain) or a microscopic level (among axons, dendrites, and synapses in a small brain region). The Neuroscience Information Framework (NIF) encompasses all of neuroscience and facilitates the integration of existing knowledge and databases of many types. These examples illustrate the opportunities and challenges of data mining across multiple tiers of neuroscience information and underscore the need for cultural and infrastructure changes if neuroinformatics is to fulfill its potential to advance our understanding of the brain.

1The Molecular and Behavioral Neuroscience Institute, University of Michigan, Ann Arbor, MI, USA. 2National Center for Microscopy and Imaging Research, Center for Research in Biological Systems, University of California, San Diego, La Jolla, CA, USA. 3Department of Anatomy and Neurobiology, Washington University School of Medicine, St. Louis, MO 63110, USA.
*To whom correspondence should be addressed. E-mail: [email protected]

Deciphering the workings of the brain is the domain of neuroscience, one of the most dynamic fields of modern biology. Over the past few decades, our knowledge about the nervous system has advanced at a remarkable pace. These advances are critical for understanding the mechanisms underlying the broad range of brain functions, from controlling breathing to


forming complex thoughts. They are also essential for uncovering the causes of the vast array of brain disorders, whose impact on humanity is staggering (1). To accelerate progress, it is vital to develop more powerful methods for capitalizing on the amount and diversity of experimental data generated in association with these discoveries. The human brain contains ~80 billion neurons that communicate with each other via specialized connections or synapses (2). A typical adult brain has ~150 trillion synapses (3). The point of all this communication is to orchestrate brain activity. Each neuron is a piece of cellular machinery that relies on neurochemical and electrophysiological mechanisms to integrate complicated inputs and communicate information to other neurons. But



no matter how accomplished, a single neuron can never perceive beauty, feel sadness, or solve a mathematical problem. These capabilities emerge only when networks of neurons work together. Ensembles of brain cells, often quite far-flung, form integrated neural circuits, and the activity of the network as a whole supports specific brain functions such as perception, cognition, or emotions. Moreover, these circuits are not static. Environmental events trigger molecular mechanisms of neuroplasticity that alter the morphology and connectivity of brain cells. The strengths and pattern of synaptic connectivity encode the "software" of brain function. Experience, by inducing changes in that connectivity, can substantially alter the function of specific circuits during development and throughout the life span.

A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling "neural choreography": the integrated functioning of neurons into brain circuits, including their spatial organization, local, and long-distance connections; their temporal orchestration; and their dynamic features, including interactions with their glial cell partners. Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze, and mine information from each level of analysis and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior.

The Need for Neuroinformatics
The profoundly complex nature of the brain requires that neuroscientists use the full spectrum of tools available in modern biology: genetic,


cellular, anatomical, electrophysiological, behavioral, evolutionary, and computational. The experimental methods involve many spatial scales, from electron microscopy (EM) to whole-brain human neuroimaging, and time scales ranging from microseconds for ion channel gating to years for longitudinal studies of human development and aging. An increasing number of insights emerge from integration and synthesis across these spatial and temporal domains. However, such efforts face impediments related to the diversity of scientific subcultures and differing approaches to data acquisition, storage, description, and analysis, and even to the language in which they are described. It is often unclear how best to integrate the linear information of genetic sequences, the highly visual data of neuroanatomy, the time-dependent data of electrophysiology, and the more global level of analyzing behavior and clinical syndromes.

The great majority of neuroscientists carry out highly focused, hypothesis-driven research that can be powerfully framed in the context of known circuits and functions. Such efforts are complemented by a growing number of projects that provide large data sets aimed not at testing a specific hypothesis but instead enabling data-intensive discovery approaches by the community at large. Notable successes include gene expression atlases from the Allen Institute for Brain Sciences (4) and the Gene Expression Nervous System Atlas (GENSAT) Project (5), and disease-specific human neuroimaging repositories (6). However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.

Below we highlight several major endeavors that provide complementary perspectives on the challenges and opportunities in neuroscience data mining. One is a set of "connectome" projects that aim to comprehensively describe neural circuits at either the macroscopic or the microscopic level. Another, the Neuroscience Information Framework (NIF), encompasses all of neuroscience and provides access to existing knowledge and databases of many types. These and other efforts provide fresh approaches to the challenge of elucidating neural choreography.

Connectomes: Macroscopic and Microscopic
Brain anatomy provides a fundamental three-dimensional framework around which many types of neuroscience data can be organized and mined. Decades of effort have revealed immense amounts of information about local and long-distance connections in animal brains. A wide range of tools (such as immunohistochemistry and in situ hybridization) have characterized the biochemical nature of these circuits that are studied electrophysiologically, pharmacologically, and behaviorally (7). Several ongoing efforts aim to integrate anatomical information into searchable

resources that provide a backbone for understanding circuit biology and function (8–10). The challenge of integrating such data will dramatically increase with the advent of high-throughput anatomical methods, including those emerging from the nascent field of connectomics. A connectome is a comprehensive description of neural connectivity for a specified brain region at a specified spatial scale (11, 12). Connectomics currently includes distinct subdomains for studying the macroconnectome (long-distance pathways linking patches of gray matter) and the microconnectome (complete connectivity within a single gray-matter patch). The Human Connectome Project. Until recently, methods for charting neural circuits in the human brain were sorely lacking (13). This situation has changed dramatically with the advent

of noninvasive neuroimaging methods. Two complementary modalities of magnetic resonance imaging (MRI) provide the most useful information about long-distance connections. One modality uses diffusion imaging to determine the orientation of axonal fiber bundles in white matter, based on the preferential diffusion of water molecules parallel to these fiber bundles. Tractography is an analysis strategy that uses this information to estimate long-distance pathways linking different gray-matter regions (14, 15). A second modality, resting-state functional MRI (R-fMRI), is based on slow fluctuations in the standard fMRI BOLD signal that occur even when people are at rest. The time courses of these fluctuations are correlated across gray-matter locations, and the spatial patterns of the resultant functional connectivity correlation maps are closely related but not identical to the known pattern of direct anatomical connectivity (16, 17). Diffusion imaging and R-fMRI each have important limitations, but together they offer powerful and complementary windows on human brain connectivity.

Fig. 1. Schematic illustration of online data mining capabilities envisioned for the HCP. Investigators will be able to pose a wide range of queries (such as the connectivity patterns of a particular brain region of interest averaged across a group of individuals, based on behavioral criteria) and view the search results interactively on three-dimensional brain models. Data sets of interest will be freely available for downloading and additional offline analysis.

To address these opportunities, the National Institutes of Health (NIH) recently launched the Human Connectome Project (HCP) and awarded
grants to two consortia (18). The consortium led by Washington University in St. Louis and the University of Minnesota (19) aims to characterize whole-brain circuitry and its variability across individuals in 1200 healthy adults (300 twin pairs and their nontwin siblings). Besides diffusion imaging and R-fMRI, task-based fMRI data will be acquired in all study participants, along with extensive behavioral testing; 100 participants will also be studied with magnetoencephalography (MEG) and electroencephalography (EEG). Acquired blood samples will enable genotyping or full-genome sequencing of all participants near the end of the 5-year project. Currently, data acquisition and analysis methods are being extensively refined using pilot data sets. Data acquisition from the main cohort will commence in mid-2012.
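As a concrete illustration of the functional connectivity analysis described above, the short Python sketch below correlates resting-state BOLD time courses to produce a connectivity matrix, computed here over parcels rather than individual gray-matter locations for brevity. The array sizes, the random stand-in data, and the threshold are illustrative assumptions and do not reflect the HCP's actual parcellation or processing pipeline.

import numpy as np

def functional_connectivity(bold):
    """bold: (n_parcels, n_timepoints) array of mean BOLD time courses.
    Returns the (n_parcels, n_parcels) Pearson correlation matrix, i.e. a
    parcellated functional connectivity map of the kind described above."""
    return np.corrcoef(bold)

# Illustrative stand-in data: 360 parcels, 1200 time points of resting-state signal.
rng = np.random.default_rng(0)
bold = rng.standard_normal((360, 1200))
fc = functional_connectivity(bold)

# Keep only correlations above an arbitrary threshold to sketch a connectivity graph.
edges = np.argwhere(np.triu(fc > 0.3, k=1))
print(fc.shape, "connectivity matrix,", len(edges), "suprathreshold parcel pairs")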

Neuroimaging and behavioral data from the HCP will be made freely available to the neuroscience community via a database (20) and a platform for visualization and user-friendly data mining. This informatics effort involves major challenges owing to the large amounts of data (expected to be ~1 petabyte), the diversity of data types, and the many possible types of data mining. Some investigators will drill deeply by analyzing high-resolution connectivity maps between all gray-matter locations. Others will explore a more compact “parcellated connectome” among all identified cortical and subcortical parcels. Data mining options will reveal connectivity differences between subpopulations that are selected by behavioral phenotype (such as high versus low IQ) and various other characteristics (Fig. 1). The utility of HCP-generated data will be enhanced by close links to other resources containing complementary types of spatially organized data, such as the Allen Human Brain Atlas (21), which contains neural gene expression maps. Microconnectomes. Recent advances in serial section EM, high-resolution optical imaging methods, and sophisticated image segmentation methods enable detailed reconstructions of the

microscopic connectome at the level of individual synapses, axons, dendrites, and glial processes (22–24). Current efforts focus on the reconstruction of local circuits, such as small patches of the cerebral cortex or retina, in laboratory animals. As such data sets begin to emerge, a fresh set of informatics challenges will arise in handling petabyte amounts of primary and analyzed data and in providing data mining platforms that enable neuroscientists to navigate complex local circuits and examine interesting statistical characteristics. Micro- and macroconnectomes exemplify distinct data types within particular tiers of analysis that will eventually need to be linked. Effective interpretation of both macro- and microconnectomic approaches will require novel informatics and computational approaches that enable these two types of data to be analyzed in a common framework and infrastructure. Efforts such as the Blue Brain Project (25) represent an important initial thrust in this direction, but the endeavor will entail decades of effort and innovation. Powerful and complementary approaches such as optogenetics operate at an intermediate (mesoconnectome) spatial scale by directly perturbing neural circuits in vivo or in vitro with light-activated ion channels inserted into selected neuronal types (26). Other optical methods, such as calcium imaging with two-photon laser microscopy, enable analysis of the dynamics of ensembles of neurons in microcircuits (27, 28) and can lead to new conceptualizations of brain function (29). Such approaches provide an especially attractive window on neural choreography as they assess or perturb the temporal patterns of macro- or microcircuit activity. The NIF Connectome-related projects illustrate ways in which neuroscience as a field is evolving at the level of neural circuitry. Other discovery efforts include genome-wide gene expression profiling [for example, (30)] or epigenetic analyses across multiple brain regions in normal and diseased brains. This wide range of efforts results in a sharp increase in the amount and diversity of data being generated, making it unlikely that neuroscience will be adequately served by only a handful of centralized databases, as is largely the case for the genomics and proteomics community (31). How, then, can we access and explore these resources more effectively to support the data-intensive discovery envisioned in The Fourth Paradigm (32)? Tackling this question was a prime motivation behind the NIF (33). The NIF was launched in 2005 to survey the current ecosystem of neuroscience resources (databases, tools, and materials) and to establish a resource description framework and search strategy for locating, accessing, and using digital neuroscience-related resources (34). The NIF catalog, a human-curated registry of known resources, currently includes more than 3500 such resources, and new ones are added

daily. Over 2000 of these resources are databases that range in size from hundreds to millions of records. Many were created at considerable effort and expense, yet most of them remain underused by the research community.

Clearly, it is inefficient for individual researchers to sequentially visit and explore thousands of databases, and conventional online search engines are inadequate, insofar as they do not effectively index or search database content. To promote the discovery and use of online databases, the NIF created a portal through which users can search not only the NIF registry but the content of multiple databases simultaneously. The current NIF federation includes more than 65 databases accessing ~30 million records (35) in major domains of relevance to neuroscience (Fig. 2). Besides very large genomic collections, there are nearly 1 million antibody records, 23,000 brain connectivity records, and >50,000 brain activation coordinates. Many of these areas are covered by multiple databases, which the NIF knits together into a coherent view. Although impressive, this represents only the tip of the iceberg. Most individual databases are underpopulated because of insufficient community contributions. Entire domains of neuroscience (such as electrophysiology and behavior) are underrepresented as compared to genomics and neuroanatomy. Ideally, NIF users should be able not only to locate answers that are known but to mine available data in ways that spur new hypotheses regarding what is not known.

Fig. 2. Current contents of the NIF. The NIF navigation bar displays the current contents of the NIF data federation, organized by data type and level of the nervous system. The number of records in each category is displayed in parentheses.

Perhaps the single biggest roadblock to this higher-order data mining is the lack of standardized frameworks for organizing neuroscience data. Individual investigators often use terminology or spatial coordinate systems customized for their own particular analysis approaches. This customization is a substantial barrier to data integration, requiring considerable human effort to access each resource, understand the context and content of the data, and determine the conditions under which they can be compared to other data sets of interest.

To address the terminology problem, the NIF has assembled an expansive lexicon and ontology covering the broad domains of neuroscience by synthesizing open-access community ontologies (36). The Neurolex and accompanying NIFSTD (NIF-standardized) ontologies provide definitions of over 50,000 concepts, using formal languages to represent brain regions, cells, subcellular structures, molecules, diseases, and functions, and the relations among them. When users search for a concept through the NIF, it automatically expands the query to include all synonymous or closely related terms. For example, a query for "striatum" will include "neostriatum, dorsal striatum, caudoputamen, caudate putamen" and other variants.

Neurolex terms are accessible through a wiki (37) that allows users to view, augment, and modify these concepts. The goal is to provide clear definitions of each concept that can be used not only by humans but by automated agents, such as the NIF, to navigate the complexities of human neuroscience knowledge. A key feature is the assignment of a unique resource identifier to make it easier for search algorithms to distinguish among concepts that share the same label. For example, nucleus (part of cell) and nucleus (part of brain) are distinguished by unique IDs. Using these identifiers in addition to natural language to reference concepts in databases and publications, although conceptually simple, is an especially powerful means for making data maximally discoverable and useful.
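The terminology machinery just described, synonym expansion plus unique concept identifiers, can be sketched in a few lines of Python. The synonym table and identifiers below are illustrative placeholders, not actual NIFSTD/Neurolex entries or the NIF's real API.

# Toy ontology-backed search: expand a query with synonyms and resolve an
# ambiguous label to a unique identifier. All terms and IDs are placeholders.
SYNONYMS = {
    "striatum": ["neostriatum", "dorsal striatum", "caudoputamen", "caudate putamen"],
}
CONCEPT_IDS = {
    ("nucleus", "cell part"): "EXAMPLE:0001",
    ("nucleus", "brain region"): "EXAMPLE:0002",
}

def expand_query(term):
    """Return the search term plus all registered synonyms."""
    return [term] + SYNONYMS.get(term.lower(), [])

def resolve(label, context):
    """Map an ambiguous label plus its context to a unique identifier."""
    return CONCEPT_IDS.get((label.lower(), context), "unresolved")

print(expand_query("striatum"))
print(resolve("nucleus", "cell part"), resolve("nucleus", "brain region"))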

These efforts to develop and deploy a semantic framework for neuroscience, spearheaded by the NIF and the International Neuroinformatics Coordinating Facility (38), are complemented by projects related to brain atlases and spatial frameworks (39–41), providing tools for referencing data to a standard coordinate system based on the brain anatomy of a given organism. Neuroinformatics as a Prelude to New Discoveries How might improved access to multiple tiers of neurobiological data help us understand the brain? Imagine that we are investigating the neurobiology of bipolar disorder, an illness in which moods are normal for long periods of time, yet are labile and sometimes switch to mania or depression without an obvious external trigger. Although highly heritable, this disease appears to be genetically very complex and possibly quite heterogeneous (42). We may discover numerous genes that impart vulnerability to the illness. Some may be ion channels, others synaptic proteins or transcription factors. How will we uncover how disparate genetic causes lead to a similar clinical phenotype? Are they all affecting the morphology of certain cells; the dynamics of specific microcircuits, for example, within the amygdala; the orchestration of information across regions, for example, between the amygdala and the prefrontal cortex? Can we create genetic mouse models of the various mutated genes and show a convergence at any of these levels? Can we capture the critical changes in neuronal and/or glial function (at any of the levels) and find ways to prevent the illness? Discovering the common thread for such a disease will surely benefit from tools that facilitate navigation across the multiple tiers of data—genetics, gene expression/epigenetics, changes in neuronal activity, and differences in dynamics at the micro and macro levels, depending on the mood state. No single focused level of analysis will suffice to achieve a satisfactory understanding of the disease. In neural choreography terms, we need to identify the dancers, define the nature of the dance, and uncover how the disease disrupts it. Recommendations Need for a cultural shift. To meet the grand challenge of elucidating neural choreography, we need increasingly powerful scientific tools to study brain activity in space and in time, to extract the key features associated with particular events, and to do so on a scale that reveals commonalities and differences between individual brains. This requires an informatics infrastructure that has built-in flexibility to incorporate new types of data and navigate across tiers and domains of knowledge. The NIF currently provides a platform for integrating and systematizing existing neuroscience knowledge and has been working to define

best practices for those producing new neuroscience data. Good planning and future investment are needed to broaden and harden the overall framework for housing, analyzing, and integrating future neuroscience knowledge. The International Neuroinformatics Coordinating Facility (INCF) plays an important role in coordinating and promoting this framework at a global level. But can neuroscience evolve so that neuroinformatics becomes integral to how we study the brain? This would entail a cultural shift in the field regarding the importance of data sharing and mining. It would also require recognition that neuroscientists produce data not just for consumption by readers of the conventional literature, but for automated agents that can find, relate, and begin to interpret data from databases as well as the literature. Search technologies are advancing rapidly, but the complexity of scientific data continues to challenge. To make neuroscience data maximally interoperable within a global neuroscience information framework, we encourage the neuroscience community and the associated funding agencies to consider the following set of general and specific suggestions:
1) Neuroscientists should, as much as is feasible, share their data in a form that is machine-accessible, such as through a Web-based database or some other structured form that benefits from increasingly powerful search tools.
2) Databases spanning a growing portion of the neuroscience realm need to be created, populated, and sustained. This effort needs adequate support from federal and other funding mechanisms.
3) Because databases become more useful as they are more densely populated (43), adding to existing databases may be preferable to creating customized new ones. The NIF, INCF, and other resources provide valuable tools for finding existing databases.
4) Data consumption will increasingly involve machines first and humans second. Whether creating database content or publishing journal articles, neuroscientists should annotate content using community ontologies and identifiers. Coordinates, atlas, and registration method should be specified when referencing spatial locations.
5) Some types of published data (such as brain coordinates in neuroimaging studies) should be reported in standardized table formats that facilitate data mining.
6) Investment needs to occur in interdisciplinary research to develop computational, machine-learning, and visualization methods for synthesizing across spatial and temporal information tiers.
7) Educational strategies from undergraduate through postdoctoral levels are needed to ensure that neuroscientists of the next generation are proficient in data mining and using the data-sharing tools of the future.

8) Cultural changes are needed to promote widespread participation in this endeavor. These ideas are not just a way to be responsible and collaborative; they may serve a vital role in attaining a deeper understanding of brain function and dysfunction. With such efforts, and some luck, the machinery that we have created, including powerful computers and associated tools, may provide us with the means to comprehend this “most unaccountable of machinery” (44), our own brain. References and Notes 1. World Health Organization, Mental Health and Development: Targeting People With Mental Health Conditions As a Vulnerable Group (World Health Organization, Geneva, 2010). 2. F. A. Azevedo et al., J. Comp. Neurol. 513, 532 (2009). 3. B. Pakkenberg et al., Exp. Gerontol. 38, 95 (2003). 4. www.alleninstitute.org/ 5. www.gensat.org/ 6. http://adni.loni.ucla.edu 7. A. Bjorklund, T. Hokfelt, Eds., Handbook of Chemical Neuroanatomy Book Series, vols. 1 to 21 (Elsevier, Amsterdam, 1983–2005). 8. http://cocomac.org/home.asp 9. http://brancusi.usc.edu/bkms/ 10. http://brainmaps.org/ 11. http://en.wikipedia.org/wiki/Connectome 12. O. Sporns, G. Tononi, R. Kotter, PLoS Comput. Biol. 1, e42 (2005). 13. F. Crick, E. Jones, Nature 361, 109 (1993). 14. H. Johansen-Berg, T. E. J. Behrens, Diffusion MRI: From Quantitative Measurement to in-vivo Neuroanatomy (Academic Press, London, ed. 1, 2009). 15. H. Johansen-Berg, M. F. Rushworth, Annu. Rev. Neurosci. 32, 75 (2009). 16. J. L. Vincent et al., Nature 447, 83 (2007). 17. D. Zhang, A. Z. Snyder, J. S. Shimony, M. D. Fox, M. E. Raichle, Cereb. Cortex 20, 1187 (2010). 18. http://humanconnectome.org/consortia 19. http://humanconnectome.org 20. D. S. Marcus, T. R. Olsen, M. Ramaratnam, R. L. Buckner, Neuroinformatics 5, 11 (2007). 21. http://human.brain-map.org/ 22. K. L. Briggman, W. Denk, Curr. Opin. Neurobiol. 16, 562 (2006). 23. J. W. Lichtman, J. Livet, J. R. Sanes, Nat. Rev. Neurosci. 9, 417 (2008). 24. S. J. Smith, Curr. Opin. Neurobiol. 17, 601 (2007). 25. H. Markram, Nat. Rev. Neurosci. 7, 153 (2006). 26. K. Deisseroth, Nat. Methods 8, 26 (2011). 27. B. F. Grewe, F. Helmchen, Curr. Opin. Neurobiol. 19, 520 (2009). 28. B. O. Watson et al., Front. Neurosci. 4, 29 (2010). 29. G. Buzsaki, Neuron 68, 362 (2010). 30. R. Bernard et al., Mol. Psychiatry, 13 April 2010 (e-pub ahead of print). 31. M. E. Martone, A. Gupta, M. H. Ellisman, Nat. Neurosci. 7, 467 (2004). 32. The Fourth Paradigm: Data Intensive Scientific Discovery, T. Hey, S. Tansler, K. Tolle, Eds. (Microsoft Research Publishing, Redmond, WA, 2009). 33. http://neuinfo.org 34. D. Gardner et al., Neuroinformatics 6, 149 (2008). 35. A. Gupta et al., Neuroinformatics 6, 205 (2008). 36. W. J. Bug et al., Neuroinformatics 6, 175 (2008). 37. http://neurolex.org 38. http://incf.org 39. http://www.brain-map.org 40. http://wholebraincatalog.org 41. http://incf.org/core/programs/atlasing 42. H. Akil et al., Science 327, 1580 (2010).

43. G. A. Ascoli, Nat. Rev. Neurosci. 7, 318 (2006). 44. N. Nicolson, J. Trautmann, Eds., The Letters of Virginia Woolf, Volume V: 1932–1935 (Harcourt, Brace, New York, 1982). 45. Funded in part by grants from NIH (5P01-DA021633-02), the Office of Naval Research (ONR-N00014-02-1-0879),

and the Pritzker Neuropsychiatric Research Consortium (to H.A.); and the HCP (1U54MH091657-01) from the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research (to D.C.V.E.). The NIF is supported by NIH Neuroscience Blueprint contract HHSN271200800035C via the National Institute on Drug Abuse to M.E.M. We thank D. Marcus, R. Poldrack, S. Curtiss, A. Bandrowski, and S. J. Watson for discussions and comments on the manuscript.
10.1126/science.1199305

PERSPECTIVE

The Disappearing Third Dimension
Timothy Rowe1 and Lawrence R. Frank2
1Jackson School of Geosciences, The University of Texas, Austin, TX 78712, USA. 2Department of Radiology, University of California, San Diego, CA 92093, USA. *To whom correspondence should be addressed. E-mail: [email protected] (T.R.); [email protected] (L.R.F.)

Three-dimensional computing is driving what many would call a revolution in scientific visualization. However, its power and advancement are held back by the absence of sustainable archives for raw data and derivative visualizations. Funding agencies, professional societies, and publishers each have unfulfilled roles in archive design and data management policy.

Three-dimensional (3D) image acquisition, analysis, and visualization are increasingly important in nearly every field of science, medicine, engineering, and even the fine arts. This reflects rapid growth of 3D scanning instrumentation, visualization and analysis algorithms, graphic displays, and graduate training. These capabilities are recognized as critical for future advances in science, and the U.S. National Science Foundation (NSF) is one of many funding agencies increasing support for 3D imaging and computing. For example, one new initiative aims at digitizing biological research collections, and NSF's Earth Sciences program will soon announce a new target in cyberinfrastructure, with a spotlight on 3D imaging of natural materials. Many consider the advent of 3D imaging a "scientific revolution" (1), but in many ways the revolution is still nascent and unfulfilled. Given the increasing ease of producing these data and rapidly increased funding for imaging in nonmedical applications, a major unmet challenge to ensure maximal advancement is archiving and managing the science that will be, and even has already been, produced. To illustrate the problems, we focus here on one domain, volume elements or "voxels," the 3D equivalents of pixels, but the argument applies more broadly across 3D computing. Useful, searchable archiving requires infrastructure and policy to enable disclosure by data producers and to guarantee quality to data consumers (2). Voxel data have generally lacked a coherent archiving and dissemination policy, and thus the raw data behind thousands of published reports are not released or available for validation and reuse. A solution requires new

infrastructure and new policy to manage ownership and release (3). Technological advancements in both medical and industrial scanners have increased the resolution of 3D volumetric scanners to the point that they can now volumetrically digitize structures from the size of a cell to a blue whale, with exquisite sensitivity toward scientific targets ranging from tissue properties to the composition of meteorites. Voxel data sets are generated by rapidly diversifying instruments that include x-ray computed tomographic (CT) scanners (Fig. 1), magnetic resonance imaging (MRI) scanners (Fig. 2), confocal microscopes, synchrotron light sources, electron-positron scanners, and other innovative tools to digitize entire object volumes.
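A rough sense of why these data sets grow so quickly: halving the voxel edge length multiplies the voxel count, and hence the raw size, by eight. The Python sketch below works the arithmetic for a hypothetical specimen; the bounding box and the two bytes per voxel are illustrative assumptions, not values taken from any particular scanner.

def raw_volume_bytes(extent_mm, voxel_mm, bytes_per_voxel=2):
    """Raw size of a volume of dimensions extent_mm (x, y, z) digitized
    at an isotropic voxel size of voxel_mm, assuming bytes_per_voxel."""
    nx, ny, nz = (int(e / voxel_mm) for e in extent_mm)
    return nx * ny * nz * bytes_per_voxel

specimen = (68, 40, 30)  # hypothetical bounding box in mm for a small fossil skull
for voxel in (1.0, 0.2, 0.05):
    size_gb = raw_volume_bytes(specimen, voxel) / 1e9
    print(f"{voxel:4.2f}-mm voxels -> {size_gb:.4f} GB raw")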

CT and MRI have evolved across the widest range of applications. CT is sensitive to density. Its greatest strength is imaging dense materials like rocks, fossils, the bones in living organisms and, to a lesser extent, soft tissue. The first clinical CT scan, made in 1971, used an 80 by 80 matrix of 3 mm by 3 mm by 13 mm voxels, each slice measuring 6.4 Kb and taking up to 20 min to acquire. Each slice image took 7 min to reconstruct on a mainframe computer (4). In 1984, the first fossil, a 14-cm-long skull of the extinct mammal Stenopsochoerus, was scanned in its entirety (5), signaling CT's impact beyond the clinical setting and in digitizing entire object volumes. The complete data set measured 3.28 Mb. With a mainframe computer, rock matrix was removed to visualize underlying bone, and surface models were reconstructed in computations taking all night. By 1992, industrial adaptations brought CT an order-of-magnitude higher resolution (high-resolution x-ray computed tomography, HRXCT) to inspect much smaller, denser objects. A fossil skull of the stem-mammal Thrinaxodon (68 mm long) was scanned in 0.2-mm-thick slices measuring 119 Kb each. Scanning the entire volume took 6 hours, and the complete raw data set occupied 18.4 Mb (6). The scans revealed all internal details reported earlier, from destructive mechanical serial sectioning of a different specimen, and pushed

the older technique toward extinction (6). In 2008, the tiny tooth in Fig. 1 was scanned on a modern nanoXCT scanner, with the use of a cone beam to acquire the entire volume in multiple 20-s views, rather than in individual slices. The scan took 4.6 hours, generating 3.2-µm3 voxels in a data set consuming just over 1 Gb.

Fig. 1. (A) Photomicrograph of fossil tooth (Morganucodon sp.). (B) MicroXCT slice, 3.2-µm voxel size; arrows show ring artifact, a correctable problem with raw data but an interpretive challenge with compressed data. (C) Digital 3D reconstruction. (D) Slice through main cusp; red arrows show growth bands in dentine, and blue arrows mark the enamel-dentine boundary, both observed in Morganucodon for the first time in these scans.

In MRI, accelerated data acquisition with greater resolution has also been achieved, with the development of increasingly sophisticated hardware and higher-field magnets. Standard clinical human scanners (3 T) can acquire full 3D volumes of data (e.g., human brain) at 0.5-mm isotropic resolution in minutes. High-field (7 and 11.7 T) small-bore scanners are currently standard for high-resolution small-animal biomedical imaging (~80 and ~40 µm3 voxels, respectively), but human 7-T systems have now been developed. Unlike other clinical scanners, MRI can discriminate soft tissues because of its sensitivity to the state of tissue water. This sensitivity can take many forms and affords the ability to create contrast based on a wide variety of variations in tissue microstructure and physiology, such as local water content, tissue relaxation times, diffusion, perfusion, and oxygenation state. Therefore, for a single spatial location, there might be associated multiple voxels, and the data associated with each voxel can be of much higher dimensionality. For example, a typical diffusion tensor MRI (DT-MRI) data set might consist of multiple (e.g., 60) uniquely diffusion-sensitized images (60 voxels per spatial location), and each voxel can have associated with it multiple diffusion-related parameters derived from these 60 voxels (local diffusion tensor, mean diffusivity, neural fiber reconstructions, etc.).

By noninvasively producing digital data to visualize and measure internal architecture and physiology, voxel scanners provide a rich source of new quantifiable scientific discovery for anyone with access to those data. Retrievable archived digital data lends itself to repeated reuse with the latest advances in computational power and sophistication. This has become evident in data visualization, for example, where surface meshes can be extracted, internal parts can be segmented for discrete visualization, and a new frontier of quantitative 3D analysis is opening. Moreover, fully 3D digitized data allow different modalities to be combined to build data sets that are more informative than the individual data sets by themselves (Fig. 2). Scanned objects and segmented parts can be printed as physical replicas, through the use of stereolithography, laser sintering, and other "rapid prototyping" devices. The near future promises holographic display, 3D shape queries, and other potential uses.

As better voxel data have been obtained, their half-life and utility have steadily grown. Early CT data were stored on magnetic tape, whereas Stenopsochoerus was stored on floppy disks, and these data all died as their storage media became obsolete. Thrinaxodon marked a turning point. The emergence of CD-ROM afforded inexpensive and more enduring storage and dissemination of not only the raw data set, but also derivative digital products, including animations, volumetric models, and scientific reports on the data. When published in 1993 on CD-ROM (6), the Thrinaxodon archive consumed 623 Mb and demonstrated another trend, that derivative products consume larger volumes than the original data. The aforementioned DT-MRI data set would take ~100 Mb of storage, but its derivative products can easily take up 10 times that space. Data archiving and management thus become increasingly important in providing the access and information to facilitate the continuing use of such complex data, and require the development of a cyberinfrastructure commensurate with the evolving complexity of the raw data and its derivatives.

But how far into the future will today's voxels survive, and just how much will the immense public investment already made in voxels return? Their potential power in fueling science into the future is well illustrated by genetic sequences, a much younger data "species" than the voxel. The rapid growth of genetics was facilitated by the ability to (i) inexpensively digitize sequences, (ii) analyze them on inexpensive computers, (iii) share data across the Internet, and (iv) reuse individual data sets and derivative products (e.g., alignment data sets) that were assembled into large, publicly accessible collections governed by policy on ownership and release. The last element immensely amplified the value and return on sequence data. For voxels, this key element is missing.

Voxel-based data underlie thousands of publications. Yet we have no census of how many data sets survive, and the basic scientific tenet of data disclosure is unfulfilled. Only two small prototype biological voxel collections have appeared. The DigiMorph project (7) now serves HRXCT data sets from a collection of 1018 vouchered biological and paleontological specimens scanned by the University of Texas since 1992 (8), and the Digital Fish Library (DFL) project serves 295 MRI data sets (9) generated more recently at the University of California, San Diego, from specimens in the Scripps Institution of Oceanography Marine Vertebrates Collection (10). Both treat accessioned data sets much like vouchered specimens. The Web site for DigiMorph went live in 2002 and that for DFL in 2005, and together they have reached more than 3 million unique visitors who have downloaded 3 Tb of data. The two most popular specimens at DigiMorph have each been accessed by ~80,000 unique visitors. More than 100 peer-reviewed scientific papers, theses, and dissertations are published on data sets from the two sites. Both collections effectively increased access to expensive instrumentation at reduced costs and delivered high-quality data to a large, diversifying global audience for use in both research and education.

Fig. 2. Multimodal imaging, segmentation, and registration of Island Kelpfish, Alloclinus holderi. (A) Specimen (94 mm standard length); (B) 7T MRI sagittal slice from DFL (100-µm3 voxel size); (C) CT reconstruction from DigiMorph; (D) combined CT bone and MRI soft-tissue 3D reconstructions from DFL.

As with genetics, our experience predicts that data disclosure, reuse, query efficiency, outreach, and other advances will be driven by strategically designed voxel archives. As data volumes increase, one strategic element is compression. For example, abstracting data sets into small (~1 to 5 Mb) animations was crucial to Web dissemination by DigiMorph and DFL. But compression can destabilize data as computing environments evolve, and it poses risks to interpretation and validation. A best practice is to work from the original raw data, and DigiMorph and DFL have received increasing numbers of requests for the full-resolution voxel data sets, which are not yet available online. A great measure of return from the huge investment already made in voxels will depend on sustainable archives with life spans and capacities equal to the utility and size of the raw data and its derivatives.

Technology for producing a voxel commons is less challenging than addressing the void in policies for handling voxel data. Views on data

ownership, latency, and release, even within the academic community, are diffuse and polarized, which calls for standards set by publishers, societies, and funding agencies. Data quality and formats vary, and different disciplines have much to gain by developing their own standards, architectures, interface designs, and metadata tied to particular species of data. Many professional societies, however, still seem unaware that voxel data fundamental to new discoveries in their own disciplines are not being released, much less validated or reused, and in many cases, are not being saved or curated at all. Online supplemental data limits are too small for voxels, nor is it the role of publishers to manage primary data collections. Extending the GenBank model to voxels is a solution within reach, if not without obstacles. Sustained funding is paramount. Another need is professional advancement, still bound to metrics of conventional publication, while the more fundamental tasks of data generation and management go unrewarded. Young careers are still best served by publishing words and pixels, and abandoning used voxels to get on to the next project. As funding agencies pour increasing millions into scanners and scanning, only negligible funding and thought have gone to data archiving or leveraging their initial investments. There is urgency to act. As second-generation voxel scientists have now begun to retire, their data are on track to die with them, as it did with the first voxel pioneers, even as we now train a third generation in 3D imaging and computation. Funding agencies can rejoice in the unexpected longevity and growing value in voxels they have already produced. But they must first secure the basic tenet of science by ensuring that researchers have the means to archive, disclose, validate, and repurpose their primary data.
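One way to make such archives searchable is to attach a small, machine-readable metadata record to every accessioned data set. The Python sketch below shows what such a record might contain; the fields are an illustrative sketch, not a proposed community standard and not the schema actually used by DigiMorph or the DFL.

# Illustrative metadata record for an archived voxel data set (fields are
# hypothetical; a real standard would be set by the relevant community).
import json

record = {
    "specimen": "example vouchered specimen ID",
    "modality": "HRXCT",                  # or MRI, confocal, synchrotron, ...
    "voxel_size_um": [50.0, 50.0, 50.0],  # isotropic voxel edge lengths
    "dimensions": [1360, 800, 600],       # nx, ny, nz
    "bit_depth": 16,
    "compression": "none",                # raw data preserved; derivatives listed separately
    "derivatives": ["surface mesh", "animation"],
    "license_and_release": "to be defined by the archive's data policy",
}

print(json.dumps(record, indent=2))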
References and Notes 1. B. A. Price, I. S. Small, R. M. Baecker, in Proceedings of the 25th Hawaii International Conference on System Sciences, 7 to 10 January 1992, vol. 2, p. 597. 2. J. L. Contreras, Science 329, 393 (2010). 3. C. Hess, E. Ostrom, Understanding Knowledge as a Commons: From Theory to Practice (MIT Press, Cambridge, MA, 2007). 4. E. C. Beckmann, Br. J. Radiol. 79, 5 (2006). 5. G. C. Conroy, M. W. Vannier, Science 226, 456 (1984). 6. T. B. Rowe, W. Carlson, W. Bottorff, Thrinaxodon: Digital Atlas of the Skull (Univ. of Texas Press, Austin, TX, CD-ROM ed. 1, 1993). 7. www.DigiMorph.org/ 8. www.ctlab.geo.utexas.edu/ 9. http://digitalfishlibrary.org/ and http://csci.ucsd.edu/ 10. www.sio.ucsd.edu/ and http://collections.ucsd.edu/mv/ 11. This study was based on funding to L.R.F. from NSF (DBI-0446389) and T.R. from NSF (EAR-0948842, CNS-0709135, IIS-0208675, IIS-9874781, EAR-940625B, BSR-89-58092), the Intel Foundation, and the Jackson School of Geosciences. 10.1126/science.1202828

PERSPECTIVE

Advancing Global Health Research Through Digital Technology and Sharing Data
Trudie Lang
Global Health Clinical Trials, Centre for Tropical Medicine, University of Oxford, Oxford OX3 7LJ, UK. E-mail: trudie. [email protected]

The imperative for improving health in the world's poorest regions lies in research, yet there is no question that low participation, a lack of trained staff, and limited opportunities for data sharing in developing countries impede advances in medical practice and public health knowledge. Extensive studies are essential to develop new treatments and to identify better ways to manage healthcare issues. Recent rapid advances in availability and uptake of digital technologies, especially of mobile networks, have the potential to overcome several barriers to collaborative research in remote places with limited access to resources. Many research groups are already taking advantage of these technologies for data sharing and capture, and these initiatives indicate that increasing acceptance and use of digital technology could promote rapid improvements in global medical science.

Clinical research in the world's poorest regions lags behind the rest of the globe, but, because these communities carry the highest burden from disease, data from studies conducted in these areas could make the biggest impact on global health. Lack of trained staff, low investment in health research, and remote communities with poor infrastructure combine to make clinical research in these settings challenging (1). Innovative use of digital technology has the potential to drive important changes in global health research and in many cases is doing so already. Within the convention of collecting data through clinical studies, there are good examples of how novel data-capture mechanisms are being used across a spectrum of global health research areas. These technologies in combination with open access, data sharing, and

knowledge exchange (2) could transform clinical research in the world’s poorest regions. Communities most affected by diseases of poverty are held back in economic development by the perpetual cycle of ill health and low income. The world’s poorest communities experience many health issues, often concomitantly. Therefore, rather than working in separation on a specific disease, groups need to work together on complex and overlapping challenges (3). This paper surveys how digital technology is being harnessed to capture, record, store, combine, and share a diversity of data sets. Nevertheless, there are both practical and notional problems to overcome if this technology is to deliver its full potential in advancing improvements in global health. Data Capture and Sharing in Global Health Research The process of tackling a disease or health issue begins by characterizing the problem, initially by the collection of epidemiology and

laboratory data. Once a disease or health issue is understood (even if only in part), targets can be sought and interventions designed. An intervention could be a new drug or vaccine, but it might be a new way of diagnosing a disease, introducing a change in the training of healthcare workers, or a new practice in managing a disease or healthcare situation. Any new interventions like these must be tested through clinical trials, and ongoing observational studies are necessary to support implementation and policy by monitoring the effectiveness of the intervention, its cost effectiveness, and associated changes to quality of life. These measures are required in any healthcare setting, but in the field of global health capturing data to address these questions can be problematic if populations are hard to reach, infrastructure is poor, and there is a lack of trained staff to conduct such studies. Technologies, ranging from geographic information systems to gene sequence databases, are beginning to be used to address key public health questions in resource-limited settings and are already making impressive advances. Data to Quantify and Describe Health Issues In developing countries, key data from health and demographic surveillance systems have to be accessed by both researchers and policymakers; hence, there are efforts to standardize and link such databases. In remote areas, handheld satellite Global Positioning Systems are proving valuable for surveying communities and collecting data to characterize and quantify diseases and to identify exactly where a new intervention or change in management is needed (Fig. 1). The International Network for the Demographic Evaluation of Populations and Their Health in Developing Countries (INDEPTH) is a global network engaged in conducting longi-

tudinal health and demographic evaluation of populations in low- and middle-income countries (4). The capacity of this and similar networks has been greatly enhanced by advances in satellite technology, Internet access speed, and increased access to handheld data-entry devices. Development of open-source software such as openXdata (5) has transformed surveillance studies by bringing scale, quality, coverage, and, importantly, knowledge and sharing of best practice. These surveillance networks are succeeding as groups become willing to share data and collaborate and are beginning to make global disease surveillance a reality. Similarly, the East African Community has made an important step forward in establishing the East African Integrated Disease Surveillance Network (EAIDSNet), in which countries share data on communicable diseases to improve public health in their region (6). In Asia, the South East Asia Infectious Diseases Clinical Research Network is successfully sharing data across dozens of sites and several countries and makes many resources for researchers available on the Web (7, 8). There are many other types of networks operating in combination to give a more comprehensive understanding of specific diseases, and when data are combined a powerful picture emerges. The Malaria Atlas Project (9) works with geographers, statisticians, epidemiologists, biologists, and public health specialists in endemic countries to assemble a spatial database combining medical intelligence and satellite-derived climate data to define the limits of malaria transmission. This initiative has succeeded in compiling an archive of community-based estimates of parasite prevalence drawn from 85 countries. In the search for new interventions to combat diseases of poverty, sharing and linking databases are clearly advantageous. Assembling findings on biochemical targets, genetics studies, and the pathogens themselves is vital to improve the speed of new drug and vaccine development. One such project is MalariaGen (10), in which researchers from 21 countries are collaborating to build a malaria genome database. This project has addressed some of the fundamental challenges that are inherent to data sharing, such as the ethical challenges of recruiting participants and setting out clear agreements on data linking and release. Establishing a set of policies to address these issues has worked well for this network (11). Measuring the Potential of New Interventions Clinical trials are highly challenging in resource-limited settings, but obstacles are being countered by remote data-collection technology, as well as by distance learning and knowledge-sharing strategies. Open-source clinical trial data management systems (Fig. 2) permit international standard data management for noncommercial organiza-

tions by removing the cost impediments of commercial clinical trial software (12). The rapid expanse of mobile phone networks, and more recently 3G (third-generation) technology, has transformed the potential for electronic direct data entry in even the most remote corners of the globe. The continual evolution of mobile networks expands the possibility for clinical trials to be conducted to high data-quality standards in many regions and increases the professional networking opportunities for local researchers.

Fig. 1. A health worker uses EpiSurveyor (www.episurveyor.org) free mobile data-collection software in Cameroon. [Photo reproduced with permission from J. Selanikio, www.DataDyne.org]

A scarcity of trained staff has limited some countries' capacity to design and manage independent trial programs without the involvement of external sponsors (13). Knowledge sharing will not only increase the numbers of skilled staff but also improve methods. The wiki concept (Web sites where content can be openly shared, changed, and developed) is likely to be an important contribution, because through this route researchers can share tools and protocols and improve the design and conduct of clinical trials. Large clinical trials to establish novel approaches to disease management are needed, for example, to evaluate new uses for antibiotics or to assess treatment and management options for a variety of healthcare issues. Such trials generate large data sets that are too cumbersome for conventional publication, but increasingly these are being made available online. A good example of

one recent large trial is the AQUAMAT study of hospitalized malaria patients, which enrolled 5425 patients across 11 centers in nine African countries. This trial provides supplementary data online, including enrollment figures, endpoint review, and quality assessments (14), all of which will benefit others planning similar trials or policymakers needing more detail than that provided in the primary publication. Data to Drive Policy Change and Support Implementation Licensing of new drugs and vaccines, or gaining evidence for a new public health measure, is not the end in terms of collecting and sharing data. Changes in national treatment and management policies require ongoing data on safety and efficacy, quality of life outcomes, and health economic impacts; hence, it is important that after implementation of any new intervention data continue to be gathered. For instance, pathogens can become resistant to drugs; hence, networks have been established that monitor changes in drug efficacy. For example, the World Wide AntiMalarial Resistance Network of disease-endemic country researchers collates data to inform and respond rapidly to the malaria parasite’s ability to adapt to drug treatments (15). Pharmacovigilance provides long-term safety monitoring vital to the success of drug implementation programs, but these activities generate

large volumes of data that are difficult to handle. The World Health Organisation Collaborating Centre for Advocacy and Training in Pharmacovigilance aims to improve drug monitoring in developing countries (16) through the provision of large shared databases, training programs, and resource development. Another project that embraces the advantages of open-source software and the ethos of data sharing is the Millennium Global Village Project in sub-Saharan Africa. In this project, many ongoing research activities are linked by their Web-based program (17). Organizations are also using digital technology to support point-of-care diagnostics and treatment in remote regions poorly supplied with medical expertise and where treatment is hard to access, through the use of decision tree tools and access to resources and guidance (18). There's no doubt that taking knowledge to the community via smart phones and laptops holds enticing potential for health care under any circumstances.

Fig. 2. Global Health Trials is a free, open-access collaborative program that aims to promote and to make easier the conduct of noncommercial clinical trials across all diseases in resource-poor settings by providing guidance and support and enabling the sharing of best practice. [Image reproduced with permission from www.globalhealthtrials.org]

Issues and Challenges of Data Sharing and Capture The examples present the illusion that digital technology is being readily adopted in global health research; unfortunately, this is not yet the case. The reality is one of unreliable electricity supply, not enough computers, and numerous and seemingly trivial (but cumulatively limiting) frustrations. Even simple equipment failures can become insuperable problems if parts are expensive to obtain and engineers are rare. In describing the issues they faced in setting up an electronic system to support clinical decisions in HIV care, a group in Kenya concluded that the ability to recognize and adapt to the specific needs of resource-limited settings was fundamental to successful implementation (19). Thus, the difficulties of applying digital technology and data capture in developing-world settings are both practical and philosophical. The practical challenges encompass the range of physical and technical mechanisms of capturing, handling, and storing the data. This includes ac-

cess to the technology and equipment, as well as skills training. There are still gaps in funding and knowledge to be met. The matters of individual and organizational attitudes to digital technology and adoption of a wiki culture are not so easy to address and depend on the acceptance and understanding of new technology and concepts to realize the potential they hold. An essential component for successful adoption of digital media is the willingness of scientific communities to share data (20). Researchers acknowledge that data sharing increases the impact, utility, and profile of their work. Conversely, research is highly competitive (21, 22), and publications depend on individual ability to produce novel data, which can be a disincentive for collaboration. There are also major ethical considerations in sharing data between researchers and between countries and in making data available for open access. The issues around consent and ownership are yet more complex within networks. Common frameworks and defined principles first

need to be established if a data-sharing network is to succeed, particularly when it comes to the ethical and privacy issues surrounding patient data (23, 24). Shifting Attitudes Widely dispersed researchers in resource-limited countries may have few opportunities to travel to courses or attend meetings, but they can meet online and share experiences, guide each other, and access resources. Learning and knowledge sharing online could play a vital role in adjusting the imbalance in research capacity. However, this medium for learning needs to become accepted, and senior research staff need to encourage and enable their colleagues to take up the numerous free and open-access learning opportunities that are increasingly available online (13). Undoubtedly, integration and knowledge sharing can be vastly improved to make the most use of gathered data, but many organizations in global health exist to address a single disease or work in a specific sector. There is a real need for mechanisms allowing research organizations, governments, and universities to collaborate outside their usual remits and locations to maximize the impact of data and available resources. Governance and ethical issues are also a major concern, because if mistakes are made trust will be quickly lost and enthusiasm for open-
ing access could be stifled. A particular anxiety resulting from disparities between wealthy and resource-limited nations is the removal of data and loss of ownership. Ownership and governance arrangements need to be made transparently for fair access and maintenance of security, and whenever possible the technology should be transferred rather than the data. These issues therefore need to be tackled openly and comprehensively early in the formation of data-sharing collaborations. Groups would be advised to seek advice and obtain example policy documents (such as agreements and terms of reference) from other successful data-sharing groups. A striking range of data sets spanning a wide range of healthcare issues, including infectious and noncommunicable diseases, are accumulating with use of new technology and online collaboration. All this stands to make real changes in the lives of people affected by diseases of poverty. While scientists are rapidly adapting and taking up these approaches, funding agencies and regulators also need to adapt to ensure that all interested communities are able to take maximum advantage of the digital environment to drive improvements in global health. References and Notes 1. P. Mwaba, M. Bates, C. Green, N. Kapata, A. Zumla, Lancet 375, 1874 (2010).
2. E. Wenger, W. Snyder, Harv. Bus. Rev. 2000, 139 (Jan.-Feb. 2000). 3. A. de-Graft Aikins et al., Global. Health 6, 5 (2010). 4. P. Kowal et al., Glob. Health Action 3 (suppl. 2), 10.3402/gha.v3i0.5302 (2010). 5. OpenXData, www.openxdata.org. 6. EAIDSNet, www.eac.int. 7. H. F. Wertheim et al., PLoS Med. 7, e1000231 (2010). 8. The South East Asia Infectious Disease Clinical Research Network, www.seaicrn.org. 9. S. I. Hay, R. W. Snow, PLoS Med. 3, e473 (2006). 10. The Malaria Genomic Epidemiology Network, Nature 456, 732 (2008). 11. M. Parker et al., PLoS Med. 6, e1000143 (2009). 12. G. W. Fegan, T. A. Lang, PLoS Med. 5, e6 (2008). 13. T. A. Lang et al., PLoS Negl. Trop. Dis. 4, e619 (2010). 14. A. M. Dondorp et al., Lancet 376, 1647 (2010). 15. P. J. Guerin, S. J. Bates, C. H. Sibley, Curr. Opin. Infect. Dis. 22, 593 (2009). 16. M. Pirmohamed, K. N. Atuah, A. N. Dodoo, P. Winstanley, Br. Med. J. 335, 462 (2007). 17. A. S. Kanter et al., Int. J. Med. Inf. 78, 802 (2009). 18. D-Tree International, www.d-tree.org/. 19. S. F. Noormohammad et al., Int. J. Med. Inf. 79, 204 (2010). 20. B. A. Fischer, M. J. Zigmond, Sci. Eng. Ethics 16, 783 (2010). 21. E. Pisani, C. AbouZahr, Bull. W. H. O. 88, 462 (2010). 22. J. Whitworth, Bull. W. H. O. 88, 467 (2010). 23. B. Malin, D. Karp, R. H. Scheuermann, J. Investig. Med. 58, 11 (2010). 24. R. Horton, Lancet 355, 2231 (2000). 25. The author received no specific funding for this work and has no conflicts of interest to declare. 10.1126/science.1199349

PERSPECTIVE

More Is Less: Signal Processing and the Data Deluge
Richard G. Baraniuk
Department of Electrical and Computer Engineering, Rice University, Houston, TX 77251–1892, USA. E-mail: [email protected]

The data deluge is changing the operating environment of many sensing systems from data-poor to data-rich––so data-rich that we are in jeopardy of being overwhelmed. Managing and exploiting the data deluge require a reinvention of sensor system design and signal processing theory. The potential pay-offs are huge, as the resulting sensor systems will enable radically new information technologies and powerful new tools for scientific discovery.

Until recently, the scientist's problem was a "sensor bottleneck." Sensor systems produced scarce data, complicating subsequent information extraction and interpretation. In response to the resulting challenge of "doing more with less," signal-processing researchers have spent the last several decades creating powerful new theory and technology for digital data acquisition (digital cameras, medical scanners), digital signal processing (machine vision; speech, audio, image, and video compression), and digital communication (high-speed modems, Wi-Fi)

that have both enabled and accelerated the information age. These hardware advances have fueled an even faster exponential explosion of sensor data produced by a rapidly growing number of sensors of rapidly growing resolution. Digital camera sensors have dropped in cost to nearly $1/megapixel; this has enabled billions of people to acquire and share high-resolution images and videos. Millions of security and surveillance cameras, including unmanned drone aircraft prowling the skies, have joined high-resolution telescopes, digital radio receivers, and many other types of sensors in the environment. As a result, a sensor data deluge is beginning to swamp many of today's critical sensing systems.


In just a few years, the sensor data deluge has shifted the bottleneck of many data acquisition systems from the sensor back to the processing, communication, or storage subsystems (Fig. 1). To see why, consider the exponentially growing gap between global sensing and data storage capabilities. A recent report (1) found that the amount of data generated worldwide (which is now dominated by sensor data) is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data—more bits than all of the stars in the universe. In contrast, the total amount of world data storage (in hard drives, memory chips, and tape) is growing 31% slower, at only 40% per year. A milestone was reached in 2007, when the world produced more data than could fit in all of the world’s storage; in 2011 we already produce over twice as much data as can be stored. This expanding gap between sensor data production and available data storage means that sensor systems will increasingly face a deluge of data that will be unavailable later for further analysis. Similar exponentially expanding gaps exist between sensor data production and both computational power and communication rates. The danger is that more sensor data can lead to less efficient sensor systems. Consider two brief illustrations. The first is the Defense Advanced Research Projects Agency (DARPA) Autonomous Real-Time Ground Ubiquitous


Surveillance Imaging System (ARGUS-IS) developed for military reconnaissance and real-time monitoring that features a 1.8-gigapixel digital camera constructed from hundreds of cell phone camera chips (2). Each camera image covers up to 160 km² (almost the size of greater Los Angeles) with a 30-cm ground resolution. When acquiring video at 15 frames per second, the camera produces raw data at a rate of 770 gigabits per second (Gbps). In stark contrast, the wireless communications link to the ground station (where the data are to be exploited by signal processing algorithms) has a maximum rate of just 274 megabits per second (Mbps). Even using today’s state-of-the-art video compression algorithms, the camera sensor produces hundreds of times more image and video data than can ever be communicated off the platform. Moreover, moving the ground station’s signal processing hardware up to the sensing platform is out of the question, because it occupies several large racks of computers.

Fig. 1. Dealing with the sensor data deluge. In a conventional sensing system (top), the sensor is the performance bottleneck. In a data deluge–era sensing system (bottom), the number and resolution of the sensors grow to the point that the performance bottleneck moves to the sensor data processing, communication, or storage subsystem.

The second example is the Compact Muon Solenoid (CMS) detector of the Large Hadron Collider at CERN, which will produce raw measurement data at a rate of 320 terabits per second (Tbps), far beyond the capabilities of either processing or storage systems today (3). As a stopgap measure, custom hardware carefully triages the raw data stream to a rate of 800 Gbps by selecting only the potentially “interesting” events, which are then fed into a real-time computing farm to further process and compress for storage. All other events are lost in the acquisition process.

Given the growing gap between the amount of data we produce and the amount of data we can process, communicate, and store, systems like ARGUS-IS and the CMS will become more the norm than the exception over time. Successfully navigating the data deluge calls for fundamental advances in the theory and practice of sensor design; signal processing algorithms; wideband communication systems; and compression, triage, and storage techniques.
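To make the scale of the mismatch concrete, the short sketch below compounds the growth rates and data rates quoted above. It is a back-of-the-envelope illustration only: the constant year-over-year growth rates are a simplifying assumption, and the variable names are ours rather than anything from the cited report.

```python
# Rough arithmetic behind the figures quoted above (growth rates held
# constant, which is an illustrative simplification).
data_growth = 1.58      # worldwide data generated grows ~58% per year
storage_growth = 1.40   # worldwide storage capacity grows ~40% per year

gap_per_year = data_growth / storage_growth
for years in (1, 5, 10):
    print(f"after {years:2d} years, data outgrows storage by {gap_per_year ** years:5.1f}x")

# ARGUS-IS: raw camera output versus the downlink to the ground station.
raw_gbps = 770.0        # 1.8-gigapixel camera at 15 frames per second
link_gbps = 274e-3      # 274 Mbps wireless link
print(f"compression needed just to fit the link: {raw_gbps / link_gbps:,.0f}x")
```

Even with the production/storage gap compounding at only about 13% per year, a decade is enough for it to exceed a factor of 3; and the ARGUS-IS raw ratio is roughly 2,800 to 1, so even aggressive video compression leaves the downlink oversubscribed by hundreds of times, as noted above.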


A recent Frontiers of Engineering event examined some of the encouraging preliminary results in these directions (4). One promising direction is the design of new kinds of data acquisition systems that replace conventional sensors with compressive sensors that combine sensing, compression, and data processing in one operation. The key enabler is the recognition that the amount of information in many interesting signals is much smaller than the amount of raw data produced by a conventional sensor. More technically, many interesting signals inhabit an extremely low-dimensional subset of the high-dimensional raw sensor data space. Rather than first acquiring a massive amount of raw data and then boiling it down into information via signal processing algorithms, compressive sensors attempt to acquire the information directly.
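As a toy illustration of this idea, the sketch below builds a synthetic K-sparse signal, takes a small number of random measurements, and recovers the signal with a textbook orthogonal matching pursuit. The problem sizes, the Gaussian measurement matrix, and the choice of recovery algorithm are all assumptions made for the demonstration; they are not drawn from the article.

```python
# Minimal compressive-sensing sketch: recover a K-sparse, length-N signal from
# M << N random measurements y = Phi @ x via orthogonal matching pursuit (OMP).
import numpy as np

rng = np.random.default_rng(0)
N, K = 512, 10                                  # signal length and sparsity
M = int(np.ceil(2 * K * np.log(N)))             # on the order of K log N measurements

x = np.zeros(N)                                 # K-sparse ground-truth signal
support = rng.choice(N, size=K, replace=False)
x[support] = rng.standard_normal(K)

Phi = rng.standard_normal((M, N)) / np.sqrt(M)  # random (Gaussian) measurement operator
y = Phi @ x                                     # compressive measurements

# OMP: greedily add the column most correlated with the residual, then refit
# by least squares on the selected support.
residual, selected = y.copy(), []
for _ in range(K):
    selected.append(int(np.argmax(np.abs(Phi.T @ residual))))
    coef, *_ = np.linalg.lstsq(Phi[:, selected], y, rcond=None)
    residual = y - Phi[:, selected] @ coef

x_hat = np.zeros(N)
x_hat[selected] = coef
print(f"{M} measurements instead of {N} samples; "
      f"relative error {np.linalg.norm(x_hat - x) / np.linalg.norm(x):.2e}")
```

In this regime the sensor records roughly a quarter of the nominal samples, yet the sparse structure allows the full signal to be reconstructed; the conditions under which such recovery succeeds are developed in the references below.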


Such low-dimensional signal structure may manifest itself in a number of different ways. In a sparse signal model, N raw data samples can be transformed to a domain where only K (much less than N) representation coefficients are nonzero (5, 6). Sparse models lie at the heart of popular compression and processing algorithms such as JPEG. In a manifold signal model, the raw data can be parameterized (nonlinearly, in general) using just K parameters (7). Such a model is natural for imaging problems involving a known object and K unknown camera parameters. Recent research on compressive sensing has led to two results that in combination promise to temper the data deluge. First, signals from both sparse and manifold models can be acquired without information loss using just on the order of K log N compressive measurements rather than N raw-data measurements (5, 6, 8). Second, a range of different signal processing algorithms can extract the salient signal characteristics directly from the low-rate compressive measurements (9). The sensing protocols that achieve this low measurement rate are inherently random and distinct from the classical Shannon-Nyquist sampling theory that dominates digital sensing theory and practice.

In another promising direction, researchers are turning the data deluge to their advantage by replacing conventional signal processing algorithms based on mathematical models with new algorithms that mine the deluge. One striking example is a tool that fuses a large collection of unorganized images of a scene (say, photos of Notre Dame cathedral from the photo-sharing Web site Flickr) and automatically computes each photo’s viewpoint and a three-dimensional model of the scene (10).

In the long run, without radical superexponential advances in computer processing, communication, and storage capabilities, the data deluge is here to stay. The next generation of sensor designs and signal processing theory will have to harness the deluge in order to do more, rather than less, with its bounty. The broader implications for science and engineering are appreciable. Can scientific conclusions be trusted when the raw experimental data are lost and the data triage or compression algorithm might be suspect? Can we resist the temptation to equate correlation with causation when mining massive data sets for scientific conclusions? Can we develop the new low-complexity mathematical models and the new practical sensing protocols that are needed to effectively extract information from the bulk of the deluge? Clearly, these are exciting times for sensor system design.

References
1. J. Gantz, D. Reinsel, “The Digital Universe Decade—Are You Ready?” IDC White Paper, May 2010; http://idcdocserv.com/925.
2. DARPA ARGUS-IS program, www.darpa.mil/i2o/programs/argus/argus.asp.


3. The CMS Collaboration, J. Instrumentation 3, S08004 (2008).
4. U.S. National Academy of Engineering and Royal Academy of Engineering, Frontiers of Engineering, EU-US Symposium, Cambridge, UK, 31 August to 3 September 2010; www.raeng.org.uk/international/activities/frontiers_engineering_symposium.htm.
5. E. J. Candès, J. Romberg, T. Tao, IEEE Trans. Inf. Theory 52, 489 (2006).
6. D. L. Donoho, IEEE Trans. Inf. Theory 52, 1289 (2006).
7. J. B. Tenenbaum, V. de Silva, J. C. Langford, Science 290, 2319 (2000).
8. R. G. Baraniuk, M. B. Wakin, Found. Comput. Math. 9, 51 (2009).
9. S. Muthukrishnan, Found. Trends Theor. Comput. Sci. 1 (issue 2), 117 (2005).
10. N. Snavely, S. M. Seitz, R. Szeliski, ACM Trans. Graph. 25, 835 (2006).

10.1126/science.1197448

PERSPECTIVE

Ensuring the Data-Rich Future of the Social Sciences

Gary King

Institute for Quantitative Social Science, 1737 Cambridge Street, Harvard University, Cambridge, MA 02138, USA. E-mail: [email protected]

Massive increases in the availability of informative social science data are making dramatic progress possible in analyzing, understanding, and addressing many major societal problems. Yet the same forces pose severe challenges to the scientific infrastructure supporting data sharing, data management, informatics, statistical methodology, and research ethics and policy, and these are collectively holding back progress. I address these changes and challenges and suggest what can be done.

Fifteen years ago, Science published predictions from each of 60 scientists about the future of their fields (1). The physical and natural scientists wrote about a succession of breathtaking discoveries to be made, inventions to be constructed, problems to be solved, and policies and engineering changes that might become possible. In sharp contrast, the (smaller number of) social scientists did not mention a single problem they thought might be addressed, much less solved, or any inventions or discoveries on the horizon. Instead, they wrote about social science scholarship—how we once studied this, and in the future we’re going to be studying that. Fortunately, the editor’s accompanying warning was more prescient: “history would suggest that scientists tend to underestimate the future” (2). Indeed. What the social scientists did not foresee in 1995 was the onslaught of new social science data—enormously more informative than ever before—and what this information is now making possible. Today, huge quantities of digital information about people and their various groupings and connections are being produced by the revolution in computer technology, the analog-to-digital transformation of static records and devices into easy-to-access data sources, the competition among governments to share data and run randomized policy experiments, the new technology-enhanced ways that people interact, and the many commercial entities creating and monetizing new forms of data collection (3). Analogous to what it must have been like when they first handed out microscopes to mi-

crobiologists, social scientists are getting to the point in many areas at which enough information exists to understand and address major previously intractable problems that affect human society. Want to study crime? Whereas researchers once relied heavily on victimization surveys, huge quantities of real-time geocoded incident reports are now available. What about the influence of citizen opinions? Adding to the venerable random survey of 1000 or so respondents, researchers can now harvest more than 100 million social media posts a day and use new automated text analysis methods to extract relevant information (4). At the same time, parts of the biological sciences are effectively becoming social sciences, as genomics, proteomics, metabolomics, and brain imaging produce large numbers of personlevel variables, and researchers in these fields join in the hunt for measures of behavioral phenotypes. In parallel, computer scientists and physicists are delving into social science data with their new methods and data-collection schemes. The potential of the new data is considerable, and the excitement in the field is palpable. The fundamental question is whether researchers can find ways of accessing, analyzing, citing, preserving, and protecting this information. Although information overload has always been an issue for scholars (5), today the infrastructural challenges in data sharing, data management, informatics, statistical methodology, and research ethics and policy risk being overwhelmed by the massive increases in informative data. Many social science data sets are so valuable and sensitive that when commercial entities collect them, external researchers are granted almost no access. Even when sensitive data are collected originally by researchers or acquired from



corporations, privacy concerns sometimes lead to public policies that require the data be destroyed after the research is completed—a step that obviously makes scientific replication impossible (6) and that some think will increase fraudulent publications (7). Indeed, we appear to be in the midst of a massive collision between unprecedented increases in data production and availability about individuals and the privacy rights of human beings worldwide, most of whom are also effectively research subjects (Fig. 1). Consider how much more informative to researchers, and potentially intrusive to people, the new data can be. Researchers now have the possibility of continuous-time location information from cell phones, Fastlane or EZPass transponders, IP addresses, and video surveillance. We have information about political preferences from person-level voter registration, primary participation, individual campaign contributions, signature campaigns, and ballot images. Commercial information is available from credit card transactions, real estate purchases, wealth indicators, credit checks, product radio-frequency identification (RFIDs), online product searches and purchases, and device fingerprinting. Health information is being collected via electronic medical records, hospital admittances, and new devices for continuous monitoring, passive heart beat measurement, movement indicators, skin conductivity, and temperature. Extensive quantities of information in unstructured textual format are being produced in social media posts, e-mails, product reviews, speeches, government reports, and other Web sources. Satellite imagery is increasing in resolution and scholarly usefulness. Social everything— networking, bookmarking, highlighting, commenting, product reviewing, recommending, and annotating—has been sprouting up everywhere on the Web, often in research-accessible ways. Participation in online games and virtual worlds produces even more detailed data. Commercial entities are scrambling to generate data to improve their business operations through tracking employee behavior, Web site visitors, search patterns, advertising click-throughs, and every manner of cloud services that capture more and more information. Efforts in the social sciences that make data, code, and information associated with individual published articles available to other scholars have been advancing through software, journal policies, and improved researcher practices for some time (8, 9). However, this movement is at risk of


collapsing unless the improvements in methods for sharing sensitive, private, or proprietary data (10) are able to be modified fast enough to keep up with the changes in the types and quantities of data becoming available and unless public policy adapts to permit and encourage researchers to use them. The necessary technological innovations are more difficult than it may seem. For example, the venerable strategy of anonymizing data is not very useful when, for example, date of birth, gender, and ZIP code alone are enough to personally identify 87% of the U.S. population (11). And the cross-classification of 10 survey questions of 10 categories each contains more unique classifications than there are people on the planet. And now think of the challenges of sharing continuous-time cell phone–location information from a whole city, or biological information with hundreds of thousands of variables. The political situation is also complicated, with a media storm generated by each new revelation of how personal information is becoming publicly available, but at the same time citizens are voluntarily giving up more privacy than ever, such as via the rapid transition from private e-mail to public or semi-public social media posts.

If privacy can be protected in a way that still allows data sharing, considerable progress can be made for people everywhere without harm coming to any one research subject. This seems easier than, for example, the situation with most randomized medical experiments, in which if everything works as expected those in one treatment arm will be harmed relative to those in the other arms. Moreover, most concern about data sharing involves individuals, whereas social scientists usually seek to make generalizations about aggregates, and so spanning the divide is often possible with appropriate statistical methods.

Fig. 1. New types of research data about human behavior and society pose many opportunities if crucial infrastructural challenges are tackled. (The figure arranges phrases such as “Some privacy please,” “Scholarly credit,” “Legal support,” “Replication,” and “Interoperate across fields” around the title “THE FUTURE OF SOCIAL SCIENCE DATA.”)

What can we do to take advantage of the new data while facilitating data sharing and at the same time protecting privacy? First, before we try to convince other parts of society to give us some leeway, we social scientists need to get our own act together. At present, large data sets collected by social scientists in most fields are routinely shared, but the far more prevalent smaller data sets that are unique or derived from larger data sets are regularly lost, hidden, or unavailable—often making the related publications unreplicable. In most cases, many data sets associated with individual publications, and the related computer code and other information necessary to reproduce the published tables and figures from the input data, are not available unless you obtain permission of the original author, with no enforceable rules governing when access must be provided. This deserves serious reconsideration and action. We need to devolve Web visibility and scholarly credit for the data to the original author while ensuring that the data are professionally archived with access standards formalized in rules that do not require ad hoc decisions of or control by the original author (12, 13).

Second, we need to nurture the growing replication movement (14, 15). More individual scholars should see it as their responsibility to deposit data and replication information in public archives, such as those associated with the Data Preservation Alliance for the Social Sciences (16). More journals should encourage or require authors to make data available as a condition of publication, and granting agencies should continue to encourage data sharing norms. More importantly, when we teach we should explain that data sharing and replication is an integral part of the scientific process. Students need to understand that one of the biggest contributions they or anyone is likely to be able to make is through data sharing (8).

Third, we need to continue research into privacy-enhanced data sharing protocols (10) and to communicate better what is possible to government officials. Modern technology allows hundreds of millions of people to do electronic banking, commerce, and investing on the web; to view their personal medical records; to store their photographs, videos, and personal documents online; and to share with selected individuals their most private thoughts and secrets. So why, when analyzing these and other personally identifiable sensitive data for the public good, does policy regularly require researchers (through university Institutional Review Boards) to do their work in locked rooms without access to the Internet, other data sources, electronic communication with other researchers, or many of their usual software and hardware tools? Surely we can develop policies, protocols, legal standards, and computer security so that privacy can be maintained while data sharing and analysis proceeds in far more convenient, efficient, and productive ways. Progress in social science research would be greatly accelerated if policies merely allowed researchers more often—as they do corporations, governments, and private citizens—to analyze sensitive data using appropriate digital rather than physical security.

Fourth, even when privacy is not an issue, data sharing involves more than putting the data on a Web site. Scientists and editors of scholarly journals are not professional archivists, and many homegrown one-off solutions do not last long. Data formats have been changing so fast that archiving standards require special preservation formatting, using internationally agreed-upon metadata protocols and appropriate data citation standards. Social scientists need to continue to build a common, open-source, collaborative infrastructure that makes data analysis and sharing easy (9, 16). However, unless we are content to let data sharing work only within disciplinary silos—which of course makes little sense in an era when social science research is more interdisciplinary than ever—we need to develop solutions that operate, or at least interoperate, across scholarly fields.

Last, social scientists could use additional help from the legal community (17). Standard intellectual property rules and data use agreements need to be developed so that every data set does not have its own essentially artisan legal work that merely increases transaction costs and reduces data sharing. The federal government should reconsider and relax the rules that prevent academic researchers from collecting, sharing, and publishing from data that those in other sectors of society do routinely.

Of course, social scientists have plenty to do even before we publish and share data. We must find ways of educating students about nonstandard data types, computational methods that scale, legal protocols, data sharing norms, and statistical tools that can take advantage of the new opportunities. Data are now arriving fast enough that the work life of many current social scientists is observably changing: Whereas they once sat in their offices working on their own, rates of co-authorship are increasing fast, and a collaborative laboratory-type work model is emerging in many subfields. These trends would be greatly facilitated by universities and funding agencies recognizing the need to build the infrastructure to support social science research. For the first time in many areas of the social sciences, new forms and quantities of information may well make dramatic progress possible. Will we be ready?
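The anonymization arithmetic cited earlier (10 survey questions with 10 answer categories each) is easy to verify; the snippet below does so under the crude assumption of a world population of about 7 billion, a round number chosen only for illustration.

```python
# Back-of-the-envelope check of the re-identification arithmetic above.
# The world-population figure is an assumed round number, not from the article.
world_population = 7e9

questions, categories = 10, 10
cells = categories ** questions          # distinct answer profiles: 10**10
print(f"{cells:,} possible profiles vs. {world_population:,.0f} people")
print("average people per profile:", world_population / cells)
```

With fewer people than profiles, the average cell holds less than one person, so many respondents are uniquely identified by their answers alone, which is why simple anonymization fails.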

References and Notes 1. H. Weintraub et al., Science 267, 1609 (1995). 2. D. E. Koshland, Science 267, 1575 (1995). 3. G. King, K. Scholzman, N. Nie, Eds., The Future of Political Science: 100 Perspectives (Routledge, New York, 2009), pp. 91–93. 4. D. Hopkins, G. King, Am. J. Pol. Sci. 54, 229 (2010). 5. A. M. Blair, Too Much to Know: Managing Scholarly Information before the Modern Age (Yale Univ. Press, New Haven, 2010). 6. C. Mackie, N. Bradburn, Eds., Improving Access to and Confidentiality of Research Data (National Research Council, Washington, DC, 2000), p. 49. 7. R. F. White, The Independent Review XI, 547 (2007). 8. G. King, PS Pol. Sci. Polit. 39, 119 (2006).

Metaknowledge

James A. Evans* and Jacob G. Foster

The growth of electronic publication and informatics archives makes it possible to harvest vast quantities of knowledge about knowledge, or “metaknowledge.” We review the expanding scope of metaknowledge research, which uncovers regularities in scientific claims and infers the beliefs, preferences, research tools, and strategies behind those regularities. Metaknowledge research also investigates the effect of knowledge context on content. Teams and collaboration networks, institutional prestige, and new technologies all shape the substance and direction of research. We argue that as metaknowledge grows in breadth and quality, it will enable researchers to reshape science—to identify areas in need of reexamination, reweight former certainties, and point out new paths that cut across revealed assumptions, heuristics, and disciplinary boundaries.

Department of Sociology, University of Chicago, Chicago, IL 60637, USA. *To whom correspondence should be addressed. E-mail: [email protected]

What knowledge is contained in a scientific article? The results, of course; a description of the methods; and references that locate its findings in a specific scientific discourse. As an artifact, however, the article contains much more. Figure 1 highlights many of the latent pieces of data we consider when we read a paper in a familiar field, such as the status and history of the authors and their institutions, the focus and audience of the journal, and idioms (in text, figures, and equations) that index a broader context of ideas, scientists, and disciplines. This context suggests how to read the paper and assess its importance. The scope of such knowledge about knowledge, or “metaknowledge,” is illustrated by comparing the summary information a first-year graduate student might glean from reading a collection of scientific articles with the insight accessible to a leading scientist in the field. Now consider the perspective that could be gained by a computer trained to extract and systematically analyze information across millions of scientific articles (Fig. 1).

Metaknowledge results from the critical scrutiny of what is known, how, and by whom. It can now be obtained on large scales, enabled by a concurrent informatics revolution. Over the past 20 years, scientists in fields as diverse as molecular biology and astrophysics have drawn on the power of information technology to manage the growing deluge of published findings. Using informatics archives spanning the scientific process, from data and preprints to publications and citations, researchers can now track knowledge claims across topics, tools, outcomes, and institutions (1–3). Such investigations yield metaknowledge about the explicit content of science, but also expose implicit content—beliefs, preferences, and research strategies that shape the direction, pace, and substance of scientific discovery. Metaknowledge research further explores the interaction of knowledge content with knowledge context, from features of the scientific system such as multi-institutional collaboration (4) to global trends and forces such as the growth of the Internet (5).

The quantitative study of metaknowledge builds on a large and growing corpus of qualitative investigations into the conduct of science from history, anthropology, sociology, philosophy, psychology, and interdisciplinary studies of science. Such investigations reveal the existence of many intriguing processes in the production of scientific knowledge. Here, we review quantitative assessments of metaknowledge that trace the distribution of such processes at large scales. We argue that these distributional assessments, by characterizing the interaction and relative importance of competing processes, will not only provide new insight into the nature of science but will create novel opportunities to improve it.
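As a hint of what such machine reading might look like in practice, here is a deliberately tiny sketch that tallies term usage over time and co-authorship ties from bibliographic records; the record format and the sample entries are hypothetical, invented purely for illustration.

```python
# A toy illustration of machine-readable "metaknowledge": given bibliographic
# records (a small invented sample), tally how often a term appears in
# abstracts per year and who co-authors with whom.
from collections import Counter, defaultdict
from itertools import combinations

records = [
    {"year": 1998, "authors": ["A. Smith", "B. Jones"], "abstract": "sparse coding in vision"},
    {"year": 2004, "authors": ["B. Jones", "C. Wu"],    "abstract": "topic models for text"},
    {"year": 2009, "authors": ["A. Smith", "C. Wu"],    "abstract": "sparse topic models"},
]

term = "sparse"
hits_per_year = Counter(r["year"] for r in records if term in r["abstract"])

coauthorships = defaultdict(int)
for r in records:
    for pair in combinations(sorted(r["authors"]), 2):
        coauthorships[pair] += 1

print(f"'{term}' mentions by year:", dict(hits_per_year))
print("co-authorship counts:", dict(coauthorships))
```

Scaled from three invented records to millions of real articles, the same kind of bookkeeping underlies the analyses of content, citation, and collaboration discussed below.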

9. The Dataverse Network, http://TheData.org. 10. C. C. Aggarwal, P. S. Yu, Eds., Privacy-Preserving Data Mining: Models and Algorithms (Springer, New York, 2008). 11. L. Sweeney, J. Law Med. Ethics 25, 98 (1997). 12. G. King, Sociol. Methods Res. 36, 173 (2007). 13. M. Altman, G. King, D-Lib 13, 10.1045/march2007altman (2007). 14. G. King, PS Pol. Sci. Polit. 28, 494 (1995). 15. R. G. Anderson, W. H. Green, B. D. McCullough, H. D. Vinod, J. Econ. Methodol. 15, 99 (2008). 16. DATA-Pass, www.icpsr.umich.edu/icpsrweb/DATAPASS/. 17. V. Stodden, Int. J. Comm. Law Pol. 13, 1 (2009). 18. My thanks to M. Altman and M. Crosas for helpful comments on an earlier version.

10.1126/science.1197872


Patterns of Scientific Content The analysis of explicit knowledge content has a long history. Content analysis, or assessment of the frequency and co-appearance of words, phrases, and concepts throughout a text, has been pursued since the late 1600s, ranging from efforts in 18thcentury Sweden to quantify the heretical content of a Moravian hymnal (6) to mid–20th-century studies of mass media content in totalitarian regimes. Contemporary approaches focus on the computational identification of “topics” in a corpus of texts. These can be tracked over time, as in a recent study of the news cycle (7). “Culturomics” projects now follow topics over hundreds of years, using texts digitized in the Google Books project (3). Topics can also be used to identify similarities between documents, as in topic modeling, which represents documents statistically as unstructured collections of “topics” or phrases (8). With the rise of the Internet and computing power, statistical methods have also become central to natural language processing (NLP), including information extraction, information retrieval, automatic summarization, and machine reading. Advances in NLP have made it one of the most rapidly growing fields of artificial intelligence. Now that the vast majority of scientific publications are produced electronically (5), they are natural objects for topic modeling (9) and NLP. Some recent work, for example, uses computational parsing to extract relational claims about genes and proteins, and then compares these claims across hundreds of thousands of papers to reconcile contradictory results (10) and identify likely “missing” elements from molecular pathways (11). In such fields as biomedicine, electronic publications are further enriched with structured metadata (e.g., keywords) organized into hierarchical ontologies to enhance search (12). Citations have long been used in “scientometric” investigations to explore dependencies among


claims, authors, institutions, and fields (13). Search data produce another trace that can be analyzed (14). In public health, the changing tally of influenzarelated Google searches has been used to predict emerging flu epidemics faster than can be accomplished by public health surveillance (15). Similar analysis could predict emerging research topics and fields. The rise in scientific review articles and the concomitant explosion of scientific publications over the past century trace a growing supply and demand for the focused assessment and synthesis of research claims. As the number of analyses investigating a particular claim has become unmanageable [e.g., the efficacy of extrasensory perception (16); the influence of class size on student achievement (17); the role of b-amyloid in Alzheimer’s disease (18)], researchers have increasingly engaged in meta-analysis—counting, weighting, and statistically analyzing a census of published findings on the topic (19, 20). Whereas “the combination of observations” had been the central focus of 19th-century statistics (21), the


combination of findings across articles was first formulated by Karl Pearson (22) regarding the efficacy of inoculation and later by Ronald Fisher in agricultural research (23). By the mid-20th century, with burgeoning scientific literatures, meta-analysis rapidly entered medicine, public health, psychology, and education. Scientists performing meta-analyses were forced to confront the “file-drawer problem”: Negative and unpublishable findings never leave the “file drawer” (19, 24). Indeed, all approaches to the analysis of explicit knowledge content aim to discover heretofore hidden regularities such as the file-drawer problem. These regularities in turn reveal the effects of implicit scientific content: detectable but inexplicit beliefs and practices. Implicit Preferences, Heuristics, and Assumptions Such implicit content includes a range of factors, from unstated preferences, tastes, and beliefs to the social processes of communication and citation. The file-drawer problem, for example, is driven by the well-attested preference for publishing

positive results (18, 25) and statistical findings that exceed arbitrary, field-specific thresholds (26). Such preferences may lead to a massive duplication of scientific effort through retesting doomed hypotheses. Magnifying their effect is a trend toward agreement with earlier results, which leads scientists to censor or reinterpret their data to be consistent with the past. Early results thus fossilize into facts through a cascade of positive citation, forming “microparadigms” (27, 28). In choosing what parts of past knowledge to certify through positive citation, scientists are likely to accept authors with a history of success more readily. Scientific training further disciplines researchers to focus on established hubs of knowledge (29), with most articles shunning novelty to examine popular topics with popular methods. Even high-impact journals prefer publications on “hot topics”—albeit those using less popular methods (30). Somewhat mitigating the trend toward assent and convergence, scientists often attempt to counter or extend high-profile research. This is particularly true for research staking novel claims

[Figure 1 panels (not reproduced here) compare the perception, evaluation, and comparison available to a new student, a leading scientist, and a computer system, illustrated with example author rankings and citation statistics.]

Fig. 1. Readers vary in the information they extract from an article. A new graduate student perceives a tiny fraction of available information, focusing on familiar authors, terms, references, and institutions. Her evaluation is limited to categorical classification (e.g., of the authors) into known and unknown (“important” and “unimportant”). For comparison she has the small collection of papers she has read. A leading scientist perceives a wealth of latent data, assembling individuals into mentorship relations and locating terms, as well as graphical and mathematical idioms, in historical and theoretical context. His evaluations generate rank orders based on his experience in the field. He can compare a paper to thousands, and searches a large literature efficiently. An appropriately trained computer would complement this expertise with quantification and scale. It can rapidly access quantitative and relational information about authors, terms, and institutions, and order these items along a range of measures. For comparison it can already access a large fraction of the scientific literature—millions of articles and an increasing pool of digitized books; in the future it will scrape further data from Web pages, online databases, video records of conferences, etc.


and can lead to the rapid alternation of conflicting findings known as the “Proteus” phenomenon (31). High-profile research attracts “more eyeballs,” and a reliable negation of its findings will often attract considerable interest (32). Indeed, the individual incentive to publish in prestigious journals may itself be a distorting preference, potentially leading to higher incidence of overstated results (25, 33).

The foregoing discussion of implicit content traces the essential tension (34) between tradition and originality. Rather than assuming that these forces always and everywhere resolve in equilibrium, metaknowledge investigations have begun to model their relative strength in different fields and recalibrate scientific certainty in those findings. By making this implicit content available to researchers, metaknowledge could inform individual strategies about research investment, pointing out overgrazed fields where crowding leads to diminishing returns and forgotten opportunities where premature certainty has halted promising investigation.

These studies use simple analysis of explicit content to infer preferences and biases. More powerful methods of natural language processing and statistical analysis will be essential for revealing subtle content. Scientific documents will likely demonstrate a range of cognitive shortcuts, such as the availability heuristic, in which data and hypotheses are weighted on the basis of how easily they come to mind (35, 36). Although such heuristics are individually “irrational” because they violate normative theories of probability and decision-making, they can be beneficial to science overall. For example, the vast majority (91%) of geologists who had worked on Southern Hemisphere samples supported the theory of continental drift, versus 48% of geologists who had not. Because Southern Hemisphere data supporting drift were more available to them, these geologists “irrationally” overweighted these data and hence became the core community that went on to build the case for plate tectonics (36). Identifying the distribution of these heuristics across scientific investigations will allow consideration of their consequences, expose possible bias, and recalibrate scientific certainty in particular propositions (10).

Moreover, subtle but systematic regularities across articles within a scientific domain may signal the presence of “ghost theories”: unstated assumptions, theories, or disciplinary paradigms that shape the type of reasoning and evidence deemed acceptable (37). For example, because it is widely supposed that most human psychological properties are universal, results from “typical” experimental participants (American undergraduates, in 67% of studies published in the Journal of Personality and Social Psychology) are often extended to the entire species (38). A recent meta-analysis (38) demonstrated that this assumption is false in several domains, including fundamental ones such as perception, and recommends expensive changes in sampling to correct for the resulting bias. Cognitive anthropologists (39), feminists (40), and ethnic studies scholars (41) have also pointed to examples of knowledge and ways of knowing that are distinct to social groups.

Fig. 2. Research dynamics as manifest in collaboration networks. The graphic represents the network of researchers who were co-authors of papers concerning the two largest biochemical clusters in PubMed—critical organic molecules (e.g., calcium, potassium, ATP) and neurotransmitters (e.g., norepinephrine, serotonin, acetylcholine)—during the period 1995 to 2000. Researchers are connected by a link if they have collaborated on at least one article; core researchers (shown here) have 25+ collaborators. Magenta nodes correspond to researchers who were primarily authors of papers concerning biochemicals within the largest cluster (critical organic molecules) and green nodes to authors primarily publishing in the second largest cluster (neurotransmitters). All other clusters are ignored. Dark blue nodes denote researchers, the majority of whose articles link the two chemical clusters. Light blue nodes are extreme “innovators” who publish mostly on new chemicals. The graphic is laid out using the Fruchterman-Reingold algorithm (63), which treats the network as a physical system, minimizing the energy if nodes are programmed to repel one another and links to draw them together like springs. The magenta and green nodes form relatively tight groups, indicating densely interlinked collaboration, whereas most interdisciplinary researchers occupy the boundary region between them. Extreme innovators (e.g., in the lower right) are marginal. This illustrates a suggestive homology between the semantic and social structure of science. The figure does not attempt to disentangle the relative primacy of social or semantic structure.
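A rough sketch of the kind of collaboration-network analysis behind Fig. 2 is shown below, using the networkx library on a small invented co-authorship list; the names, the edges, and the use of modularity-based communities as a stand-in for the figure's chemical clusters are all assumptions made for illustration.

```python
# Toy collaboration-network sketch in the spirit of Fig. 2 (invented data).
import networkx as nx

edges = [("Ana", "Ben"), ("Ana", "Cai"), ("Ben", "Cai"),   # one tight cluster
         ("Dia", "Eli"), ("Dia", "Fen"), ("Eli", "Fen"),   # a second cluster
         ("Cai", "Dia")]                                   # a bridging collaboration

G = nx.Graph()
G.add_edges_from(edges)

pos = nx.spring_layout(G, seed=1)   # Fruchterman-Reingold force-directed layout
communities = nx.algorithms.community.greedy_modularity_communities(G)
bridge = max(nx.betweenness_centrality(G).items(), key=lambda kv: kv[1])

print("communities:", [sorted(c) for c in communities])
print("most 'between' author:", bridge)
print("layout position of that author:", pos[bridge[0]].round(3))
```

In this toy network the two authors whose single shared paper links the clusters end up with the highest betweenness, mirroring the interdisciplinary researchers that occupy the boundary region in the real figure.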


Computation can assist in the large-scale hunt for regularities (such as the frequent appearance of “undergraduate” in participant descriptions) that signal the presence of ghost theories, even when untied to pre-identifiable groups, and could eventually help to identify these unwritten axioms, opening them up to public debate and systematic testing. Knowledge Context Scientific documents contain both explicit and implicit content, both of which interact powerfully with the context of their production (42). For example, the reliability of a result increases if it is produced in several disparate labs rather than a few linked by shared methods or shared mentorship. Scientific training likely places a long-lasting stamp on a researcher (43). This suggests that tracing knowledge transmission from teacher to student could reveal much about the spread and entrenchment of ideas and practices. The changing organization of research also shapes research content, with teams increasingly producing the most highly cited research (44). Studies of the structure of collaboration networks (45–47) reveal intriguing disciplinary differences, but researchers are just beginning to explore the impact of teams and larger networks on the creation, diffusion, and diversity of knowledge content. Historians have documented instances in which research dynasties and larger institutions influence scientific knowledge (48–50), but the extraction of metaknowledge on the distribution of these influences would enable estimation of their aggregate capacity to channel the next generation of research. Figure 2 hints at the importance of understanding the relationship between social and scientific structures by showing how chemists and biologists whose collaborations bridge scientific subgroups tend to investigate reactions that themselves bridge distinct clusters of molecules. Investigation of this phenomenon should simultaneously explore the degree to which chemical structure shapes the social structure of investigation, and vice versa. Teams and collaboration networks are embedded in larger institutional structures. Peer review, for example, is central to scientific evaluation and reward. Theoretical work on peer review, however, suggests potential inefficiencies and occasions for knowledge distortion (51, 52). Research also occurs in specific physical settings: universities, institutes, and companies that vary in prestige, access to resources, and cultures of scientific practice. Institutional reputations likely color the acceptance of research findings. Indeed, recent work on multi-institutional collaboration demonstrates that institutions tend to collaborate with others of similar prestige, potentially exacerbating this effect (4). One underexplored issue regards the influence of shared resources—databases, accelerators, telescopes—on the organization of related research and the pace of advance. Given the capital intensity of much contemporary science, this question is critical for science


policy. It also underscores the broader role of funding in the direction and success of research programs. For example, can focused investment unleash sudden breakthroughs, or is the slow development of community, shared culture and a toolkit more important to nurture a flow of discoveries? And what of private patronage? There is evidence from metaknowledge that embedding research in the private or public sector modulates its path. Company projects tend to eschew dogma in an impatient hunt for commercial breakthroughs, leading to rapid but unsystematic accumulation of knowledge, whereas public research focuses on the careful accumulation of consistent results (53). The production of scientific knowledge is embedded in a broader social and technological context. A wealth of intriguing results suggests that this is a fruitful direction for further metaknowledge work. In biomedical research, for example, social inequalities and differences in media exposure (54) partially determine research priorities. Organized and well-connected groups lobby for increased investment in certain diseases, drawing resources away from maladies that disproportionately affect the poor and those in impoverished countries. The dearth of biomedical knowledge relevant to poorer countries is likely exacerbated by lack of access to science, as revealed by decreased citations to commercial access publications in poor countries (55). The long-term impact of the Internet and related technologies on scientific production remains unclear. Early results indicate that online availability not only allows researchers to discover more diverse scientific content, but also makes visible other scientists’ choices about what is important. This feedback leads to faster convergence on a shrinking subset of “superstar” articles (5). The Web also hosts radical experiments in the dissemination of scientific practices, such as myExperiment (56). It has fostered “open source” approaches to collaborative research, such as the Polymath project (57) and Zooniverse (58), and the provision of real-time knowledge services, such as the computational knowledge engine WolframAlpha (59). Online projects create novel opportunities to generate collaborations and data, but digital storage may also render their results more ephemeral (60). Understanding how social context and emerging media interact with scientific content will enable reevaluation not only of the science, but of the strengths and liabilities these new technologies hold for public knowledge. Why Metaknowledge? Why Now? The ecology of modern scientific knowledge constitutes a complex system: apparently complicated, involving strong interaction between components, and predisposed to unexpected collective outcomes (61). Although science has ever been so, the growing number of global scientists, increasingly connected via multiple channels—international


conferences, online publications, e-mail, and science blogs—has increased this complexity. Rising complexity in turn makes the changing focus of research and the resolution of consensus less predictable. The informatics turn in the sciences offers a unique opportunity to mine existing knowledge for metaknowledge in order to identify and measure the complex processes of knowledge production and consumption. As such, metaknowledge research provides a high-throughput complement to existing work in social and historical studies of science by tracing the distribution and relative influence of distinct social, behavioral, and cognitive processes on science. Metaknowledge investigations will miss subtle regularities accessible to deep, interpretive analysis, and should draw on such work for direction. We argue, however, that some regularities will only be identifiable in the aggregate, especially those involving interrelations between competing processes. Once identified, these could become fruitful subjects for interpretive investigation. Successfully executing the more ambitious parts of the metaknowledge program will require further improvements in machine reading and inference technologies. Systematic analysis of some elements of scientific production will remain out of reach (62). Nonetheless, as metaknowledge grows in sophistication and reliability, it will provide new opportunities to recursively shape science—to use measured biases, revealed assumptions, and previously unconsidered research paths to revise our confidence in bodies of knowledge and particular claims, and to suggest novel hypotheses. The computational production and consumption of metaknowledge will allow researchers and policy-makers to leverage more scientific knowledge—explicit, implicit, contextual— in their efforts to advance science. This will become essential in an era when so many investigations are linked in so many ways. References and Notes 1. D. R. Swanson, Bull. Med. Libr. Assoc. 78, 29 (1990). 2. D. R. Swanson, N. R. Smalheiser, Artif. Intell. 91, 183 (1997). 3. J.-B. Michel et al., Science 331, 176 (2011); 10.1126/science.1199644. 4. B. F. Jones, S. Wuchty, B. Uzzi, Science 322, 1259 (2008); 10.1126/science.1158357. 5. J. A. Evans, Science 321, 395 (2008). 6. K. Dovring, Public Opin. Q. 18, 389 (1954). 7. J. Leskovec, L. Backstrom, J. Kleinberg, in Proceedings of the 15th Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining, International Conference on Knowledge Discovery and Data Mining, Paris, 28 June to 1 July 2009. 8. T. L. Griffiths, M. Steyvers, Proc. Natl. Acad. Sci. U.S.A. 101 (suppl. 1), 5228 (2004). 9. K. K. Mane, K. Börner, Proc. Natl. Acad. Sci. U.S.A. 101 (suppl. 1), 5287 (2004). 10. A. Rzhetsky, T. Zheng, C. Weinreb, PLoS ONE 1, e61 (2006). 11. C. J. Krieger et al., Nucleic Acids Res. 32 (database issue), D438 (2004). 12. A. H. Renear, C. L. Palmer, Science 325, 828 (2009). 13. B. Cronin, H. B. E. Atkins, The Web of Knowledge: A Festschrift in Honor of Eugene Garfield (Information Today Inc., Medford, NJ, 2000).


2011 Data Collections Booklet SPECIALSECTION 14. J. Bollen et al., PLoS ONE 4, e4803 (2009). 15. J. Ginsberg et al., Nature 457, 1012 (2009). 16. J. Pratt, J. Rhine, B. Smith, C. Stuart, J. Greenwood, Extra-Sensory Perception After Sixty Years: A Critical Appraisal of the Research in Extra-Sensory Perception (Holt, New York, 1940). 17. G. V. Glass, M. L. Smith, Educ. Eval. Policy Anal. 1, 2 (1979). 18. S. A. Greenberg, BMJ 339, b2680 (2009). 19. J. E. Hunter, F. L. Schmidt, Methods of Meta-Analysis: Correcting Error and Bias in Research Findings (Sage, Thousand Oaks, CA, ed. 2, 2004). 20. Cochrane Collaboration Reviewer’s Handbook, www.cochrane.org/training/cochrane-handbook. 21. S. M. Stigler, The History of Statistics: The Measurement of Uncertainty Before 1900 (Belknap/Harvard Univ. Press, Cambridge, MA, 1986). 22. K. Pearson, Br. Med. J. 3, 1243 (1904). 23. K. O’Rourke, J. R. Soc. Med. 100, 579 (2007). 24. L. V. Hedges, Stat. Sci. 7, 246 (1992). 25. J. P. Ioannidis, PLoS Med. 2, e124 (2005). 26. A. S. Gerber, N. Malhotra, Sociol. Methods Res. 37, 3 (2008). 27. A. Rzhetsky, I. Iossifov, J. M. Loh, K. P. White, Proc. Natl. Acad. Sci. U.S.A. 103, 4940 (2006). 28. U. Shwed, P. S. Bearman, Am. Sociol. Rev. 75, 817 (2010). 29. M. Cokol, I. Iossifov, C. Weinreb, A. Rzhetsky, Nat. Biotechnol. 23, 1243 (2005). 30. M. Cokol, R. Rodriguez-Esteban, A. Rzhetsky, Genome Biol. 8, 406 (2007). 31. J. P. Ioannidis, T. A. Trikalinos, J. Clin. Epidemiol. 58, 543 (2005).

32. M. Cokol, I. Iossifov, R. Rodriguez-Esteban, A. Rzhetsky, EMBO Rep. 8, 422 (2007). 33. J. P. Ioannidis, JAMA 294, 218 (2005). 34. T. S. Kuhn, The Essential Tension: Selected Studies in Scientific Tradition and Change (Univ. of Chicago Press, Chicago, 1977). 35. A. Tversky, D. Kahneman, Science 185, 1124 (1974). 36. M. Solomon, Philos. Sci. 59, 439 (1992). 37. D. L. Smail, On Deep History and the Brain (Univ. of California Press, Berkeley, CA, 2008). 38. J. Henrich, S. J. Heine, A. Norenzayan, Behav. Brain Sci. 33, 61 (2010). 39. M. E. Harris, Ed., Ways of Knowing: New Approaches in the Anthropology of Experience and Learning (Berghahn, Oxford, 2007). 40. E. Martin, Signs 16, 485 (1991). 41. R. Barnhardt, Anthropol. Educ. Q. 36, 8 (2005). 42. K. Knorr-Cetina, Epistemic Cultures: How the Sciences Make Knowledge (Harvard Univ. Press, Cambridge, MA, 1999). 43. R. Collins, The Sociology of Philosophies: A Global Theory of Intellectual Change (Belknap/Harvard Univ. Press, Cambridge, MA, 1998). 44. S. Wuchty, B. F. Jones, B. Uzzi, Science 316, 1036 (2007); 10.1126/science.1136099. 45. M. E. J. Newman, Phys. Rev. E 64, 016131 (2001). 46. M. E. J. Newman, Phys. Rev. E 64, 016132 (2001). 47. M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. 101 (suppl. 1), 5200 (2004). 48. D. J. Kevles, The Physicists: The History of a Scientific Community in Modern America (Knopf, New York, ed. 1, 1978).

Access to Stem Cells and Data: Persons, Property Rights, and Scientific Progress

Debra J. H. Mathews,* Gregory D. Graff, Krishanu Saha, David E. Winickoff

Many fields have struggled to develop strategies, policies, or structures to optimally manage data, materials, and intellectual property rights (IPRs). There is growing recognition that the field of stem cell science, in part because of its complex IPRs landscape and the importance of cell line collections, may require collective action to facilitate basic and translational research. Access to pluripotent stem cell lines and the information associated with them is critical to the progress of stem cell science, but simple notions of access are substantially complicated by shifting boundaries between what is considered information versus material, person versus artifact, and private property versus the public domain.

Access to data and materials is critical to the progress of science generally (1–3), but plays a particularly important role in stem cell science. In the field of pluripotent stem cell research, data and information associated with cell lines are essential to their utility and management. A number of factors currently limit the sharing of data and materials in the field, including strategic behavior of individual scientists, ethical intricacies in using human cell lines, and a complex landscape of intellectual property rights (IPRs). Efforts to manage property relations and develop an accountable system of stewardship to handle the ethical and legal obligations to materials donors could help ameliorate these obstacles, benefiting academic and industry scientists and promoting both basic and translational research.

Difficulties in procuring and managing human cells arise because they are increasingly transcending three distinctions drawn in ethics, law, and common practice—distinctions between information and materials, persons and artifacts, and private property and the public domain (Fig. 1). Some allowance for blurring between and across these conventional distinctions can be useful and productive for science, but it also creates substantial challenges. For example, a cell line is much richer as a research tool by virtue of its connection to an individual human, but that connection also raises privacy concerns if the tool is shared.

49. T. Lenoir, Instituting Science: The Cultural Production of Scientific Disciplines (Stanford Univ. Press, Stanford, CA, 1997). 50. P. Coffey, Cathedrals of Science: The Personalities and Rivalries That Made Modern Chemistry (Oxford Univ. Press, London, 2008). 51. S. Allesina, http://arxiv.org/abs/0911.0344 (2009). 52. S. Thurner, R. Hanel, http://arxiv.org/abs/1008.4324 (2010). 53. J. A. Evans, Soc. Stud. Sci. 40, 757 (2010). 54. E. M. Armstrong, D. P. Carpenter, M. Hojnacki, J. Health Polit. Policy Law 31, 729 (2006). 55. J. A. Evans, J. Reimer, Science 323, 1025 (2009). 56. myExperiment, www.myexperiment.org. 57. The Polymath Blog, http://polymathprojects.org. 58. Zooniverse, www.zooniverse.org/home. 59. WolframAlpha, www.wolframalpha.com. 60. E. Evangelou, T. A. Trikalinos, J. P. Ioannidis, FASEB J. 19, 1943 (2005). 61. R. Foote, Science 318, 410 (2007). 62. H. M. Collins, Sci. Stud. 4, 165 (1974). 63. T. M. J. Fruchterman, E. M. Reingold, Softw. Pract. Exper. 21, 1129 (1991). 64. This research benefited from NSF grant 0915730 and responses at the U.S. Department of Energy’s Institute for Computing in Science (ICiS) workshop “Integrating, Representing, and Reasoning over Human Knowledge: A Computational Grand Challenge for the 21st Century.” We thank K. Brown, E. A. Cartmill, M. Cartmill, and two anonymous reviewers for their detailed and constructive comments on this essay.


Fig. 1. Blurred boundaries in human stem cell research.

Work to Date, Work to Do

Other research communities in the life sciences have experienced problems similar to those outlined above, and some have addressed them through new institutions such as public DNA sequence databases, tissue banks, and mouse repositories (3). Within stem cell science, there are important efforts under way to improve access to both cell lines and associated information [e.g., UK Stem Cell Bank, European Human Embryonic Stem Cell Registry (hESCreg)] (4). However, these efforts struggle to keep pace with evolving data requirements and the proliferation of new induced pluripotent stem cell lines, and they do not provide all of the information about cell lines that scientists may need. Existing banks and repositories also suffer unnecessary overlap and duplication of effort and lack interoperability, limiting their utility to scientists (3, 5, 6). For example, a recently issued statement by the Hinxton Group (7) notes that problems with access are twofold, involving access both to cell lines themselves and to critical information about those lines, including technical, provenance, and IPRs characteristics. They note, "… many cell lines are being derived and characterized, though not all lines are being published in the literature, even in the academic sector. Furthermore, useful cell lines created from human materials (especially those created with public funds) and their associated data should be distributed and used widely, constrained only by the wishes of the materials' donors" (7). In addition, the statement observes that we currently have "a situation in which even a diligent stem cell researcher or entity that wishes to respect IPRs will face considerable uncertainty and enormous costs if they try to survey the IPRs landscape" (7). To address these concerns, the Hinxton Group recommended developing publicly available electronic hubs for accessing a range of relevant data linked to individual stem cell lines. The vision is to coordinate existing resources—including the UK Stem Cell Bank, the NIH Human Embryonic Stem Cell Registry, the International Stem Cell Registry at the University of Massachusetts, the European Human Embryonic Stem Cell Registry, and the International Stem Cell Banking Initiative—together with the patent landscaping efforts of the Japan Patent Office and the UK Intellectual Property Office, and academic efforts such as the Stanford Program on Stem Cells in Society (7).

When designing resources to facilitate access to information and materials, careful thought must be given to the kinds of data needed and to how the coexistence of different categories of data may influence a resource's utility. Public provision of some kinds of data (e.g., identifying patient information and proprietary information) would create substantial ethical and economic concerns; proper provision will require creative solutions to both legal and technical challenges. The three crucial boundaries—between information and materials, persons and artifacts, and private property and the public domain—have become highly dynamic, shifting relative to one another as the field advances. If the task of developing common material and data resources is conceptualized as one of facilitating and managing these three dualisms, it may help explain why existing resources have had difficulty gaining traction. Such a perspective also highlights the kinds of functions these resources must perform to be really useful. The challenge is to allow productive blurring between these categories (e.g., tight connections between information and material) while maintaining critical boundaries (e.g., protecting donor privacy). Any successful architecture must manage these tensions capably.

Information Versus Materials

Today, a strong and durable cut between what constitutes "data" and what constitutes "materials" cannot be easily made. For example, the recent federal court decision invalidating a patent to Myriad Genetics on a gene associated with breast cancer (8) found that the informational nature of DNA, not its "physical embodiment," was more meaningful in considering the validity of Myriad's patent. With DNA, productive use of the material is, today, inextricably linked to the information it embodies. A similar principle can be applied to human cell lines. Technical data (e.g., derivation, culture history, genetic and epigenetic characteristics), donor information (e.g., provenance, medical history, family history), and ownership information [e.g., IPRs, Material Transfer Agreements (MTAs), contract conditions] together determine the utility of a cell line. For example, if a donor has, through an informed-consent document, agreed to only a limited set of uses of their cells, that information must always accompany those cells and govern their use. Furthermore, because the cells are alive, what a particular cell line "is" today may be different from what it "is" at a different point in the future. Although a cell line does not change categorically (it is the same cell line), it does change qualitatively (e.g., at the genetic or epigenetic level), and those simultaneous changes in material and data are real, significant, and inextricably linked. Other associated information may also change—such as medical events experienced by the donor or changes in ownership rights relevant to the line—and may affect the utility of that line. For example, noting that a donor responded to a particular small-molecule therapy could make the corresponding cell line useful for genetic and biochemical studies on mechanism of action.

Persons Versus Artifacts

The connection to human donors complicates how we regard human cell lines. The lines are simultaneously human-made research tools and entities derived from individual human beings. The cells contain a donor's complete genome, and information critical to the use of the artifact, or discovered in the course of work with that artifact, will constitute information about that person. Furthermore, for disease-specific pluripotent cell lines, clinical and phenotypic information about the materials donor is increasingly collected. For all cell lines, provenance information is only of interest because informed-consent documents outline researchers' contractual obligations to the donor as a legal person. Recent cases make it clear that what happens to human research materials, especially "immortalized cell lines," can be deeply meaningful to those from whom the materials have come. A recent book about Henrietta Lacks (9) and the creation of the HeLa cell line has spurred new discussions about informed consent, commercial use, and, importantly, the continuity between human research subjects and the cells derived from their bodies. Similar issues were raised in a dispute between the Havasupai tribe and Arizona State University, in which tribal members alleged that blood samples were used in genetic studies for which they had not given permission (10).

Researchers currently protect the confidentiality of donors' data by one of two methods: anonymization, in which identity is irreversibly severed from the material to prevent any future re-identification, or de-identification, in which coded and linked identifying information is retained separately from the material. As the paradigm of single-center cell banking is increasingly replaced by multisite arrangements involving institutions across many jurisdictions, there are concerns that these confidentiality practices may be insufficient (5) and that the melding of persons and artifacts may lead to unprecedented violations of privacy and consent (11).

Private Property Versus Public Domain

Following trends seen elsewhere in the sciences, stem cell researchers—and the companies or universities for which they work—are increasingly taking private ownership of early-stage technologies, including cell lines, genes, and associated data (Fig. 2). Simultaneously, researchers in the field draw upon a common repository of knowledge and technologies considered to be in the public domain. However, the boundary demarcating what is public from what is private has become fluid and ill defined. In theory, that which is in the public domain is defined by the absence of private legal claims of ownership and control. Thus, to the extent that private property is not clearly demarcated, the public domain is left ill defined.

In stem cell science, the landscape of IPRs is complex, and its boundaries are fuzzy. IPRs are not uniformly recognized, registered, or enforced globally: A technology may be closely held by a private owner in one country and effectively left in the public domain in others. Different, inseparable aspects of a technology may be subject to separate property claims; for example, patents could cover both the process to create pluripotent cells and the reagents necessary to do so. Moreover, multiple, narrow claims over interdependent aspects (or uses) of given technologies can create dense thickets of ownership claims that are costly to negotiate and transact. This can be particularly problematic when information and materials are inseparable, because for a given technology, they and their uses may be protected separately—even by different types of property right. Materials may be owned as physical property. Methods of derivation or propagation may be patented. Associated information may simply be held as confidential or maintained in a private database. The complexity of IPRs claims can also be compounded by the person-artifact dualism that characterizes human cell lines: In addition to any third-party IPRs claims over a cell line, methods associated with its derivation, or uses of it as an artifact, the donor may have legitimate personal property claims over the cells or their use, as well as privacy rights over associated information.

Full information and well-defined property rights are necessary conditions for markets to function efficiently. When these conditions are not met in the market for stem cell technologies, search costs, transaction costs, and risk are imposed and detract from the incentives that IPRs are intended to provide. What is needed is the most reliable information possible on where stem cell–related IPRs have and have not been claimed, held valid, and remain in force. This will, moreover, enhance the reliability of the public domain.

Fig. 2. Stem cell patents and patent applications published by various patent offices, 1989 to 2009 (y axis: documents). The chart tracks new international patent families (INPADOC) and US/WO applications alongside filings at the EPO and in Japan, China, Australia, Korea, Canada, and Germany. Data source: Thomson Innovation (2010), queried using methods of Bergman and Graff (13).

A Promising Start

The proposed information and materials hubs for stem cell research (7) may be a solution. Such hubs could improve access to data and materials generally and serve a gate-keeping function for access by various stakeholders, being mindful of ethics and IPRs concerns and providing a means of mediating the complex and blurred distinctions described above. The construction of such a resource could begin with developing a centralized portal for access to existing resources, one that aggregates key data characterizing their available materials. Additional features, such as IPRs information and provenance and consent characteristics, could be added as funding becomes available to support the necessary programming and database research. The integration of these kinds of information would produce the greatest value added to the community (12). Critical to the success of such a resource are commitments from the members of the scientific community to contribute to and curate the information in the resource, and from funding agencies to support the work.

There are substantial challenges in developing such a hub, however, including who will fund it, who will do the work, what the resource will look like, where it will reside administratively, and how the various blurred distinctions will be facilitated and managed in practice. We need to think critically about the design of data architectures that provide gate-keeping functions for access by various stakeholders, such as where restrictions on use or existing IPRs require formal negotiations and legal agreements. Expertise needs to be developed across these domains, with fresh thinking about the dualisms and how to manage them. Further research may be needed to learn how potential donors view their tissues, their relationship to those tissues following donation, and their own rights relative to third-party IPRs claims over those tissues. These issues are increasingly prevalent across biomedical science, including biobanking, genetics and genomics, and personalized medicine. Community resources of this sort are emerging as necessary infrastructure of the scientific enterprise, no longer an aberration or exception to the rule, but rather the way research communities must function to move forward.
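To make the hub concept above concrete, the sketch below shows, in schematic form, the kind of linked record such a resource would have to manage and gate; the field names, example values, and consent check are hypothetical illustrations rather than a proposed standard.

from dataclasses import dataclass, field
from typing import List

@dataclass
class CellLineRecord:
    # Hypothetical hub entry linking a line's technical, consent, and IPR data.
    line_id: str
    technical: dict = field(default_factory=dict)        # derivation, culture history, genetics
    consent_allowed_uses: List[str] = field(default_factory=list)
    ipr_claims: List[str] = field(default_factory=list)  # patents, MTAs, contract terms

    def access_permitted(self, proposed_use: str) -> bool:
        # Gate-keeping check: a request passes only if the donor's informed
        # consent covers the proposed use; IPR terms would then be negotiated
        # separately with the relevant rights holders.
        return proposed_use in self.consent_allowed_uses

record = CellLineRecord(
    line_id="EXAMPLE-001",                                 # placeholder identifier
    technical={"derivation": "iPSC, episomal reprogramming (placeholder)"},
    consent_allowed_uses=["basic research"],
    ipr_claims=["MTA with depositing institution (placeholder)"],
)
print(record.access_permitted("basic research"))           # True
print(record.access_permitted("commercial use"))           # False

The design point is simply that consent, provenance, and ownership information travel with the material's record and are consulted before access is granted.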

Author affiliations: 1Johns Hopkins Berman Institute of Bioethics, Deering Hall, 208, 1809 Ashland Avenue, Baltimore, MD 21205, USA. 2Department of Agricultural and Resource Economics, B328 Clark Hall, Colorado State University, Fort Collins, CO 80523, USA. 3The Whitehead Institute for Biomedical Research, 9 Cambridge Center, Cambridge, MA 02142, USA. 4Department of Environmental Science, Policy, and Management, 115 Giannini Hall, University of California, Berkeley, CA 94720, USA.
*To whom correspondence should be addressed. E-mail: [email protected]

References
1. Editor, Nature 461, 145 (2009).
2. D. Field et al., Science 326, 234 (2009).
3. P. N. Schofield et al., CASIMIR Rome Meeting participants, Nature 461, 171 (2009).
4. G. Stacey, C. J. Hunt, Regen. Med. 1, 139 (2006).
5. S. M. Fullerton, N. R. Anderson, G. Guzauskas, D. Freeman, K. Fryer-Edwards, Sci. Transl. Med. 2, 15cm3 (2010).
6. B. Nelson, Nature 461, 160 (2009).
7. The Hinxton Group, "Policies and Practices Governing Data and Materials Sharing and Intellectual Property in Stem Cell Science" (2011); see www.hinxtongroup.org.
8. Association for Molecular Pathology v. U.S. Patent & Trademark Office, No. 09 Civ. 4515 (Southern District of New York, 29 March 2010).
9. R. Skloot, The Immortal Life of Henrietta Lacks (Crown, New York, 2010).
10. M. M. Mello, L. E. Wolf, N. Engl. J. Med. 363, 204 (2010).
11. I. S. Kohane, P. L. Taylor, Sci. Transl. Med. 2, 37cm19 (2010).
12. Winickoff et al., Yale J. Health Policy Law Ethics 9, 52 (2009).
13. K. Bergman, G. D. Graff, Nat. Biotechnol. 25, 419 (2007).

10.1126/science.1201382


PERSPECTIVE

On the Future of Genomic Data

Scott D. Kahn

Illumina, 9885 Towne Centre Drive, San Diego, CA 92121, USA.

Many of the challenges in genomics derive from the informatics needed to store and analyze the raw sequencing data that is available from highly multiplexed sequencing technologies. Because single week-long sequencing runs today can produce as much data as did entire genome centers a few years ago, the need to process terabytes of information has become de rigueur for many labs engaged in genomic research. The availability of deep (and large) genomic data sets raises concerns over information access, data security, and subject/patient privacy that must be addressed for the field to continue its rapid advances.

The study of genomics increasingly is becoming a field that is dominated by the growth in the size of data and the responses by the broader scientific community to effectively use and manage the resulting derived information. Genomes can range anywhere from 4000 bases to 670 Gb (1); organisms that reproduce sexually have two or more copies of the genome (ploidy). Humans have two copies of their inherited genome of 3.2 Gb each. Full sequence data has been archived for many thousands of species (2), and more than 3000 humans have been sequenced to some substantial extent and reported in the scientific literature; new sequencing is expanding at an exponential pace.

Output from next-generation sequencing (NGS) has grown from 10 Mb per day to 40 Gb per day on a single sequencer, and there are now 10 to 20 major sequencing labs worldwide that have each deployed more than 10 sequencers (3). Such a growth in raw output has outstripped the Moore's Law advances in information technology and storage capacity, in which a standard analysis requires 1 to 2 days on a compute cluster and several weeks on a typical workstation. It is driving a discussion about the value and definition of "raw data" in genomics, the mechanisms for sharing data, the provenance of the tools that effectively define the derived information, and the nature of community data repositories in the years ahead (Fig. 1). A second challenge is analyzing all these data effectively. The pace of innovation in genomic data creation is much higher than the rate of innovation within genomic informatics; this widening gap must be addressed before the overall field of genomics can take the leap forward that the community has foreseen and that is needed for many applications, spanning from evolution to medicine.

Fig. 1. A doubling of sequencing output every 9 months has outpaced and overtaken performance improvements within the disk storage and high-performance computation fields. [Chart title: "Sequencing Progress vs Compute and Storage; Moore's and Kryder's Laws fall far behind"; series plotted by year on a logarithmic scale: sequencing (kbases/day), microprocessor (MIPS), compact HDD storage capacity (MB).]

Central to the challenge is the definition of raw data. Many current sequencing technologies capture image data for each base being sequenced; these must be parsed into a set of intensities for each of the bases that are subsequently interpreted as a specific base call and an assigned quality value (the likelihood that the base call is correct). The quality values currently represent more storage space than the base. The size of these images for many labs is currently greater than 5 TB of information per day if they are stored; the impracticality of using and archiving image data has motivated the development of real-time processing of the images directly to output only the base calls and the quality values (4). The ability to process the images in near real time has allowed the speed of sequencing to advance independently from the speed of disk storage devices, which would have otherwise been rate limiting. Although there are computational challenges with such near real-time analysis, this processing affords a two-orders-of-magnitude reduction in data needing to be stored, archived, and processed further. Thus, raw data has been redefined to be bases and qualities, although the data formats here are still a source of ongoing development. Newer and under-development "third-generation" sequencing methods also output bases and base qualities (5). Improvements in determining the base from the raw reads, and thus the quality, are ongoing, even though the downstream analysis tools often do not leverage the increased precision of the estimated values. Looking ahead, the ways in which base quality scores are captured, compressed, and archived will optimize storage and improve analysis.
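A minimal sketch, assuming the widely used Phred scale (Q = -10 * log10 of the error probability) and the Sanger/Illumina FASTQ ASCII(+33) encoding, of why a base's quality takes more space to store than the base itself:

import math

def phred_quality(p_error):
    # Phred-scaled quality from an estimated base-call error probability.
    return -10.0 * math.log10(p_error)

def fastq_quality_char(q, offset=33):
    # One ASCII character per base, using the Sanger/Illumina "+33" convention.
    return chr(int(round(q)) + offset)

# A call that is wrong 1 time in 1000 is Q30, stored as the character '?'.
q30 = phred_quality(0.001)
print(q30, fastq_quality_char(q30))      # 30.0 ?

# Storage intuition: a base (A, C, G, or T) carries about 2 bits of
# information, but its quality is kept as a full 8-bit character, which is
# why qualities, not bases, dominate base-plus-quality file sizes.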

The size of the collective data emerges as a major concern as one moves downstream of data creation on the sequencer to the analyses and comparisons that constitute the transition into biologically and/or medically relevant information. For example, the current size of the 1000 Genomes Project (www.1000genomes.org) pilot data, representing a comparative analysis of the full genomes of 629 people, is roughly 7.3 TB of sequence data. The ability to access this data remotely is limited by network and storage capabilities; the download time for a well-connected site within North America will range between 7 and >20 days. Having this data reside within a storage cloud does not entirely mitigate the longer-term challenges, in which aggregation of multiple data stores (data stored within different clouds) will be required to perform the comparative analyses and searching of information that are envisaged. This fundamental inability to move sequence data in its current form argues for considerable changes in format and approach. Without a solution, these downstream informatics challenges will gate advancements of the entire field; a substantial leap in informatics capability and concomitant changes in the definition of data must take place to support movement of the field forward. Centralization of data on the Cloud is a positive start.
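The 7-to->20-day figure can be checked with back-of-the-envelope arithmetic; the sustained link speeds assumed below are illustrative, not values given in the article.

# Rough transfer-time estimate for a ~7.3 TB data set such as the
# 1000 Genomes pilot; the two sustained link speeds are assumptions.
DATA_TB = 7.3
TOTAL_BITS = DATA_TB * 1e12 * 8                 # decimal terabytes -> bits

for label, mbit_per_s in [("100 Mbit/s sustained", 100),
                          ("30 Mbit/s sustained", 30)]:
    days = TOTAL_BITS / (mbit_per_s * 1e6) / 86400
    print(f"{label}: about {days:.1f} days")

# Prints roughly 6.8 and 22.5 days, consistent with the quoted
# "between 7 and >20 days" for a well-connected North American site.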


One approach that is being explored is to move computation to the data rather than moving the data to the computation. This model is made possible through so-called service-oriented architectures (SOAs), which encapsulate computation into transportable compute objects that can be run on computers that store targeted data. SOA compute objects function like applications that are temporarily installed on a remote computer, perform an operation, and then are uninstalled. This solution poses a challenge around how compute costs are shared when the computers performing the work are maintained by the data owners rather than by the researchers performing the analysis. Collaborative compute grids may offer a solution here (6).

There are additional concerns for human data that include data security, protection of subject/patient privacy, and use of the information in a manner consistent with informed consent. Although HIPAA provides guidelines to de-identify the data, researchers have shown that genomic data are inherently identifiable (7) and that additional safeguards are required. This concern resulted in the NIH temporarily removing all access to several genomic databases until the risks to privacy could be evaluated and processes put in place to minimize these risks (8). The passage of the Genetic Information Nondiscrimination Act (GINA) acknowledges some of these fundamental challenges with genomic information and attempts to provide additional regulation so as to discourage inappropriate use of such data.

One proposed solution to minimize data storage is to use reference genomes so that ultimately all that needs to be stored in a new analysis are the differences from the reference. Rather than storing every base being sequenced, only base mutations that are distinct from the reference need to be saved; typically, these differences represent just 0.1% of the data. This large reduction in data size offers a solution to the dilemma around publication of results, even though it departs from the standard of submitting discrete sequence reads. However, with analysis methods still under active development, it may be premature to transition to referential formats. Referential formats can also pose problems with capturing data quality throughout a genome. Knowledge of data quality is most needed when evaluating derived information (such as genomic regions of putative function) in order to provide a contextual basis for the certainty of the assignment (or assignments).
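A minimal sketch of the reference-based idea, using toy sequences (real referential formats for reads and alignments are considerably more involved): store only the positions where a sample differs from the reference and rebuild the full sequence on demand.

def diff_against_reference(reference, sample):
    # Substitution-only differences as (position, sample_base) pairs;
    # assumes equal-length, aligned sequences and ignores indels for brevity.
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

def reconstruct(reference, variants):
    # Rebuild the sample from the reference plus the stored differences.
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)

reference = "ACGTACGTACGTACGT"        # toy reference sequence
sample    = "ACGTACGAACGTACGT"        # toy sample differing at one position

variants = diff_against_reference(reference, sample)
assert reconstruct(reference, variants) == sample
print(variants)                        # [(7, 'A')]

# Keeping roughly one variant per thousand bases (about 0.1% of positions)
# instead of every base is the large reduction in size described above.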

Once the physical challenges in storage and access of genomic data are solved, the issues involving the quality and provenance of the derived information will persist. This is particularly an issue for published works and aggregated databases of derived information if the semantics of the information in the source data change over time; there may be no automatable mechanism to revise conclusions or redact records.

Although there is a widespread focus on human DNA sequencing and its application to improving clinical understanding and outcomes, genomic data can be even more complex. A further problem is that much of the sequencing data being collected is dynamic: it is, and will be, collected at many times, across many tissues, and/or at several collection locations, where standards in quality and data vary or evolve over the lifetime of each datum. Much sequence data, both affecting humans and not, is not of human origin (for example, from viruses, bacteria, and more). The challenges with analysis and comparison across organisms are exacerbated by these issues. Fields such as metagenomics are actively engaged in scoping the data and metadata requirements of the problems being studied, but standards have not yet been agreed upon. The informatics demands of epigenetics data will be more burdensome because of the dynamic nature of gene regulation. Whereas there are ideas being formulated to compress (human) DNA data through the use of the human reference genome as noted above, no such reference exists within the metagenomic and epigenomic fields.

The centrality of reference data and standards to the advancement of genomics belies the limited research investments currently being made in this area. Large intersite consortia have begun to develop standard references and protocols, although a broader call to action is required for the field to achieve its goals (for example, the development of standardized and approved clinical-grade mutation look-up tables). This is an activity that would benefit from input from the broader informatics community; several such interdisciplinary workshops and conferences have been organized, and these are having modest success in capturing a shared focus to address the challenges presented. One exemplar is the current state of electronic medical records (EMRs) and their inability to capture genomic data in a meaningful manner despite widespread efforts to apply sequencing information in order to guide clinical diagnoses and treatment (9–14). These efforts require large cross-functional teams that lack the informatics tools to capture the analysis and diagnostic process (or processes) and thus have limited means to build a shared knowledge base.
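One way to picture the clinical-grade mutation look-up table called for above is as a curated, versioned mapping from a variant key to an interpretation; every identifier, position, and classification below is a hypothetical placeholder rather than a real clinical assertion.

from typing import NamedTuple, Optional

class Variant(NamedTuple):
    chrom: str      # chromosome
    pos: int        # 1-based position on an agreed reference genome
    ref: str        # reference allele
    alt: str        # observed allele

# Hypothetical, versioned table; a real resource would also record evidence,
# provenance, and review status for every entry.
TABLE_VERSION = "toy-0.1"
LOOKUP = {
    Variant("chr1", 123456, "G", "A"): "pathogenic (placeholder)",
    Variant("chr2", 987654, "T", "C"): "benign (placeholder)",
}

def classify(variant: Variant) -> Optional[str]:
    # Returns the curated interpretation, or None for unreviewed variants,
    # so the limits of the table are explicit to downstream tools.
    return LOOKUP.get(variant)

print(TABLE_VERSION, classify(Variant("chr1", 123456, "G", "A")))
print(classify(Variant("chrX", 1, "A", "T")))   # None: not yet reviewed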


Discussions around personalized medicine rarely focus on the data and information challenges, even though these challenges are substantial technically, institutionally, and culturally. Although it is still early to gauge the impact that NGS will have on the practice of medicine, taking action to define and implement a comprehensive, interoperable, and practical informatics strategy seems particularly well timed.

The future of genomic data is rich with promise and challenge. Taking control of the size of data is an ongoing but tractable undertaking. The issues surrounding data publication will persist as long as sequence read data are needed to reproduce and improve basic analyses. Future advances in the use of referential compression (16) will improve data issues, although most of the analysis methods in use will need to be substantially refactored to support the new format. More difficult will be the challenges that emerge with practical curation of the wealth of information derived from genomic data in the years ahead. The nature of derived information used for clinical applications also raises issues around positive and negative controls and what must be stored as part of the medical record. Similarly, the evolution of informatics frameworks (such as EMRs) and scalable informatics implementations (such as SOAs) to handle genomic data will probably be a hard requirement for advancing the biological and medical sciences made possible by the advances in sequencing technologies.

References and Notes
1. http://en.wikipedia.org/wiki/Genome
2. www.ncbi.nlm.nih.gov/genbank/
3. The output of the first 454 sequencer (15) and the current HiSeq 2000 output, assuming 300 Gbase over a 7- to 8-day run.
4. J. Karow, GenomeWeb, 28 July 2009.
5. J. Clarke et al., Nat. Nanotechnol. 4, 265 (2009).
6. https://portal.teragrid.org/
7. N. Homer et al., PLoS Genet. 4, e1000167 (2008).
8. P. Aldhous, PLoS Genet. (2008).
9. E. A. Worthey et al., Genet. Med., PMID: 21173700 (2010).
10. A. N. Mayer et al., Genet. Med., PMID: 21169843 (2010).
11. J. E. Morgan et al., Hum. Mutat. 31, 484 (2010).
12. T. Tucker, M. Marra, J. M. Friedman, Am. J. Hum. Genet. 85, 142 (2009).
13. V. Vasta, S. B. Ng, E. H. Turner, J. Shendure, S. H. Hahn, Genome Med. 1, 100 (2009).
14. M. Choi et al., Proc. Natl. Acad. Sci. U.S.A. 106, 19096 (2009).
15. M. Margulies et al., Nature 437, 326 (2005).
16. M. Hsi-Yang Fritz, R. Leinonen, G. Cochrane, E. Birney, Genome Res., 10.1101/gr.114819.110 (2011).

10.1126/science.1197891


AAAS is here – preparing minority students for careers in science.

Part of AAAS's mission is to strengthen and diversify the scientific work force. To help achieve this goal AAAS partners with NSF to present the Historically Black Colleges and Universities Undergraduate Program, a conference where students from HBCUs get experience presenting their research, networking with peers, meeting with representatives from graduate schools, and learning about career opportunities. As a AAAS member your dues support these efforts. If you're not yet a AAAS member, join us. Together we can make a difference.

To learn more, visit aaas.org/plusyou/hbcuup

AAAS is here – bringing scientific expertise to policy making.

Good science policy is the result of politicians understanding science and scientists understanding policy. Toward this end, AAAS manages the Science & Technology Policy Fellowships program, which embeds scientists and engineers in the federal government for up to two years. From Congress to the State Department, each class of Fellows contributes to the policy-making process while getting hands-on experience at the intersection of science and policy. As a AAAS member your dues support these efforts. If you're not yet a AAAS member, join us. Together we can make a difference.

To learn more, visit aaas.org/plusyou/fellows

www.aaas.org