Expanded web feature: The changing face of our data landscape
Overcoming fundamental obstacles to the benefits of data sharing
In order to maximise the potential benefits that data sharing has to offer, funders, researchers and society as a whole need to overcome some fundamental obstacles.
Just 8 weeks after the first reported case of swine flu in Mexico, a paper was published in Nature detailing the origins and evolution of the H1N1 outbreak. The paper was as notable for the speed at which it came to fruition as for the findings themselves.
Using public databases, 4 research teams across 3 different time zones worked together to analyse H1N1 gene sequences. The result was not only an international paper but also a tangible example of the benefits of data sharing.
Dr Andrew Rambaut, one of the researchers involved from the University of Edinburgh, says, “What was new about the way this research was conducted is that data were being made available on public databases in real time. This allowed us to analyse what was happening and then post the analysis to a wiki site as we did them – a sort of open lab book approach”.
Data sharing is increasingly shaping the research landscape. For funders it offers value for money; for researchers, greater scope for their work; and for society as a whole, the possibility of accelerating scientific progress. However, to maximise the benefits, there are some fundamental obstacles to overcome: resources, volume, skills, access, funding, ownership and mindset, to name a few.
Researchers are increasingly developing sites, such as wikis, to share data among their immediate colleagues and wider research communities for the duration of their projects.
However, a coordinated top-down approach is essential to ensure that a cyberinfrastructure exists to store, support and ultimately make accessible the exabytes (1 EB equals a billion gigabytes) of data that already exist and continue to be generated daily by labs across the globe.
Over the past few years, funding bodies in the UK and internationally have been developing policies to optimise the use and value of data produced during the course of projects they fund.
BBSRC published its data sharing policy in 2007. It states: ‘BBSRC is committed to getting the best value for the funds we invest and believes that helping to make research more readily available will reinforce open scientific enquiry and stimulate new investigations and analyses’.
All researchers applying for BBSRC grants have to submit a statement on data sharing as part of their application. Alongside this commitment to individual data sharing, BBSRC also funds projects that develop tools, resources and technologies to enable successful data sharing, through the Bioinformatics and Biological Resources Fund (BBR).
And in August this year, BBSRC committed £10M to a major emerging pan-European science project – the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) – to help it dramatically increase its data storage and handling capacity.
The funding is the first step in developing the existing data resources and IT infrastructure of EMBL-EBI towards its planned role as the central hub of the emerging European Life-Science Infrastructure for Biological Information (ELIXIR), an initiative involving 32 partners from 13 countries aimed at establishing a sustainably funded infrastructure for biological information in Europe.
All in the name
Storage solutions and software are only part of the issue. The data are only as good as the way in which they are labelled. Out of context, data often become meaningless and, if not correctly labelled, cannot be accurately retrieved.
Data curation has become a highly skilled area and, because of this, is time-consuming and costly. Larger databases employ teams of specialist curators, but anyone sharing data needs to be their own curator.
Minimum information standards, such as those of the Minimum Information for Biological and Biomedical Investigations (MIBBI) project, exist and their uptake is increasing, but according to an EBI survey 60% of researchers do not yet use these standards.
And where research teams do set up data sharing solutions, these often go out of date once the project comes to an end, because websites and databases are frequently funded only for the duration of the project.
The altruistic researcher
Policies, technology and infrastructure can only achieve so much. Data sharing requires a major shift in mindset too – individual scientists need to want to share their data.
Carole Goble, Professor of Computer Science at the University of Manchester, explains, “Scientists see sharing data as a great risk, with little reward and a huge amount of effort required to publish shareable data. Everyone wants to use data, but no one likes to create the metadata needed to interpret it”.
In a culture driven by publications and citations, generating data for the greater public good does not necessarily secure funding for researchers and their labs.
“We need a ‘collaborate to compete’ mentality. Academics need to believe that data sharing will build their reputation. For this to happen, systems need to be in place to ensure scientists whose data are used are credited,” says Goble.
Data citation is a possible way to ensure peer recognition and reputation building – with scientists being duty bound to cite data sources. The merits and details of how best to enforce this are being debated.
Towards an end result
There are many obstacles still to overcome, and much work to be done, from individual labs through to government departments internationally. But the landscape as it stands is already enabling new ways of working, as highlighted by the work of Dr Rambaut, who relies entirely on data generated by other scientists.
“My work is entirely computational so I rely on publicly available, high quality and well annotated data being made available in a timely fashion. By making our analysis of these data available as we perform them I hope we are encouraging the labs doing the data collection that sharing their data is a worthwhile endeavour,” he explains.
Dr Rambaut’s research ticks all the boxes for why data sharing is so vital. Alongside being cost-effective, resource-saving and science-enhancing, this way of working has very real public health implications in his field of influenza research.
SysMO is a pan-European initiative involving 11 projects and over 90 institutes across Europe, all working in the field of systems biology of microorganisms.
One of the main aims of the initiative, started in 2007, is to pool research capacities and know-how across systems biology research in Europe. The initiative has multiple funders across Europe, including BBSRC.
Professor Carole Goble, from the School of Computer Science at the University of Manchester, leads the SysMO Data Management Project – SysMO-DB – and runs a focus group, SysMO-PALs, alongside the initiative to help build a data sharing infrastructure. SysMO-DB provides a platform for the management and exchange of data and models between the 91 groups comprising the SysMO initiative.
“This initiative highlights many of the data sharing issues experienced across science. Hundreds of researchers are involved across the projects, some are collaborators others are competitors, yet they are required to share data across the SysMO family.”
A framework and buy-in from the researchers are necessary to achieve successful data sharing across projects and countries, and this is where Professor Goble’s work comes in.
She and her team work with two bench scientists (PALs) from each of the 11 SysMO projects, finding out how they are prepared to share their data and developing systems to do this.
“The key with SysMO is that everything is under the control of the scientists. We work closely with the PALs so they understand the positives and benefits of data sharing. As well as developing the tools and systems to effectively share data, there is a social engineering element as well – making scientists aware of the importance of it,” explains Professor Goble.
Wheat functional genomics resources
With funding from BBSRC’s BBR initiative, the University of Bristol is working on a 5-year project to collate and host the UK’s Wheat Functional Genomics Resources. This resource consists of data generated by the UK research community, which others, not directly associated with the original project, might find useful. The resources are both physical and electronic.
Professor Keith Edwards, who is involved in the project, says: “Before this project, UK-based resources for wheat research were often isolated and tailored to in-house needs, as such they were often invisible to or inaccessible by the outside world.
“The advantage of the Wheat Functional Genomics Resources is that users will have a one-stop shop to find out if the resources they want are available and from whom. From a funder’s point of view it means that maximum use is made of the available resources.”
With the backing of the wheat community, in particular Monogram and the Wheat Genetic Improvement Network, the BBR project aims to bring together all relevant resources under one umbrella – namely the existing Monogram Network and its associated website.
“It is important to emphasise that if we are to make the most of our limited resources and address issues such as food security we need to make more of the resources we have. One of the best and cheapest ways to do this is to make the current resources available to as many people, academics and industries as possible,” says Professor Edwards.
This article is based on research published in Nature, 459: 1122-1125 (2009).