Profile feature – Carole Goble
A computer scientist decodes how new data-sharing platforms will catalyse discovery in the biosciences.
14 December 2011
How did your career develop?
I'm a computer scientist completely down the middle. I've done it since I was 16, starting at 'A' level in 1977, which was pretty unusual back then. I graduated from the University of Manchester in 1982 and now I'm a Fellow of the Royal Academy of Engineering because I'm an engineer. I build software for biologists and with biologists, but I am not a biologist or even a bioinformatician.
Professor Carole Goble, University of Manchester.
What drew you to computer science in the first place?
I watched a TV programme when I was 13 all about computers and giant brains and how they were going to do amazing stuff and I thought it looked cool, so I did a career project on computers just before my 'O' Levels.
And never looked back?
I'm a focused, determined person. That helps when you need to persuade people to use software.
What are you working on now?
My applications are trying to accelerate how we can link together and create integrated datasets and tool pipelines.
I do broadly three things. I build software platforms for integrating data and tools, specifically a scientific workflow system called Taverna, which was part-funded by BBSRC and is now being used on a large number of other projects, including in businesses. I build e-laboratories where scientists can collaborate and share resources, different types of assets, and cloud curate their resources. Examples include myExperiment, for sharing workflows and computational protocols, and MethodBox, for sharing variable sets and scripts for social science surveys.
And then I'm a computer science researcher in semantic web technologies and social computing in order to create knowledge collaborations.
I also work in new models of communication, such as new ways of publishing beyond the traditional science paper – the executable paper, for example – and digital preservation of knowledge assets, models and representation. So everything I do is knowledge management really.
Taverna is one of Goble's many projects.
Tell us about a specific major project...
One of the projects BBSRC funded was the BioCatalogue, a catalogue of web services for the life sciences, as there are thousands of datasets and tools in the biology community. There are over 2000 services registered in the BioCatalogue. It was a joint venture between my lab and the European Bioinformatics Institute and is now the foundation of several European projects.
You must lead a large team...
I have 14 postdocs in my lab: three are scientific informaticians, a couple are computer science researchers and the rest are software engineers. This is unusual, as I produce open source and free software products and services – I don't just do the research.
Do you focus on projects for the biology community?
I started in life sciences, but we produce generic products that can be adopted by other communities. For instance in the US, Emory University have rebadged BioCatalogue as EdUnify in order to catalogue their web services on education.
BioCatalogue is the biggest curated catalogue of web services for life sciences in the world. Image: BioCatalogue
What else has BBSRC funded?
SysMO-DB (Systems Biology for Microorganisms Database) has received two rounds of funding totalling around £1.7M from BBSRC. It is one of our 'eLaboratories' and a classical knowledge exchange project. It's under a pan-European consortium called ERASysBio that brought different European funding councils together to fund groups in systems biology.
And what did the project actually do?
We built a platform for the 13 consortia to store, share and link together data, models, standardised operating procedures and publications, as well as collaboration support, like a yellow pages of who is who in the projects. And we persuaded them to use it.
Now the software we built is the foundation of data sharing for the German Virtual Liver project and is being rolled out to systems biology centres and multi-partner projects in the UK and across Europe.
Is it easy to get scientists to sign up to these new e-platforms?
Not at all. For SysMO-DB we have a group of PhD students and postdocs – the PALs – who were representatives, advocates and champions of the individual projects. We go bowling together, drinking together. We kind of created a cult, frankly, that overcame the barriers of the project.
So the people factor is still important?
The people factor is the dominant factor. We have to adapt what we're doing to suit what's best for people and let computer science help and not get in the way. Having said that, being smart is figuring out what the things are that mean people go from "won't use it" to "want to use it".
Why is data-sharing so important?
A huge percentage of what BBSRC funds (raw data) gets thrown away! Of course not all data should be preserved. And mostly it's not mindfully thrown away, it's just not managed. A repository for results for a project, where does it go? Onto a PhD student's hard drive who's just left? On a machine that no one knows how to use anymore? Is the data sufficiently well described so someone can use it six months later? And when projects wind up there's no-one left to manage the wiki (web-based site for data), or there's data on hard drives in different institutions. That's the reality – you'd be stunned.
There's a big drive in BBSRC to improve data management, and a recognition that universities have a responsibility to manage the data their staff and students produce, but that is just getting started.
In principle people should be using well-established laboratory e-notebooks, but adoption is at best patchy, and the very best e-notebooks cost money.
What other problems are your work addressing?
There is an increasing concern that it is difficult to replicate results because there isn't enough data in the paper to reproduce the study. People don't data share, they 'data flirt'. They release enough metadata (data about the data) so that others are intrigued, but can't replicate the result or reuse it right off the bat. The problem is that there's no real reward mechanism in place for sharing and for the work needed to make reusing what is shared possible. For examples, see Science/AAAS 'Data replication & reproduction', PLoS ONE 'Public availability of published research data in high-impact journals' and Social Science Research Network 'The conundrum of sharing research data'.
Is that a major problem?
Absolutely! There are technical mechanisms in place to accomplish data sharing and data citation. BBSRC's latest BBR (Bioinformatics and Biological Resources) call references this by asking for projects to do with data citation. But you need to couple that with the cultural situation, in that data curation and data contribution are not yet fully valued; rewarded impact should not just be about publishing an article in Nature. It's a really tricky situation because the person wanting to use your data is likely to be a rival.
What's the answer?
We need to create a social environment where data is recognised as publications are recognised, and you can get credit for curating and contributing data. Credit is needed where credit is due.
Do you also make commercial applications for sale?
Our software is open source. We give all our software away free and run our services free. We don't have spin-out companies, but we do have a commercial partner – Eagle Genomics Ltd – to build bespoke products and consultancy services based around our products.
TAMBIS was a highly influential project in the early semantic web.
How did you get into all this?
Back in 1995 I met a colleague, Andy Brass, in a corridor who had a problem combining data sets. I had a background in databases and knowledge management, so we started a project called TAMBIS (Transparent Access to Multiple Bioinformatics Sources), which was co-funded by BBSRC. It would use an ontology to work out which databases you would need to go to in order to answer the question you posed. We were one of the drivers of ontology-based integration systems in the life sciences.
What do you mean by ontologies in this sense?
It's a way of organising and relating concepts and the terms that name those concepts. We have this situation big time in biology. We want to be able to describe things: what does this protein do? What does this gene do? What species, what sample, what does a sequence mean? When we annotate databases we don't want to use free text; we want controlled terms so we can link them together and organise them.
So in biology there are over 150 ontologies – take a look at Biosharing.org or the National Center for Biomedical Ontology. It's a big business and we were one of the leading groups in the 90s with TAMBIS, which was very early on – too early – but now these things are much more bedded in.
So where is this all going? Where will we be in 5-10 years?
There will be a lot more computational bioinformatics science, which will not be considered unusual but standard practice. It will be normal to do data science using shared resources. We will collect more data of course, but there will be a much more equal balance between 'dry' and 'wet' science.
We're in that transition phase at the moment. In the next 10 years with a new generation of digital natives we will move to a seamless expectation that digital science will always be partnered with wet science. Scientists will use these digital tools more often and it will become routine. And for that we need people who build and manage the resources, infrastructure and software, and they will be more recognised as they are building the foundation of your science.