Earlham Institute

The Earlham Institute is a specialist in genomics and bioinformatics, focusing on the interpretation of data to enable and drive bioscience in research and industry. Its goal is to be at the forefront of data-intensive science in biology, and a leader in bioinformatics innovation and the application of genome technology.

The Earlham Institute is a centre of UK National Capability, committed to applying genomics knowledge and expertise through enterprise, collaboration and skills development to advance scientific knowledge and promote economic growth.


The institute is a specialist in genomics, including DNA sequencing, whole genome-scale analysis and bioinformatics for the analysis and interpretation of sequencing data to enable advances in the food, health and environmental biosciences.

The Earlham Institute is based on Norwich Research Park. Copyright: Earlham Institute

Opened on 3 July 2009 with a team of over 70 staff including genomics scientists, technologists and bioinformaticians, the institute has already established itself as an expert partner in high-throughput sequencing. Projects have included sequencing a rubber tree genome in collaboration with an industrial client and contributing as a partner to the International Wheat Genome Sequencing Consortium.


The institute combines the application of multiple sequencing technology platforms for high quality sequence data generation with access to large data storage and handling resources, applying a range of bespoke software tools for analysis and interpretation of data.

It is equipped with next-generation sequencing platforms such as the Illumina HiSeq 2000, Illumina MiSeq, PacBio RS, OpGen Argus and Roche 454 FLX for high-throughput sequence generation, complemented by capillary sequencers for small projects and sequence improvement. The institute engages with platform developers to ensure early access to new platforms and their integration into the institute's portfolio of projects and tools.

DNA sequencing forms the backbone of the Earlham Institute's activities. Copyright: Earlham Institute

The institute's state-of-the-art computing hardware installation for data processing provides sufficient data storage capacity and, crucially, enables fast and flexible processing of the huge quantities of data generated in any high-throughput sequencing project. The institute's data centre currently houses the world's largest Red Hat Linux system, and the supercomputer provides six terabytes of RAM for processing and 600 terabytes of fast disk storage. A second phase of development for the data centre has recently been completed, and the institute has just taken delivery of a third UV system, with 2,560 cores and 20 terabytes of RAM, to enhance our capability for assembling and analysing larger sequencing data sets.


The Bioinformatics division at the Earlham Institute specialises in the analysis of high-throughput sequencing data including de novo [from the beginning] assemblies, re-sequencing projects, expression analysis (RNA-seq) and metagenomics.

The division is organised into three teams: Sequence Informatics, Computational Genomics and Genome Analysis. We also work closely with the University of East Anglia/Earlham Institute Biostatistics team.

The Sequence Informatics Team is responsible for the analysis and management of primary sequencing data at the institute. The aim of the team is to work alongside both the 'wet' lab and sibling bioinformatics groups to be the first port of call for all aspects of data generated by the sequencing machines, including related metadata. Within it, the Core Bioinformatics Team undertakes software development and system administration of our in-house solutions for data management. These include MISO – the institute's Lab Information Management System (LIMS) – which is a bespoke, open-source platform for recording metadata for next-generation sequencing experiments. As part of the Earlham Institute's core remit, the team regularly submits sequencing data to public repositories in the form of raw reads, preliminary contigs, draft assemblies and annotation features.

The Computational Genomics Team is responsible for implementing and executing the high-throughput analysis and annotation pipelines at the institute. Activities include the generation of gene sets and the implementation of functional and variation analyses, including genotype-to-phenotype association studies. The team also develops tools for novel bioinformatics applications in the context of next-generation sequencing and works on data presentation and visualisation tools. Within it, the Crop Genomics Team focuses on the high-throughput assembly and annotation of genomic sequence from crop species.

The Genome Analysis Team is responsible for organising genome annotation and specialised data analysis at the institute. The aim is to analyse, annotate and curate genomic sequences. These activities are held to high quality standards, with a focus on detailed and specific aspects such as rare isoforms, gene families and metabolic pathways. Within it, the Microbial Genomes Group analyses next-generation whole-genome sequence information for a large variety of bacteria and small eukaryotes, with the aim of annotating and visualising comparative genomics data.
