Implementing iRODS for Next Generation Sequencing Data Management

Gen-Tao Chiang Wellcome Trust Sanger Institute [email protected] ISGC, March 20, 2011

Outline
1. DNA Sequencing
2. Managing Data
3. iRODS
4. WTSI use case
5. Future Work

The Sanger Institute
Funded by Wellcome Trust.

• 2nd largest research charity in the world.
• More than 800 employees.
• Based in Wellcome Trust Genome Campus, Hinxton, Cambridge, UK (shared with EBI).
• Most cited in the UK (Science Watch, 2008).

Large scale genomic research.

• Sequenced 1/3 of the human genome (largest single contributor).
• We have active cancer, malaria, pathogen and genomic variation / human health studies.
• All data is made publicly available: websites, FTP, direct database access, programmatic APIs.


By Guy Coates

Data Centre
• Completed in 2005.
• 1,000 square metres of floor space split equally into four rooms.
• Capable of supporting up to 50,000 processors.
• Currently about 10,000 cores and 10 petabytes of storage.

Managing Data


DNA Sequencing


Capillary Based • In 2001, in the era of the HGP, DNA sequencing technology used a capillary-based approach. • Each sequencer produced about 115 kbp (thousand base pairs) per day (Mardis, 2011).

Next Generation Sequencing
Life sciences is drowning in data from our new sequencing machines.
Traditional sequencing:
• 96 sequencing reactions carried out per run.
Next-generation sequencing:
• 52 million reactions per run.
Machines are cheap(ish) and small.
• Small labs can afford one.
• Big labs can afford lots of them.


Illumina HiSeq
• Migrating to Illumina HiSeq since October 2010.
• 5 times more data than Illumina GA2.
• 20 machines on site.
• Makes data management extremely difficult.
http://www.illumina.com


ER Mardis. Nature 470, 198-203 (2011)

Output Trends

Our peak “old generation” sequencing:
• August 2007: 3.5 Gbases/month.

Current output:
• Jan 2010: 4 Tbases/month.
• In August 2007, total size of GenBank was 200 Gbases.

1000x increase in our sequencing output.

Improvements in chemistry continue to increase the output of machines.

[Figure: bar chart of monthly output in Gbases (axis 0 to 4500), Capillary versus Illumina; bars labelled 3.5 and 4000, January 2010.]

Data Growth
Current weekly sequencing: 3000 Gbase
Peak yearly capillary sequencing: 30 Gbase
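The growth figures quoted on the Output Trends and Data Growth slides can be checked with a few lines of arithmetic. This is only a sketch: the constants are the slide figures, and the helper name `fold_increase` is made up for illustration.

```python
# Figures quoted on the slides, in Gbases (10**9 bases).
PEAK_CAPILLARY_PER_MONTH = 3.5   # August 2007
ILLUMINA_PER_MONTH = 4000.0      # January 2010 (4 Tbases)
PEAK_CAPILLARY_PER_YEAR = 30.0
CURRENT_PER_WEEK = 3000.0


def fold_increase(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


monthly_ratio = fold_increase(ILLUMINA_PER_MONTH, PEAK_CAPILLARY_PER_MONTH)
week_vs_year = fold_increase(CURRENT_PER_WEEK, PEAK_CAPILLARY_PER_YEAR)

print(f"Monthly output grew roughly {monthly_ratio:.0f}x")
print(f"One week of output is {week_vs_year:.0f}x the peak capillary year")
```

The monthly ratio comes out slightly above the rounded "1000x" on the slide.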


Managing Growth
We have exponential growth in storage and compute.
• Storage/compute doubles every 12 months.
  2009: ~7 PB raw.

Gigabase of sequence ≠ Gigabyte of storage.
• 16 bytes per base for sequence data.
• Intermediate analysis typically needs 10x the disk space of the raw data.

Moore's law will not save us.
• Transistor/disk density: Td = 18 months.
• Sequencing cost: Td = 12 months.
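A rough sizing sketch based on the numbers above: 16 bytes per base, 10x working space, and the two doubling times. It assumes Td means "doubles every Td months" for both curves; the function names are illustrative and not part of any WTSI tool.

```python
BYTES_PER_BASE = 16        # from the slide
INTERMEDIATE_FACTOR = 10   # analysis space relative to raw data, from the slide


def raw_bytes(gbases: float) -> float:
    """Storage needed for the sequence data itself."""
    return gbases * 1e9 * BYTES_PER_BASE


def working_bytes(gbases: float) -> float:
    """Disk space typically needed for intermediate analysis."""
    return raw_bytes(gbases) * INTERMEDIATE_FACTOR


def growth_factor(years: float, doubling_months: float) -> float:
    """Growth after `years` for something that doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


weekly_gbases = 3000  # current weekly output quoted on the Data Growth slide
print(f"raw: {raw_bytes(weekly_gbases) / 1e12:.0f} TB per week")
print(f"working space: {working_bytes(weekly_gbases) / 1e12:.0f} TB per week")
print(f"3 years of sequencing growth: {growth_factor(3, 12):.0f}x")
print(f"3 years of disk density growth: {growth_factor(3, 18):.0f}x")
```

Under these assumptions demand grows twice as much as disk density over three years, which is the gap the "Moore's law will not save us" line points at.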

By Guy Coates

Economic Trends
The Human Genome Project:
• 13 years.
• 23 labs.
• $500 Million.

A human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 1000s and 10,000s of genomes.

Changes in sequencing technology are going to continue this trend.
• “Next-next” generation sequencers are on their way.
• One Pacific Biosciences RS test machine at WTSI now.
• $500 genome is probable within 5 years.

Managing Data


Bulk Data: data size per genome

Structured data (databases)
• Individual features (3 MB)
• Variation data (1 GB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• Intensities / raw data (2 TB)
Unstructured data (flat files)

[Figure label: Sequencing informatics specialists]

By Guy Coates

Bulk Data Management
We thought we were really good at it.
• All samples that come through the sequencing lab are bar-coded and tracked (Laboratory Information Systems).
• Sequencing machines fed into an automated analysis pipeline.
• All the data was tracked, analysed and archived appropriately.

Strict meta-data controls.
• Experiments do not start in the wet-lab until the investigator has supplied all the required data privacy and archiving requirements.
  – Anonymised data → straight into the archive.
  – Identifiable data → private/controlled archives.
  – Some data held back until journal publication.

[Figure: data pipeline. Labels: SRF, SRA, fastq (mainly for QC pipeline); analysis, alignment, assembly; further analysis, Ensembl annotation.]



We had been focused on the sequencing pipeline.
• For many investigators, data coming off the end of the sequencing pipeline is where they start.
• Investigators take the mass of finished sequence data out of the archives, onto our compute farms and “do stuff”.

Huge explosion of data and disk use all over the institute.
• We had no idea what people were doing with their data.


Alignment
Find the best match of fragments to a known genome / genomes.
• “grep” for DNA sequences.
• Use more sophisticated algorithms that can do fuzzy matching.
  – Real DNA has insertions, deletions and mutations.
  – Typical algorithms are maq, bwa, ssaha, blast.

Reference: ...TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC...
Query:                    CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTAGGTCATCACCAGCA

Look for differences • Single base pair differences (SNP). • Larger insertions/deletions/mutations.

Typical experiment: • Compare cancer cell genomes with healthy ones.
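To make the "grep with fuzzy matching" idea concrete, here is a deliberately naive sketch that slides the query along the reference and keeps the offset with the fewest mismatching bases. It handles substitutions only, not insertions or deletions, and it is not how maq, bwa, ssaha or blast work internally; the function name is made up for illustration. The two sequences are the ones shown on the slide.

```python
from typing import List, Optional, Tuple

REFERENCE = "TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC"
QUERY = "CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTAGGTCATCACCAGCA"


def best_alignment(reference: str, query: str) -> Optional[Tuple[int, List[int]]]:
    """Return (offset, mismatch positions within the query) for the best gap-free fit.

    Returns None when the query is longer than the reference.
    """
    best = None
    for offset in range(len(reference) - len(query) + 1):
        window = reference[offset:offset + len(query)]
        mismatches = [i for i, (r, q) in enumerate(zip(window, query)) if r != q]
        if best is None or len(mismatches) < len(best[1]):
            best = (offset, mismatches)
    return best


offset, mismatches = best_alignment(REFERENCE, QUERY)
for i in mismatches:
    # Each mismatch is a candidate single base pair difference (SNP).
    print(f"reference position {offset + i}: {REFERENCE[offset + i]} -> {QUERY[i]}")
```

Real aligners use indexes instead of a brute-force scan, because scanning every offset is far too slow for millions of reads against a whole genome.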

By Guy Coates

Assembly
Assemble fragments into a complete genome.
• Typical experiment: collect a reference genome for a new species.

“De-novo” assembly.
• Assemble fragments with no external data.
• Harder than it looks.
  – Non-uniform coverage, low depth, non-unique sequence (repeats).

By Guy Coates

Analysing Cancer Genomes
Cancer genomes contain a lot of genetic damage.
• Many of the mutations in cancer are incidental.
• Initial mutation disrupts the normal DNA repair/replication processes.
• Corruption spreads through the rest of the genome.

Today: find the “driver” mutations amongst the thousands of “passengers”.
• Identifying the driver mutations will give us new targets for therapies.

Tomorrow: analyse the cancer genome of every patient in the clinic.
• Variations in a patient's and a cancer's genetic makeup play a major role in how effective a particular drug will be.
• Clinicians will use this information to tailor therapies.


Accidents waiting to happen... From: (who left 12 months ago) I find the directory is removed . The original directory is "/scratch/ (who left 6 months ago)" ..where is it ? If this problem cannot be solved ,I am afraid that cannot be released.

Need a file tracking system for unstructured data!
• Investigators could not keep track of where their results were.
• Problem exacerbated by student turnover (summer students, PhD students, visiting researchers on rotation).

Big wins with little effort.
• Disk space usage dropped by 2/3.
  – Lots of individuals were keeping copies of the same data set “so I know where it is”.
• Team leaders are happy that their data are where they think they are.
  – Important stuff is on file systems that are backed up etc.

But:
• Systems are ad-hoc, quick hacks.
• We want an institute-wide, standardised system.
  – Invest in people to maintain/develop it.

Data Grid
• Many different science fields today require dealing with large and geographically distributed data sets. The size of these data sets has scaled up from terabytes to petabytes.
• Several issues combine:
  – large datasets
  – distributed data
  – computationally intensive analysis
• Data grid: a unified environment which allows users to deal with all of the above issues.
• Examples: SRB, dCache, CASTOR, etc.

iRODS Architecture


iRODS
• iRODS: Integrated Rule-Oriented Data System.
• Produced by the DICE (Data Intensive Cyber Environments) group at U. North Carolina, Chapel Hill.
• Successor to SRB.

Important Features
• Catalogue: maps logical file names to physical locations.
• Metadata: metadata can be attached to each file.
• Rule engine:
  – Manipulates files or the database, for example replicating data to multiple resources.
  – Implements policies.
• Easy to use client tools:
  – icommands
  – Web interface
  – API
• Federation
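The catalogue and metadata features can be pictured with a toy in-memory model: one logical path maps to one physical copy per resource, and attribute/value pairs hang off the logical path. This is a sketch of the concept only, not the iRODS catalogue schema or API; the class names and any physical paths used with it are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataObject:
    """One logical file: where its copies live and what is known about it."""
    logical_path: str
    replicas: Dict[str, str] = field(default_factory=dict)        # resource -> physical path
    metadata: Dict[str, List[str]] = field(default_factory=dict)  # attribute -> values


class Catalogue:
    """Toy catalogue kept in a dictionary, purely for illustration."""

    def __init__(self) -> None:
        self._objects: Dict[str, DataObject] = {}

    def register(self, logical_path: str, resource: str, physical_path: str) -> None:
        """Record that `resource` holds a copy of `logical_path`."""
        obj = self._objects.setdefault(logical_path, DataObject(logical_path))
        obj.replicas[resource] = physical_path

    def add_metadata(self, logical_path: str, attribute: str, value: str) -> None:
        """Attach an attribute/value pair to an already registered object."""
        self._objects[logical_path].metadata.setdefault(attribute, []).append(value)

    def locate(self, logical_path: str) -> Dict[str, str]:
        """Return resource -> physical path for every known copy."""
        return dict(self._objects[logical_path].replicas)

    def query(self, attribute: str, value: str) -> List[str]:
        """Return logical paths whose metadata contains attribute = value."""
        return sorted(path for path, obj in self._objects.items()
                      if value in obj.metadata.get(attribute, []))
```

Users only ever name the logical path; which disk holds the bytes can change without breaking their scripts.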

What are we doing with it?
Piloting it for internal use.
• Help groups keep track of their data.
• Move files between different storage pools.
  – Fast scratch space ↔ warehouse disk ↔ offsite DR centre.
• Link metadata back to our LIMS/tracking databases.

We need to share data with other institutions.
• Public data is easy: FTP/HTTP.
• Controlled data is hard:
  – Encrypt files and place on private FTP dropboxes.
  – Cumbersome to manage and insecure.

First Stage: A preservation system


[Figure labels: BAM; multiple NFS partitions.]


Rules in core.irb
• Forces the “seq” resource group as the default resource:
  acSetRescSchemeForCreate||msiSetDefaultResc(seq,forced)|nop
• Defines the preferred resource as the “seq” resource group and avoids the use of res-r2. The idea is to use res-r2 as a backup resource:
  acPreprocForDataObjOpen||msiSetDataObjPreferredResc(seq%res-g2)##msiSetDataObjAvoidResc(res-r2)|msiSetDataObjPreferredResc(seq%res-r2)##nop
• After the data has been put into the iRODS system, the data object is replicated to all the resources within the “seq” resource group:
  acPostProcForPut||msiSysReplDataObj(seq,all)|nop
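The three rules on this slide appear to share a four-field layout: rule name, condition, a chain of actions, and a chain of recovery actions, with fields separated by "|" and chained steps by "##". Assuming that layout, a small reader like the one below can help when reviewing a rule file; it is a reading aid only, not the iRODS rule engine, and the names `Rule` and `parse_rule` are made up.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    name: str
    condition: str        # empty string means "always applies"
    actions: List[str]    # run in order
    recovery: List[str]   # one recovery step per action


def parse_rule(line: str) -> Rule:
    """Split one rule line into its four fields (assumed layout, see above)."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}")
    name, condition, actions, recovery = parts
    return Rule(name, condition, actions.split("##"), recovery.split("##"))


rule = parse_rule(
    "acPreprocForDataObjOpen||msiSetDataObjPreferredResc(seq%res-g2)"
    "##msiSetDataObjAvoidResc(res-r2)|msiSetDataObjPreferredResc(seq%res-r2)##nop"
)
for action, recover in zip(rule.actions, rule.recovery):
    print(f"{rule.name}: do {action}, on failure {recover}")
```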

Data Replication
• At this moment, files can be replicated to both resources (res-g2 and res-r2) directly.

  irods-g1@irods-g1:/tmp$ iput repl.ir
  irods-g1@irods-g1:/tmp$ ils -l repl.ir
    irods-g1          0 res-g2          74 2011-03-11.12:11 & repl.ir
    irods-g1          1 res-r2          74 2011-03-11.12:11 & repl.ir

Data Replication
• May still miss replication!
  – Network stability
  – Servers temporarily not available
  – Unknown reasons
• We need extra protection:
  – One more rule using delayExec to check the replication periodically.
  – Running irule
  – Running irepl -BPr /seq in crontab
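A periodic check could also be done outside iRODS by reading `ils -l` output and reporting objects that are missing a copy on a required resource. The sketch below assumes the column layout shown on the previous slide (owner, replica number, resource, size, timestamp, "&" flag, name), treats only lines carrying the "&" flag as good copies, and will not cope with names containing spaces. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def replicas_by_object(ils_long_output: str) -> Dict[str, Set[str]]:
    """Map data object name -> resources holding a copy flagged with '&'."""
    replicas: Dict[str, Set[str]] = defaultdict(set)
    for line in ils_long_output.splitlines():
        fields = line.split()
        # Assumed columns: owner, replica no., resource, size, timestamp, '&', name.
        if len(fields) != 7 or not fields[1].isdigit() or fields[5] != "&":
            continue
        replicas[fields[6]].add(fields[2])
    return dict(replicas)


def under_replicated(replicas: Dict[str, Set[str]], required: Set[str]) -> List[str]:
    """Return names of objects lacking a copy on at least one required resource."""
    return sorted(name for name, found in replicas.items() if not required <= found)


listing = """\
  irods-g1          0 res-g2          74 2011-03-11.12:11 & repl.ir
  irods-g1          1 res-r2          74 2011-03-11.12:11 & repl.ir
"""
print(under_replicated(replicas_by_object(listing), {"res-g2", "res-r2"}))
```

An object with no good copy at all never appears in the listing, so this check complements the delayExec rule and the cron job rather than replacing them.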

Authentication
• iRODS supports password, GSI and Kerberos authentication.
• WTSI uses LDAP and AD. iRODS does not support LDAP.
• Kerberos was chosen because it can be integrated with the existing Active Directory.


Kerberos
• All users log in to the zone Sanger.
• Zone Sanger is federated to zone Seq.
• Users can use their original WTSI username/password to access iRODS and use icommands from their desktop or the farm. For example:

  gtc@irods-sanger1:~$ kinit
  Password for [email protected]:
  gtc@irods-sanger1:~$ ils /seq |more
  /seq:
    C- /seq/1003
    C- /seq/1008

Web Client


Web Client
• Does not support Kerberos yet.
• We will need to share data with other genome centres in the future.
• Needs federated authentication such as Shibboleth, etc.


WTSI Use case


WTSI Use Case
• Currently, WTSI users use iRODS mainly for managing and accessing sequencing Binary Alignment/Map (BAM) files, the binary representation of the Sequence Alignment/Map (SAM) file format.
• BAM files come from:
  – Illumina HiSeq
  – Conversion from fastq or old Illumina GA2 runs
• Each BAM file normally has a BAM index (BAI) file, which is stored in iRODS as well.
• Data from PacBio are stored in /pacbio, but only for testing purposes at this moment.

WTSI Use Case
• NGP Group is responsible for uploading the quality-controlled BAM files into iRODS.
• Quality control Perl scripts running on the sequencing farm use an iRODS upload Perl module to upload the data.


Metadata
• The following metadata are inserted at the upload stage: study, library, sample, id_run, lane, tag, tag_index and human_split.
• id_run, lane and tag_index are together unique within WTSI.
• Some BAM files don't have a tag_index, which means the file is for the whole run lane.
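Since id_run, lane and tag_index together identify a file, a loader could build that triple as a key and refuse duplicates before upload. The sketch below assumes metadata arrive as a plain dictionary of strings; a missing tag_index is represented as None to mean "whole lane". The helper names are hypothetical, and files produced by human_split may need an extra key component, which is not covered here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

RunKey = Tuple[str, str, Optional[str]]


def run_key(avus: Dict[str, str]) -> RunKey:
    """Return (id_run, lane, tag_index); tag_index is None for a whole-lane file."""
    return (avus["id_run"], avus["lane"], avus.get("tag_index"))


def duplicate_keys(records: Iterable[Dict[str, str]]) -> List[RunKey]:
    """Return every key that occurs more than once in `records`."""
    counts = Counter(run_key(record) for record in records)
    return [key for key, n in counts.items() if n > 1]


records = [
    {"id_run": "5635", "lane": "3", "tag_index": "2"},
    {"id_run": "5208", "lane": "2"},  # no tag_index: whole lane
]
print(duplicate_keys(records))
```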

Metadata
• Each BAM file belongs to one study, has one or more samples, forms one actual library for sequencing, and each sequence may have a tag sequence with it.
• Therefore, we added these metadata for the file: study, sample (may have more than one), library and, where present, tag.


BAM file with tag:

  deskpro19635[gq1]5: imeta ls -d /seq/5635/5635_3#2.bam
  AVUs defined for dataObj /seq/5635/5635_3#2.bam:
  attribute: type
  value: bam
  units:
  ----
  attribute: lane
  value: 3
  units:
  ----
  attribute: sample
  value: SZ0002
  units:
  ----
  attribute: reference
  value: /nfs/repository/d0031/references/Streptococcus_equi/4047/all/bwa/S-equi-4047.fasta
  units:
  ----
  attribute: study
  value: Streptococcus equi genome diversity
  units:
  ----
  attribute: tag
  value: CGATGTTT
  units:
  ----
  attribute: library
  value: SZ0002 1560825
  units:
  ----
  attribute: id_run
  value: 5635
  units:
  ----
  attribute: tag_index
  value: 2
  units:
  ----
  attribute: alignment
  value: 1
  units:
  ----
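Scripts that consume this listing usually want a dictionary rather than text. A minimal sketch, assuming the "attribute: / value: / units:" layout shown above with one field per line; values are collected in lists because an attribute such as sample may occur more than once. The function name `parse_avus` is made up and is not part of the icommands.

```python
from typing import Dict, List


def parse_avus(imeta_output: str) -> Dict[str, List[str]]:
    """Return {attribute: [values]} from `imeta ls -d` style text; units are ignored."""
    avus: Dict[str, List[str]] = {}
    attribute = None
    for raw in imeta_output.splitlines():
        line = raw.strip()
        if line.startswith("attribute:"):
            attribute = line[len("attribute:"):].strip()
        elif line.startswith("value:") and attribute is not None:
            avus.setdefault(attribute, []).append(line[len("value:"):].strip())
            attribute = None
    return avus


sample_listing = """\
AVUs defined for dataObj /seq/5635/5635_3#2.bam:
attribute: type
value: bam
units:
----
attribute: id_run
value: 5635
units:
----
attribute: tag_index
value: 2
units:
"""
print(parse_avus(sample_listing))
```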


Metadata
• Some BAM files contain non-consented human data, so we split the files into two parts: human and non-human. The public is not normally able to see the human part. The human_split attribute is used to indicate this situation.


Metadata
• Sequences in a BAM file may be aligned to a reference, so the metadata attribute 'alignment' has been created to indicate this. If the file has an alignment, we add a 'reference' attribute to indicate which reference was used. The following are some examples.


BAM without tag, containing only the non-consented human part:

  srpipe@sf-2-1-01:/nfs/sf44/irods$ imeta ls -d /seq/5261/5261_5_human.bam
  AVUs defined for dataObj /seq/5261/5261_5_human.bam:
  attribute: type
  value: bam
  units:
  ----
  attribute: study
  value: Plasmodium falciparum Illumina sequencing R&D 1
  units:
  ----
  attribute: reference
  value: /nfs/repository/d0031/references/Homo_sapiens/1000Genomes/all/bwa/human_g1k_v37.fasta
  units:
  ----
  attribute: sample
  value: PK0039
  units:
  ----
  attribute: human_split
  value: human
  units:
  ----
  attribute: lane
  value: 5
  units:
  ----
  attribute: library
  value: PK0039 455682
  units:
  ----
  attribute: id_run
  value: 5261
  units:
  ----
  attribute: alignment
  value: 1
  units:


Metadata
• Users can query the metadata to find the data they are interested in.


  gtc@irods-sanger1:~$ imeta qu -z seq -d study = 'Hyperplastic Polyposis'
  collection: /seq/5208
  dataObj: 5208_2.bam
  ----
  collection: /seq/5208
  dataObj: 5208_3.bam
  ----
  collection: /seq/5208
  dataObj: 5208_5.bam
  ----
  collection: /seq/5230
  dataObj: 5230_1.bam
  ----
  collection: /seq/5230
  dataObj: 5230_2.bam
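The query result is a list of collection/dataObj pairs, which a script would normally turn into full logical paths before fetching the files. A small sketch, assuming the two-line "collection: / dataObj:" layout shown above; the function name is illustrative.

```python
from typing import List


def parse_query_results(imeta_output: str) -> List[str]:
    """Return full logical paths from `imeta qu -d` style output."""
    paths: List[str] = []
    collection = None
    for raw in imeta_output.splitlines():
        line = raw.strip()
        if line.startswith("collection:"):
            collection = line.split(":", 1)[1].strip()
        elif line.startswith("dataObj:") and collection is not None:
            name = line.split(":", 1)[1].strip()
            paths.append(f"{collection}/{name}")
    return paths


query_output = """\
collection: /seq/5208
dataObj: 5208_2.bam
----
collection: /seq/5230
dataObj: 5230_1.bam
"""
for path in parse_query_results(query_output):
    print(path)
```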


Future Work


Non-Interactive environment
• Kerberos credentials have a limited lifetime. If jobs are queued in the batch system (LSF in our case), the credential may expire before the job runs.
• This means that Kerberos needs to be able to support a non-interactive computing environment.
• In theory, the lifetime can be configured by the AD administrator. In practice, it would go against WTSI policy to issue credentials with unlimited lifetime.
• A Grid MyProxy server may be the answer, but it means extra work for users.
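The expiry problem reduces to a date comparison: the ticket must still be valid when the job finishes, after both the queue wait and the run time. The sketch below uses made-up numbers (a 10-hour ticket, hypothetical wait and run times) and a hypothetical helper name; it does not query Kerberos or LSF.

```python
from datetime import datetime, timedelta


def ticket_outlives_job(ticket_expiry: datetime, now: datetime,
                        expected_queue_wait: timedelta,
                        expected_runtime: timedelta) -> bool:
    """Return True if the ticket is still valid when the job is expected to end."""
    return now + expected_queue_wait + expected_runtime <= ticket_expiry


now = datetime(2011, 3, 20, 9, 0)
expiry = now + timedelta(hours=10)  # hypothetical ticket lifetime

# Short queue: 2 h wait + 4 h run fits inside the ticket lifetime.
print(ticket_outlives_job(expiry, now, timedelta(hours=2), timedelta(hours=4)))
# Busy farm: 8 h wait + 4 h run does not.
print(ticket_outlives_job(expiry, now, timedelta(hours=8), timedelta(hours=4)))
```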

More zones
• Our idea is that each major project may have its own iRODS zone, with the zones federated to each other.
• Zone UK10K and zone Cancer are going to be created next.


Federation with Collaborators

[Figure: Future Collaborations. Federation access linking sequencing centre data, universities' data and small labs' data, with biologists accessing the data.]

Collaboration
• Typical projects take 18 months to 3 years.
• Unlike the HEP model, with its Tier 0 (CERN) and Tier 1 centres (RAL, ASGC, etc.).
• Small labs join and leave more frequently.
• There is no dedicated network.
• We will do a pilot iRODS federation with the Broad Institute first.

Moving computing to data?
• Moving large amounts of genomic data to computing resources is time consuming.
• Moving computing to data?
• The Resource Broker (WMS) of gLite is able to allocate computing jobs based on the metadata provided by each site.
• SRM-iRODS, developed by ASGC, becomes a very important part, gluing together the computing grid and the data management system we are using.
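A back-of-envelope estimate shows why moving the data is slow. The 200 GB figure is the per-genome alignment size from the Bulk Data slide; the 1 Gbit/s link speed and the assumption that the link is fully used are illustrative, not measured values.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move `size_bytes` over a link at the given efficiency (0 to 1)."""
    return size_bytes * 8 / (link_bits_per_second * efficiency) / 3600


ALIGNMENTS_PER_GENOME = 200e9   # bytes, from the Bulk Data slide
LINK = 1e9                      # bits per second, assumed

one_genome = transfer_hours(ALIGNMENTS_PER_GENOME, LINK)
thousand_genomes = 1000 * one_genome
print(f"1 genome: {one_genome * 60:.0f} minutes")
print(f"1000 genomes: {thousand_genomes / 24:.1f} days")
```

Shared, non-dedicated links will do worse than this ideal figure, which is the argument for sending the job to the site that already holds the data.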

Acknowledgements

Sanger Institute
• Phil Butcher
• Pete Clapham
• Guy Coates
• Kevin Sale
• Guoying Qi

STFC
• Jens Jensen