Challenging algorithms in bioinformatics - UiO

7 downloads 236 Views 4MB Size Report
Sep 19, 2012 - Dynamic programming. (Smith-Waterman). FASTA. BLAST. Multiple. Dynamic ... E -1 0 0 2 -4 2 5 -2 0 -3 -3 1
Challenging algorithms in bioinformatics Torbjørn Rognes [email protected] Department of Informatics, UiO 19 September 2012

Definition:

What is bioinformatics?

Bioinformatics is the development and use of computational and mathematical methods to gather, process and interpret molecular biological data. Aim of research: To increase our understanding of the connections between biological processes at different levels while developing better theories and methods in computer science and statistics. An interdisciplinary subject: Computer science/statistics + biology/medicine

Bioinformatics at many levels in biology

Molecules DNA Genomics Assembly Gene finding

Organ

Neuroinformatics

RNA

Protein

Transcriptomics

Sequence analysis

Systems biology

RNomics

Structural biology Proteomics

Microarrays

Sequence analysis

RNA folding

Annotation

RNA-Seq

Read mapping

Cell

Drug design MS analysis

Organism

Population

Personal medicine

Population genetics

Cellular simulations

Evolutionary biology

SNP analysis

Cancer biology

Metagenomics Phylogenomics

ChIP-Seq All levels: mining the literature (natural language processing)

Biobank data analysis

Some basic molecular biology  

The genome is our genetic material The genome consists of DNA 



DNA is built from nucleotides 

  

 

4 nucleotides or bases: A, C, G, U(racil)

RNA is translated into proteins based on codons (3 bases) Proteins consist of amino acids (on average 350 aa) There are 20 different natural amino acids: 



4 nucleotides or bases: A(denine), C(ytosine), G(uanine), T(hymine)

Genes are pieces of DNA DNA is transcribed into RNA when genes are expressed RNA is similar to DNA, but with a different backbone and uracil 



From ~2 to ~3000 million nucleotides

A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y

Proteins form 3D structures that act as molecular machines or as structural building blocks 4

Important challenges in bioinformatics Examples of important computational challenges in bioinformatics (hardest problems first):     

Protein structure prediction Whole-genome de novo sequence assembly Multiple sequence alignment (with many divergent sequences) Sequence database searching Sequence read mapping

5

Protein structure prediction 

Hardest problem (“Holy grail”): predict 3D protein structure directly from sequence   



Protein secondary structure prediction (easier)  



“ab initio“ “homology modelling” “threading” Predict helixes, strands and loops Not 3D

“Folding@Home”

6

Protein 3D structure and design

MPARALLPRRMGHRT LASTPALWASIPCPR SELRLDLVLPSGQSF RWREQSPAHWSGVLA DQVWTLTQTEEQLHC TVYRGDKSQASRPTP DELEAVRKYFQLDVT LAQLYHHWGSVD...

Structure prediction

Protein design

Whole genome sequence assembly 





   

Genome sequencing results in millions of small pieces of the full genome The challenge is to puzzle these together in the right order Genome size ranging from 2Mbp (bacteria) to 3Gbp (human) Read size from 30 bp to 1000 bp Sequencing errors Natural variation (allels) Repeats and similar regions

Whole genome sequence assembly

All the pieces must be puzzled together

Two genome assembly strategies

The overlap-layout-consensus approach

Alignment variants and algorithms Global

Pairwise

Dynamic programming (Needleman-Wunsch)

Local

Dynamic programming (Smith-Waterman) FASTA BLAST

Multiple

Dynamic programming

Dynamic programming

ClustalW T-Coffee MUSCLE MAFFT

MEME

Note: dynamic programming gives an optimal solution, but is extremely space and time demanding with many sequences (n > 2) 13

Pairwise sequence alignment

E.coli AlkA

Hollis et al. (2000) EMBO J. 19, 758-766 (PDB ID 1DIZ)

Human OGG1

Source: Bruner et al. (2000) Nature 403, 859-866 (PDB ID 1EBM)

E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183 ++| + |+ | +| || + | ||+ | || + +| |+ ||+ || + H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209 E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL | | || | |+| | | ||+| |+ | H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL

225 256

Common alignment scoring system Substitution score matrix    

Score for aligning any two residues to each other Identical residues have large positive scores Similar residues have small positive scores Very different residues have large negative scores

Gap penalties    

BLOSUM62 amino acid substituition score matrix

A R N D C Q E G H I L K M F P S T W Y V

A

R

N

D

C

Q

E

G

H

I

L

K

M

F

P

S

T

W

Y

V

4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0

-1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3

-2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3

-2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3

0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1

-1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2

-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2

0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3

-2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3

-1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3

-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1

-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2

-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1

-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1

-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2

1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2

0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0

-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3

-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1

0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

Penalty for opening a gap in a sequence (Q) Penalty for extending a gap (R) Typical gap function: G = Q + R * L, where L is length of gap Example: Q=11, R=1

E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183 ++| + |+ | +| || + | ||+ | || + +| |+ ||+ || + H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209 E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL | | || | |+| | | ||+| |+ | H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL

225 256

How to find the best alignment(s)?     

There are too many possible alignments of two sequences to enable examination of every possible alignment individually There is an algorithm to identify the best (optimal) alignment(s), i.e the one(s) with the highest score The algorithm is of the dynamic programming (DP) type The algorithm was initially described by Needleman and Wunsch in 1970 Two steps:  



First, identify the highest possible score using DP Then, identify the alignment(s) with the highest score (using temporary results from the initial step)

Dynamic programming:  

General method for solving recursive problems by storing temporary results from smaller problems along the way Used to solve many problems in bioinformatics

16

Dynamic programming Assume that we are aligning the sequences x and y. We may denote the cells using coordinates: φ

y1

y2

y3

y4

φ

(0,0)

(0,1)

(0,2)

(0,3)

(0,4)

x1

(1,0)

(1,1)

(1,2)

(1,3)

(1,4)

x2

(2,0)

(2,1)

(2,2)

(2,3)

(2,4)

x3

(3,0)

(3,1)

(3,2)

(3,3)

(3,4)

x4

(4,0)

(4,1)

(4,2)

(4,3)

(4,4)

• Any path to (s,t) corresponds to a global alignment of (x1:s, y1:t). • Any such alignment (w,z) has a score S(w,z). • Let B(s,t) be the highest score (= optimal score) among these 17

Dynamic programming Assume a score function S(...) with gap penalty g. We are going to fill in the optimal score in each cell: φ

y1

y2

y3

y4

φ

B(0,0)

B(0,1)

B(0,2)

B(0,3)

B(0,4)

x1

B(1,0)

B(1,1)

B(1,2)

B(1,3)

B(1,4)

x2

B(2,0)

B(2,1)

B(2,2)

B(2,3)

B(2,4)

x3

B(3,0)

B(3,1)

B(3,2)

B(3,3)

B(3,4)

x4

B(4,0)

B(4,1)

B(4,2)

B(4,3)

B(4,4)

Optimal score for the alignment of x and y

B(0, 0) = 0 B(i, 0) = S ( x1 , φ ) + ... + S ( xi , φ ) = −ig B(0, j ) = S (φ , y1 ) + ... + S (φ , y j ) = − jg  B(i − 1, j − 1) + S ( xi , y j )  B(i,= j ) max  B (i − 1, j ) + S ( xi , φ )  B(i, j − 1) + S (φ , y ) j  18

Dynamic programming Score function: • S(a,a) = 2 • S(a,b) = 0 for a ≠ b • S(a,-) = -1 Sequences: x = TDWAX y = ZTDX Computing B(i,j) values:

Z

φ

0

-1

-2

D -3

T

-1

0

1

0

-1

D

-1

0

3

W

-2 -3

-2

-1

2

2 3

A

-4

2

-5

-2 -3

1

X

-3 -4

0

3

φ

T

X -4

19

Dynamic programming Z

φ

0

-1

-2

D -3

T

-1

0

1

0

-1

D

-1

0

3

W

-2 -3

-2

-1

2

2 3

A

-4

2

-5

-2 -3

1

X

-3 -4

0

3

φ

T

X -4

ZTD--X -TDWAX S = (-1) + 2 + 2 + (-1) + (-1) + 2 = 3

20

Algorithmic complexity 





Assume that we are aligning two sequences of length m and n Memory: O(nm) A fixed number of tables (one or two) with n*m cells: constant * nm A fixed number of additional variables: constant Little memory needed if we are only interested in the best score Time: O(nm) Calculate B(i,j) and P(i,j) for n*m cells in the table: constant * nm Perform traceback: ≤ constant * (n+m)

21

Multiple sequence alignment    

Align three or more sequences Show corresponding amino acids in the different proteins Place gaps at correct positions Impossible to solve optimally by brute force for more than a few short sequences

22

Finding the best multiple alignment    



To find the best multiple sequence alignments the MSA programs will try to find the one with the highest score The score is usually the sum-of-pairs-score or similar Corresponds approximately to the sum of all pairwise alignment scores For the alignment A of m sequences s1 til sm we have the sum-of-pairs score S(A):

S(a,b) is the pairwise score of a and b, and s-i is the projection of si, that is, si with inserted gaps

The sum-of-pairs score M

Q

P

I

L

L

L

M

L

R

-

L

L

-

M

K

-

I

L

L

L

M

P

P

V

L

I

L

score(k) = S(P,R) + S(P,-) + S(P,P) + S(R,-) + S(R,P) + S(-,P) score for column k = 3

We have S(-,-) = 0

Total score = score(1) + score(2) + .... + score(N)

Clustal W 



One of the most commonly used and wellknown tools for multiple sequence alignment. Now somewhat outdated and surpassed by other tools. Uses a progressive algorithm: Always starts with the most similar sequences and then aligns less similar sequences with each other.

MUSCLE    

MUSCLE = Multiple Sequence Comparison by Log Expectation Iterative procedure: improves the alignment gradually until good enough by introducing random changes in the alignment Very high quality of alignments Much faster than Clustal W

26

Searching sequence databases • Goal: Identify which sequences in a database are significantly similar to a given DNA, RNA or protein sequence. • How: The query sequence is compared (aligned) with each of the database sequences, and the amount of similarity is determined for each database sequence. Example: Query sequence:

acgatcgattagcca

Database sequences: Identical (trivial): Very similar (easy): Similar (moderate): Very diverged (hard):

acgatcgattagcca acgaccgatgagcca atgacggatgagcga atgacgggatgagcga 27

Searching databases 





  

Based on local alignments of query sequence with every database sequence Exhaustive / optimal / brute-force: Smith-Waterman Heuristic: BLAST, FASTA, PARALIGN, SSAHA, PatternHunter, BLAT, ... Speed difference: about 50X Sensitivity Specificity

28

Heuristic search algorithms    

Many heuristic algorithms creates an index into either the query sequence or the database sequences The index / hash key is short for sensitive algorithms and longer for quicker methods Allows fast lookup of matching positions Software: 



BLAST (w/variants), FASTA, SSAHA, BLAT, PatternHunter

Index key size      

2-7 amino acids (BLASTP) 1-2 amino acids (FASTA) 1-6 nucleotides (FASTA) 7-28 nucleotides (BLASTN, MEGABLAST) 10 nucleotides (SSAHA) 11 nucleotides (BLAT)

29

Short read mapping Input data:  10-100 million reads, each 50-150bp  Sequencing errors (typ. 1% error rate) Reference genome:  E.g. human genome, 3 Gbp  Genome variants

ACGTACGTACGTACGTACGTACGTACGTACGTACGT

Output:  0, 1, or more potential genomic locations of each read  Mapping quality assignment Requirements:  Speed, sensitivity, specificity

TCAGTGCATTGCACCCTGGGTGTGATCGTGCATGTG

CACGTACGTGGTTTTGCAACGTGCTGATGCTAGCAT

Short read mapping algorithms 

Q-gram filtering  



“Spaced seeds”  



At least one relatively long match A few mismatches allowed Multiple short matches Multiple mismatches allowed

Suffix array / Burrows Wheeler Transform (BW) / FM-index 

Exact match

31

Burrows-Wheeler 

  

The entire reference genome is initially transformed according to the BurrowsWheeler Transform and compressed into the FM-index. Align base by base of the read from the end (suffix). As the read is traversed, all potential locations are reported. If no match is found when a new base is aligned, then back up and try a base substitution.

Burrows-Wheeler transform