Information-Theoretic Analysis of Molecular (Co)Evolution Using Graphics Processing Units
Michael Waechter, Kathrin Jaeger, Stephanie Weissgraeber, Sven Widmer, Michael Goesele, and Kay Hamacher

...AEERYAEYKEAFTLFDSDGD...
...TEEQGRQFRQMFEMFDKNGD...
...TDEQQRQYRQMFETFDKDGN...
...TKEQVEEFKQAFSMFDTDGD...
...SEEQVAEFKEAFDRFDKNKD...
...SKEQVAKFKEAFDRIDKNKD...
...SPEQVAEFKQAFSRFDKNGD...
...SEEQVAKFKAAFSRFDTNGD...
...PPEQVAKFKEVFSRFDKNGD...

June 18, 2012 | ECMLS 2012 | Michael Waechter & Stephanie Weissgraeber

Motivation
●A huge number of Multiple Sequence Alignments (MSAs) are available, some of them very large
  ● E.g., HIV protease [1]: > 45,000 sequences of length > 1400
●Put them to use for coevolutionary and structural analysis
●But: Our computations take > 25 days
[1] Pan et al.: “The HIV positive selection mutation database”

Outline
●In this talk we will show…
  ● MSA analysis using Mutual Information
  ● GPU parallelization & speed improvements
  ● 3-point Mutual Information
  ● an application to a well-known protein
  ● that the use of this is beneficial

Introduction – Mutual Information
●Given an MSA:
  Sequence 1: AEERYAEYKEAFTLFDSDGD...
  Sequence 2: TEEQGRQFRQMFEMFDKNGD...
  Sequence 3: TDEQQRQYRQMFETFDKDGN...
  Sequence 4: TKEQVEEFKQAFSMFDTDGD...
  Sequence 5: SEEQVAEFKEAFDRFDKNKD...
  Sequence 6: SKEQVAKFKEAFDRIDKNKD...
  Sequence 7: SPEQVAEFKQAFSRFDKNGD...
  Sequence 8: SEEQVAKFKAAFSRFDTNGD...
●Mutual Information between two columns (correlation coevolution)
●Iteration over all column pairs gives the MI matrix

Introduction – Shuffling Null-Model
●MI is sensitive to the underlying amino acid distribution
●Computational Normalization: Shuffling Null-Model [2]
●Is MI distinguishable from “random evolution” MI?
[2] K. Hamacher: “Relating sequence evolution of HIV1-protease to its underlying molecular mechanics”

Introduction – Shuffling Null-Model
●Compute original MI
●Iterate 10,000 times:
  ● Shuffle each MSA column
  ● Compute random MI matrix
●Normalize original MI using random MI
Illustration (each column shuffled independently):
  Original MSA:   AEER...  TEEQ...  TDEQ...  SEEQ...  SKEQ...  PPEQ...
  Shuffled MSA 1: SEEQ...  TDER...  TKEQ...  SEEQ...  APEQ...  PEEQ...
  Shuffled MSA 2: SEEQ...  SEEQ...  PEEQ...  TPEQ...  AKEQ...  TDER...
  ...

Massive parallelism
●Highly compute-intensive
●HIV-1 protease on a single core:
  ● MI computation for all column pairs: ~3.5 min
  ● Repeat for 10,000 iterations: > 25 days
●But:
  ● Computation of each MI matrix entry is independent of all others
  ● Shuffling of each MSA column is independent of all others
●Parallelizable (to hundreds of thousands of threads)

GPU Implementation
●Iterate 10,000 times:
  ● Shuffling
    – Map MSA columns to blocks of threads
    – Shuffle columns (GPU-suited algorithm)
    – Synchronize
  ● MI computation
    – Map MI matrix entries to blocks of threads (suitable for MSA access pattern)
    – Compute MI matrix entries
    – Synchronize
●Combine results & normalize original MI with randomized MI

Speed Results
  MSA                                                GeForce GTX 480   4 threads on Core i7-960   Speed-up
  Calmodulin (753 sequences of length 264)           1.1 min           13.4 min                   ~12x
  HIV-1 protease (> 45,000 seqs. of length > 1400)   1.85 days         7.3 days                   ~4x
●The speed-up depends on the problem size

Implications
●One order of magnitude speed-up
●Quickly redo previous steps (e.g., alignment) and recompute MI
●New analysis tool feasible: 3-point MI: Coevolution of a ‘3-clique’ of MSA columns
●Can we deduce more information from 3-point MI than we could from 2-point MI alone?

Calmodulin
●149 amino acids
●Ca2+ binding induces a conformational change
●Regulates various signaling pathways

Coevolution in Calmodulin – 2-point MI
●Finding coevolving pairs of amino acids
●Structural or functional connection
●Here: Coevolution within N- and C-terminus
  ● Ca2+ binding
  ● Propagation of conformational change
●Conserved inner helix
  ● No coevolution without variation

Coevolution in Calmodulin – 3-point MI
●‘3-cliques’ of amino acids
●Higher-order correlations
  ● Concerted motions
  ● Binding sites

Coevolution in Calmodulin – 3-point MI
●‘3-cliques’ of amino acids
●Color indicates the frequency with which an amino acid contributes to the set of ‘3-cliques’
●Key residues for important functions

Conclusions
●MI for coevolutionary analysis
●GPU implementation ~10x faster on typical MSAs
●3-point MI analysis possible in acceptable time
●3-point MI does reveal new insights
  ● Next step could be k-point MI
●It may be possible to detect key residues in unknown proteins

What happened since?
●Multi-GPU parallelization:
  ● Distribute Shuffling Null-Model iterations among GPUs
  ● First tests: 32 GPUs give ~32x speed-up (on top of the basic GPU speed-up!)

Thank you.
Please visit tinyurl.com/tud-comic for code & documentation or contact us.
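
Sketch: 2-point MI and the Shuffling Null-Model
The steps from the introduction slides (compute the original MI matrix, shuffle every MSA column, recompute MI, normalize) can be sketched on the CPU in plain Python. This is a minimal illustration, not the authors' GPU code. The MI formula is the standard one, MI(X,Y) = sum over (x,y) of p(x,y) log2(p(x,y) / (p(x) p(y))). The normalization is an assumption: the transcription does not preserve the formula from the slides, so a z-score against the shuffled MI values is used here. The function names, the 200 iterations (the slides use 10,000), and the fixed seed are all hypothetical choices for the example.

```python
import math
import random
import statistics
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equally long MSA columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), expressed with raw counts
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi


def mi_matrix(columns):
    """Symmetric matrix of MI values over all column pairs (diagonal left at 0)."""
    m = len(columns)
    matrix = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            matrix[i][j] = matrix[j][i] = mutual_information(columns[i], columns[j])
    return matrix


def normalized_mi(sequences, iterations=200, seed=0):
    """Return (original MI matrix, z-score matrix) under the shuffling null model.

    ASSUMPTION: the z-score normalization is illustrative; the slides do not
    preserve the exact formula.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*sequences)]
    m = len(columns)
    original = mi_matrix(columns)
    samples = [[[] for _ in range(m)] for _ in range(m)]
    for _ in range(iterations):
        # Shuffle every column independently: each column keeps its amino
        # acid distribution, but correlations between columns are destroyed.
        shuffled = [rng.sample(col, len(col)) for col in columns]
        rand = mi_matrix(shuffled)
        for i in range(m):
            for j in range(i + 1, m):
                samples[i][j].append(rand[i][j])
    z = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            mean = statistics.fmean(samples[i][j])
            std = statistics.pstdev(samples[i][j])
            # A fully conserved column gives zero spread; report 0 there.
            z[i][j] = z[j][i] = (original[i][j] - mean) / std if std > 0 else 0.0
    return original, z


# First four columns of the eight-sequence MSA shown on the slides.
msa = ["AEER", "TEEQ", "TDEQ", "TKEQ", "SEEQ", "SKEQ", "SPEQ", "SEEQ"]
original, z = normalized_mi(msa)
```

In this excerpt the third column is fully conserved (all E), so its MI with every other column is zero, which matches the "no coevolution without variation" point from the calmodulin slides. Every matrix entry and every column shuffle is independent of the others, which is the property the GPU implementation exploits.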