RNF: a method and tools to evaluate NGS read mappers

Karel Břinda, Valentina Boeva, Gregory Kucherov

Introduction

Aligning reads to a reference sequence is a fundamental step in numerous bioinformatics pipelines. The sensitivity and precision of the mapping tool can critically affect the accuracy of the produced results. Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. In the absence of a standard for encoding read origins, every evaluation tool had to be made explicitly compatible with the simulator used to generate the reads. To overcome this obstacle, we have created a format, RNF (Read Naming Format), and an associated software package, RNFtools.

RNF

Description: Read Naming Format, a generic format for assigning read names that encode the original positions of the reads.
Specification: http://karel-brinda.github.io/rnf-spec/

Read Naming Format

An RNF name consists of four parts: prefix, read tuple ID, segments of reads, and suffix (with comments and extensions). Each segment has the form (genome ID, chromosome ID, direction, leftmost coordinate, rightmost coordinate).

Example:
sim__0043fd1__(3,13,F,027871,027970),(3,13,R,029171,029270)__[paired_end],C:[100=,42=1X47=]

Example of simulated read tuples

Source 1 - reference genome
  coordinates  12345678901234-5678901234567890123456789
  chr 1        ATGTTAGATAA-GATAGCTGTGCTAGTAGGCAGTCAGCCC
  chr 2        ttcttctggaa-gaccttctcctcctgcaaataaa
Source 2 - generator of random sequences

READS:
  r001    ATG-TAGATA ->
  r002/1  TTAGATAACGA ->
  r002/2  <- TCAG-CGGG
  r003/1  gaa-gacc-t ->
  r003/2  tgcaaataa ->
  r004    ATAGCT............TCAG ->
  r005    <- TCGACACG, GTAGG ->, <- agacctt
  r006    ATATCACATCATTAGACACTA

Their corresponding RNF names (read tuples r001 to r005):
  SRN  LRN
  #1   sim__1__(1,1,F,01,10)__[single_end]
  #2   sim__2__(1,1,F,04,14),(1,1,R,31,39)__[paired_end]
  #3   sim__3__(1,2,F,09,17),(1,2,F,25,33)__[mate_pair]
  #4   sim__4__(1,1,F,15,36)__[spliced],C:[6=12N4=]
  #5   sim__5__(1,1,R,15,22),(1,1,F,25,29),(1,2,R,05,11)__[chimeric]
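Because an RNF name is a plain string with a fixed layout, it can be decoded mechanically. The Python sketch below is an illustration only and is not the RNFtools implementation. It assumes that the four parts are separated by double underscores, as in the example names, and that neither the prefix nor the read tuple ID contains a double underscore. The names Segment, RnfName and parse_rnf_name are hypothetical.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    genome_id: int
    chromosome_id: int
    direction: str  # "F", "R" or "N", as seen in the example names
    left: int       # leftmost coordinate
    right: int      # rightmost coordinate


class RnfName(NamedTuple):
    prefix: str
    read_tuple_id: str
    segments: List[Segment]
    suffix: str


# One segment: (genome ID, chromosome ID, direction, leftmost, rightmost)
_SEGMENT_RE = re.compile(r"\((\d+),(\d+),([FRN]),(\d+),(\d+)\)")


def parse_rnf_name(name: str) -> RnfName:
    """Split an RNF read name into its four parts (illustrative sketch)."""
    # Split at most three times so that the suffix is kept whole.
    # A name with fewer than four parts raises ValueError on unpacking.
    prefix, read_tuple_id, segments_part, suffix = name.split("__", 3)
    segments = [
        Segment(int(g), int(c), d, int(left), int(right))
        for g, c, d, left, right in _SEGMENT_RE.findall(segments_part)
    ]
    if not segments:
        raise ValueError("no segments found in read name: " + name)
    return RnfName(prefix, read_tuple_id, segments, suffix)


if __name__ == "__main__":
    example = ("sim__0043fd1__(3,13,F,027871,027970),"
               "(3,13,R,029171,029270)__[paired_end],C:[100=,42=1X47=]")
    print(parse_rnf_name(example))
```

Coordinates are converted to integers, so the zero padding of the original name is not preserved; a tool that needs to reproduce the exact name would have to keep the strings instead.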
The sixth read tuple, generated at random, has the name rnd__6__(2,0,N,00,00)__[random] (SRN #6).

LRN: long read name.
SRN: short read name. SRNs are used only if an LRN exceeds 255 characters (the maximum read name length allowed in SAM). In that case, an SRN-LRN correspondence file must be created.

Evaluation of read mappers using RNF-compatible programs

Read simulation: genomes 1 to n (FASTA) -> read simulator with RNF encoding -> reads (FASTQ)
Mapper evaluation: reads (FASTQ) -> mapper -> alignment (BAM) -> mapper evaluation tool with RNF decoding -> report (TXT/HTML)

RNFtools

Description: an associated software package of RNF-compatible programs, based on Snakemake [2]. All external programs that it employs are installed automatically when they are needed.

Prerequisites:
– Unix-like system (Linux, OS X, etc.)
– Python 3.2+

Installation using Easy Install:
> easy_install rnftools

Installation using Pip:
> pip install rnftools

Source codes and documentation:
http://github.com/karel-brinda/rnftools
http://rnftools.rtfd.org

References

[1] K. Břinda, V. Boeva, G. Kucherov. RNF: a general framework to evaluate NGS read mappers. arXiv:1504.00556 [q-bio.GN], 2015.
[2] J. Köster and S. Rahmann. Snakemake – a scalable bioinformatics workflow engine. Bioinformatics 28(19): 2520–2522, 2012.

RNFtools – example of usage

Steps:
1. Simulation of reads. 200,000 reads were simulated by DwgSim using MIShmash:
– 100,000 reads from a human genome (HG38),
– 100,000 reads from a mouse genome (MM10).
2. Mapping. All reads were mapped to HG38 by i) Yara, ii) BWA-MEM, iii) BWA-SW, and iv) Bowtie2.
3. Evaluation. The obtained BAM files were evaluated using LAVEnder.

Figures:
– Comparison of the mappers (BWA-MEM, BWA-SW, Bowtie2, Yara) with respect to correctly mapped reads. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads). y-axis: correctly mapped reads in all reads which should be mapped (#correctly mapped reads / #reads which should be mapped).
– Detailed graph for Yara. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads). y-axis: part of all reads (%).
– Detailed graph for BWA-MEM.
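The two ratios named on the axes of the comparison figure can be computed directly from read counts. The Python sketch below only illustrates the formulas as they are written on the axes; it is not taken from LAVEnder. The function names, the example counts and the choice to return 0.0 for an empty denominator are assumptions made for this illustration.

```python
def mapping_fdr(wrongly_mapped: int, mapped: int) -> float:
    """FDR in mapping: #wrongly mapped reads / #mapped reads."""
    if mapped == 0:
        # Assumption for this sketch: no mapped reads means no false discoveries.
        return 0.0
    return wrongly_mapped / mapped


def correctly_mapped_fraction(correctly_mapped: int, should_be_mapped: int) -> float:
    """#correctly mapped reads / #reads which should be mapped."""
    if should_be_mapped == 0:
        # Assumption for this sketch: nothing to map yields a fraction of 0.0.
        return 0.0
    return correctly_mapped / should_be_mapped


if __name__ == "__main__":
    # Hypothetical counts, not results from the poster.
    print(mapping_fdr(wrongly_mapped=5, mapped=1000))
    print(correctly_mapped_fraction(correctly_mapped=900, should_be_mapped=1000))
```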
Detailed graphs (panels: BWA-MEM, Yara)

x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads). y-axis: part of all reads (%).
Categories of reads shown: unmapped correctly; unmapped incorrectly; thresholded correctly; thresholded incorrectly; multimapped; mapped, should be unmapped; mapped to wrong position; mapped correctly.

Components:
i) MIShmash: pipeline that applies one of the popular read simulators (DwgSim, Art, Mason, CuReSim, etc.) and transforms the generated reads into RNF format.
ii) LAVEnder: tool for evaluating read mappers using RNF reads.
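MIShmash is described as transforming simulated reads into RNF format. The Python sketch below shows what such an encoding step could look like for a single read tuple. It is an illustration under stated assumptions, not the MIShmash code: the function name rnf_name and the coord_width parameter are hypothetical, the parts are joined by double underscores as in the example names, and coordinates are zero-padded to a fixed width.

```python
from typing import Iterable, Tuple

# (genome ID, chromosome ID, direction, leftmost coordinate, rightmost coordinate)
Segment = Tuple[int, int, str, int, int]


def rnf_name(prefix: str, read_tuple_id: str, segments: Iterable[Segment],
             suffix: str, coord_width: int = 2) -> str:
    """Build an RNF-style read name from its parts (illustrative sketch)."""
    encoded = []
    for genome_id, chromosome_id, direction, left, right in segments:
        if direction not in ("F", "R", "N"):
            raise ValueError("unexpected direction: " + direction)
        # Zero-pad both coordinates to the same width (hypothetical parameter).
        encoded.append(
            f"({genome_id},{chromosome_id},{direction},"
            f"{left:0{coord_width}d},{right:0{coord_width}d})"
        )
    return "__".join([prefix, read_tuple_id, ",".join(encoded), suffix])


if __name__ == "__main__":
    # Reproduces the paired-end example name from the poster.
    print(rnf_name("sim", "2",
                   [(1, 1, "F", 4, 14), (1, 1, "R", 31, 39)],
                   "[paired_end]"))
```

Keeping the coordinate width as a parameter lets all names generated for one genome share the same padding, so they can be compared or sorted as strings.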