Installation of MyPro Virtual Box https://www.virtualbox
Transcription
Installation of MyPro Virtual Box https://www.virtualbox
Installation of MyPro Download Virtual Box https://www.virtualbox.org/wiki/Downloads Open VirtualBox File -> Import Appliance... Select the file (MyPro.ova) to import Or, directly double click on MyPro.ova Please check the box of "Reinitialize the MAC address of all network cards" Import Click on Shared folders to add share (e.g. MyPro on your local computer) Check Auto-mount OK -> OK Start Devices -> Shared Clipboard --> Bidirectional 1 Open Terminal Make a new folder and mount to the shared folder mkdir Data sudo mount -t vboxsf MyPro Data (* key in the pw: manager) 2 Description of MyPro A. Pre-process This script is used to trim, pair and sub-sample your raw reads. A total of 100X (paired) reads are generated. This process is strongly recommended, otherwise much computational time is required for genome assembly. Command: Preprocess.py -read1 S.gordonii_G9B_TTAGGC_L001_R1_001.fastq S.gordonii_G9B_TTAGGC_L001_R2_001.fastq -g 2200000 -read2 B. AutoRun This script is used to perform Assemble, Integrate and Annotate. Command: AutoRun.py G9B -read1 50X_R1.fastq -read2 50X_R2.fastq -evaluate -p '--prefix G9B --genus Streptococcus --species gordonii --strain G9B --addgenes --locustag AA01' 3 '-evaluate' is used to turn on read-mapping function, so that the alignment rate can be obtained for evaluation. '-p' is used to describe the information for annotation. (please see Prokka for details) Alternatively, you can perform Assemble, Integrate and Annotate individually B.1 Assemble This scrip is to assemble reads with the five assemblers: VelvetOptimisier, Edena, Abyss, SOAPdenovo, and SPAdes Command: mkdir G9B Assemble.py G9B -read1 50X_R1.fastq -read2 50X_R2.fastq -evaluate -off abyss '-off abyss' is used to turn off the Abyss assembler if you don't want to execute it. B.2 Integrate This script is to conduct CISA for contig integration. Command: Integrate.py G9B -i abyss.ctg.fa,edena.ctg.fa,velvet.ctg.fa,soap.ctg.fa,spades.ctg.fa -evaluate '-evaluate' is used to turn on read-mapping function, so that the alignment rate can be obtained for evaluation. Insert size and read information are required (kmer_insert.txt in the Assemble folder). B.3 Annotate This script is to conduct Prokka for prokaryotic genome assembly. Command: Annotate.py G9B -i cisa.ctg.fa -p '--prefix G9B --genus Streptococcus --species gordonii --strain G9B --addgenes --locustag AA01' Please note that we have provided high quality reference genomes (Tatusova, et al., 2014) for genus database and updated Swiss-Prot database for Prokka annotation. 4 We have downloaded the 90 reference genomes (66 species) from http://www.ncbi.nlm.nih.gov/genome/browse/reference/ to built a genus db in MyPro. prokka-genbank_to_fasta_db speciesA.gbk > speciesA.faa cd-hit -i speciesA.faa -o SpeciesA -s 0.9 -c 1.0 makeblastdb -dbtype prot -in SpeciesA Similarly, all RefSeq bacterial genomes (656 species) were used to built a genus db which can be downloaded from sourceforge. Besides, we have updated the kingdom database of Prokka with Swiss-Prot data (546790 entries, Nov. 21, 2014). prokka-uniprot_to_fasta_db --term=Bacteria uniprot_sprot.dat > Bacteria/sprot prokka-uniprot_to_fasta_db --term=Archaea uniprot_sprot.dat > Archaea/sprot prokka-uniprot_to_fasta_db --term=Viruses uniprot_sprot.dat > Viruses/sprot C. Post-assemble This script is to (1) merge your ordered contigs if they are overlapped and (2) fill gaps with the contigs of local assembling. (Optional) To use r2cat.jar for ordering your contigs against a related reference genome. You can Export contigs order (FASTA) and Export unmatched contigs (FASTA) separately. 5 Command: Postassemble.py -o cisa.ordered.fa -u unmatched.fa -read1 ../50X_R1.fastq -read2 ../50X_R2.fastq No reference genome: Postassemble.py -u cisa.ctg.fa -read1 ../50X_R1.fastq -read2 ../50X_R2.fastq E. Explore You are able to use Tablet to explore the assembly with aligned reads. Double click on Tablet to Open Assembly: Please note that, in the module of Assemble, MyPro removes the contigs with less 6 than 100 reads to form xxx.ctg.fa, so please use raw.xxx.ctg.fa as reference when using Tablet. All scripts we made for MyPro (in Python) are placed in the folder of MyTools. Users are able to modify the scripts for their own purposes. Besides, the scripts of MyPro are available for download via sourceforge. 7 MyPro includes the following software in a Virtual Box, please make sure you have accepted their agreements if any. Name Download Reference Checkbox Abyss 1.5.2 http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/1.5.2 (Simpson, et al., 2009) □ Bio-Linux 8 http://environmentalomics.org/bio-linux-download/ (Field, et al., 2006) □ CISA 1.3 http://sb.nhri.org.tw/CISA (Lin and Liao, 2013) □ Edena V3.131028 http://www.genomic.ch/edena.php (Hernandez, et al., 2008) □ FastQC 0.10.1 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Prokka 1.10 http://www.vicbioinformatics.com/software.prokka.shtml (Seemann, 2014) □ r2cat http://bibiserv2.cebitec.uni-bielefeld.de/cgcat?viewType=download (Husemann and Stoye, 2010) □ SOAP2 2.21 http://soap.genomics.org.cn/soapaligner.html (Li, et al., 2009) □ SOAPdenovo 2.04 http://sourceforge.net/projects/soapdenovo2/files/SOAPdenovo2/bin/ (Luo, et al., 2012) □ SPAdes 3.1.1 http://bioinf.spbau.ru/spades (Bankevich, et al., 2012) □ Tablet 1.14.04.10 http://ics.hutton.ac.uk/tablet/download-tablet/ (Milne, et al., 2013) □ VelvetOptimiser 2.2.5 http://bioinformatics.net.au/software.velvetoptimiser.shtml (Zerbino and Birney, 2008) □ Please note that Abyss was made with maxk=128, Velvet was made with MAXKERLENGTH=127. 8 □ Validation of MyPro on three bacterial species Dataset: E. coli MG1655 Paired reads of E. coli MG1655 are available at Illumina website. Mate1 and Mate 2 were downloaded separately. Mate1: ftp://webdata:[email protected]/Data/SequencingRuns/MG1655/MiSeq _Ecoli_MG1655_110721_PF_R1.fastq.gz Mate2: ftp://webdata:[email protected]/Data/SequencingRuns/MG1655/MiSeq _Ecoli_MG1655_110721_PF_R2.fastq.gz Preprocess.py -read1 MiSeq_Ecoli_MG1655_110721_PF_R1.fastq MiSeq_Ecoli_MG1655_110721_PF_R2.fastq -g 4650000 -read2 This process took about 40 min. AutoRun.py MG1655 -read1 50X_R1.fastq -read2 50X_R2.fastq This process took about 4 hr for running in a VirtualBox with 16GB RAM @ Dell Precisions Workstations T1600 Computer Workstation (Quad Core Xeon E3-1245, 3.30 GHz with 32GB RAM) n50: soap.ctg.fa: 15329 edena.ctg.fa: 21492 abyss.ctg.fa: 21569 velvet.ctg.fa: 27298 spades.ctg.fa: 87014 Ctgs: soap.ctg.fa: 526 edena.ctg.fa: 375 abyss.ctg.fa: 364 velvet.ctg.fa: 326 spades.ctg.fa: 121 9 The longest ctg's length: soap.ctg.fa: 58371 edena.ctg.fa: 68181 velvet.ctg.fa: 85370 abyss.ctg.fa: 85610 spades.ctg.fa: 224038 Alignment %: raw.soap.ctg.fa: 98.48 raw.velvet.ctg.fa: 98.50 raw.abyss.ctg.fa: 98.67 raw.spades.ctg.fa: 98.81 raw.edena.ctg.fa: 98.96 cisa.ctg.fa Alignment:99.34% whole:4625288 N50: 88512 Number of contigs: 104 Length of the longest contig: 315566 Use r2cat to align cisa.ctg.fa against a reference (Ecoli DH10B, NC_010473), then export the ordered assembly. Click on r2cat.far located on Desktop File --> Match new Query: cisa.ctg.fa Target: NC_010473.fna Start Matching Continue Options --> Sort queries File --> Export contigs order (FASTA) Postassemble.py -o cisa.ctg.ordered.fa -read1 ../ 50X_R1.fastq -read2 ../50X_R2.fastq Alignment:99.32% N50.py Bridged.ctg.fa whole:4608751 N50: 105185 10 Number of contigs: 71 Length of the longest contig: 335840 Dot plot against the reference genome (NC_000913) by r2cat: Quast 2.3 Evaluation: quast.py -o quast -R NC_000913.fna -G NC_000913.gff raw.abyss.ctg.fa raw.edena.ctg.fa raw.velvet.ctg.fa raw.soap.ctg.fa raw.spades.ctg.fa cisa.ctg.fa Bridged.ctg.fa NC_000913.fna and NC_000913.gff can be downloaded from here. To copy Bridged.ctg.fa to the folder of Assemble cp Bridged.ctg.fa Assemble/ 11 Annotate.py MG1655 -i Bridged.ctg.fa -p ' --genus Escherichia --species coli --strain MG1655 --prefix post' organism: Escherichia coli MG1655 contigs: 71 bases: 4608751 rRNA: 8 tmRNA: 1 tRNA: 74 CDS: 4309 repeat_region: 2 Detailed log: post.log Walltime used: 8.00 minutes Because we have updated the databases for Prokka, the number of hypothetical proteins was reduced from 668 to 657 (Swiss-Prot update) and to 414 (High quality genus database). Meanwhile, the running time was minor increased from 13.13 min to 13.73 min (Swiss-Prot update) but greatly decreased to 8 min (using genus db). If RefSeq bacterial genomes were used as genus database, the number of hypothetical proteins was reduced to 369 but the running time increased to more than 20 min. ============================================================= 12 AutoRun.py MG1655_raw -read1 MiSeq_Ecoli_MG1655_110721_PF_R1.fastq -read2 MiSeq_Ecoli_MG1655_110721_PF_R2.fastq The size of raw reads is about 4GB. This process took around12 hr with a linux server! IBM X3850 Intel Xeon E7-4820, 2.00GHz with 256 GB of RAM. n50: soap.ctg.fa: 35936 edena.ctg.fa: 62035 abyss.ctg.fa: 129004 spades.ctg.fa: 139882 velvet.ctg.fa: 148743 Ctgs: soap.ctg.fa: 257 edena.ctg.fa: 156 spades.ctg.fa: 90 velvet.ctg.fa: 89 abyss.ctg.fa: 85 The longest ctg's length: soap.ctg.fa: 124043 edena.ctg.fa: 149834 velvet.ctg.fa: 265585 spades.ctg.fa: 285684 abyss.ctg.fa: 285832 Alignment %: raw.soap.ctg.fa Alignment:82.53% raw.velvet.ctg.fa Alignment:83.45% raw.spades.ctg.fa Alignment:83.60% raw.abyss.ctg.fa Alignment:83.83% raw.edena.ctg.fa Alignment:83.95% cisa.ctg.fa Alignment:83.93% whole:4640526 N50: 150749 Number of contigs: 59 Length of the longest contig: 315844 Use r2cat to align cisa.ctg.fa against NC_010473, then export the ordered assembly. 13 Postassemble.py -o cisa.ordered.fa -read1 ../MiSeq_Ecoli_MG1655_110721_PF_R1.fastq -read2 ../MiSeq_Ecoli_MG1655_110721_PF_R2.fastq N50.py Bridged.ctg.fa whole:4619352 N50: 266212 Number of contigs: 35 Length of the longest contig: 624650 Dot plot against the reference genome (NC_000913) by r2cat: This is the only case presented here by testing MyPro on a linux server, other examples were run in a Virtual Box (16GB RAM) on a windows computer. 14 Another dataset of E. coli MG1655 was downloaed from: http://systems.illumina.com/systems/miseq/performance_specifications.html E. coli MG1655 prepared with TruSeq Nano library prep kit (New dataset - 2 x 300 bp). E. coli strain MG1655 was sequenced on the MiSeq System using the new MiSeq Reagent Kit v3 and a read length configuration of 2 x 300 bp. The sequence data analysis showed 83% of bases above Q30. Preprocess.py -read1 Ecoli_S1_L001_R1_001.fastq -read2 Ecoli_S1_L001_R2_001.fastq -g 4650000 It took about 4 hr for the pre-process! AutoRun.py MG1655_new -read1 50X_R1.fastq -read2 50X_R2.fastq 2014-10-01,02:14 ==> 2014-10-01,06:09 n50: soap.ctg.fa: 25999 velvet.ctg.fa: 26577 edena.ctg.fa: 41843 abyss.ctg.fa: 47782 spades.ctg.fa: 117761 Ctgs: soap.ctg.fa: 315 velvet.ctg.fa: 306 abyss.ctg.fa: 186 edena.ctg.fa: 177 spades.ctg.fa: 87 The longest ctg's length: velvet.ctg.fa: 86828 soap.ctg.fa: 91293 edena.ctg.fa: 130338 abyss.ctg.fa: 133833 spades.ctg.fa: 274095 Alignment %: raw.velvet.ctg.fa: 95.08 raw.soap.ctg.fa: 95.71 raw.edena.ctg.fa: 98.56 raw.spades.ctg.fa: 98.80 raw.abyss.ctg.fa: 99.05 15 cisa.ctg.fa Alignment:99.44% whole:4613259 N50: 119293 Number of contigs: 74 Length of the longest contig: 275305 Use r2cat to align cisa.ctg.fa against NC_010473, then export the ordered assembly. Postassemble.py -o cisa.ctg.ordered.fa -read1 ../ 50X_R1.fastq -read2 ../50X_R2.fastq N50.py Bridged.ctg.fa whole:4599468 N50: 177141 Number of contigs: 48 Length of the longest contig: 292878 Dot plot against the reference genome (NC_000913) by r2cat: 16 Quast 2.3 Evaluation quast.py -o quast -R NC_000913.fna -G NC_000913.gff raw.abyss.ctg.fa raw.edena.ctg.fa raw.velvet.ctg.fa raw.soap.ctg.fa raw.spades.ctg.fa cisa.ctg.fa Bridged.ctg.fa 17 Dataset: S. aureus Strain: S. aureus COL Sequencing reads were downloaded from http://www.ebi.ac.uk/ena/data/view/ERR580967 Preprocess.py -read1 ERR580967_1.fastq -read2 ERR580967_2.fastq -g 2800000 This process took about 12 min! Paired reads were trimmed to 140 bp. AutoRun.py SA -read1 50X_R1.fastq -read2 50X_R2.fastq 2014-11-12,06:11 ==> 2014-11-12,08:06 n50: soap.ctg.fa: 51885 edena.ctg.fa: 67066 abyss.ctg.fa: 69203 velvet.ctg.fa: 98413 spades.ctg.fa: 150416 Ctgs: soap.ctg.fa: 136 edena.ctg.fa: 108 abyss.ctg.fa: 87 velvet.ctg.fa: 57 spades.ctg.fa: 56 The longest ctg's length: soap.ctg.fa: 135661 edena.ctg.fa: 156507 abyss.ctg.fa: 163038 spades.ctg.fa: 300973 velvet.ctg.fa: 325720 WholeGenome: spades.ctg.fa: 2779025 soap.ctg.fa: 2779369 18 edena.ctg.fa: 2779839 velvet.ctg.fa: 2787009 abyss.ctg.fa: 2803411 Alignment %: raw.soap.ctg.fa: 87.03 raw.velvet.ctg.fa: 98.44 raw.spades.ctg.fa: 99.11 raw.edena.ctg.fa: 99.42 raw.abyss.ctg.fa: 99.55 Integrate.py SA -i abyss.ctg.fa,edena.ctg.fa,velvet.ctg.fa,spades.ctg.fa -gs 3200000 -evaluate cisa.ctg.fa Alignment:99.76% whole:2885596 N50: 224823 Number of contigs: 34 Length of the longest contig: 452546 Select a reference and download from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Staphylococcus_aureus_Newman_uid58839/ Use r2cat to align cisa.ctg.fa against NC_009641, then export the ordered assembly and the unmatched contig. 19 Postassemble.py -o cisa.ordered.fa -u unmatched.fa -read1 ../50X_R1.fastq -read2 ../50X_R2.fastq N50.py Bridged.ctg.fa whole:2870598 N50: 335868 Number of contigs: 26 Length of the longest contig: 452546 Alignment:99.76% Dot plot against the reference genome (NC_002951) by r2cat: Quast 2.3 Evaluation quast.py -o quast -R NC_002951.fna -G NC_002951.gff raw.abyss.ctg.fa raw.edena.ctg.fa raw.velvet.ctg.fa raw.soap.ctg.fa raw.spades.ctg.fa cisa.ctg.fa Bridged.ctg.fa 20 21 Strain: S. aureus MW2 Sequencing reads were downloaded from ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR857/SRR85 7914 Preprocess.py -read1 SRR857914_1.fastq -read2 SRR857914_2.fastq -g 2800000 This process took about 20 min. AutoRun.py MW2 -read1 50X_R1.fastq -read2 50X_R2.fastq 2014-11-21,02:27 ==> 2014-11-21,04:34 n50: soap.ctg.fa: 69354 spades.ctg.fa: 147491 edena.ctg.fa: 268019 abyss.ctg.fa: 383680 velvet.ctg.fa: 1413768 Ctgs: soap.ctg.fa: 77 spades.ctg.fa: 39 abyss.ctg.fa: 18 edena.ctg.fa: 18 velvet.ctg.fa: 8 The longest ctg's length: soap.ctg.fa: 138695 spades.ctg.fa: 410165 edena.ctg.fa: 539806 abyss.ctg.fa: 940710 velvet.ctg.fa: 1413768 WholeGenome: soap.ctg.fa: 2783028 spades.ctg.fa: 2805095 velvet.ctg.fa: 2815477 edena.ctg.fa: 2816795 abyss.ctg.fa: 2826349 Alignment %: raw.soap.ctg.fa: 94.97 raw.abyss.ctg.fa: 95.17 raw.spades.ctg.fa: 99.31 22 raw.velvet.ctg.fa: 99.46 raw.edena.ctg.fa: 99.59 Integrate.py MW2 -i abyss.ctg.fa,edena.ctg.fa cisa.ctg.fa Alignment:93.86% whole:2768641 N50: 1419734 Number of contigs: 5 Length of the longest contig: 1419734 To increase genome size for assembly integration: Integrate.py MW2 -i abyss.ctg.fa,edena.ctg.fa,velvet.ctg.fa,spades.ctg.fa -evaluate -gs 3000000 cisa.ctg.fa Alignment:99.51% whole:2830305 N50: 1419734 Number of contigs: 7 Length of the longest contig: 1419734 Postassemble.py -o cisa.ordered.fa -read1 ../../50X_R1.fastq -read2 ../../50X_R2.fastq -m 423 -s 71 N50.py Bridged.ctg.fa whole:2830305 N50: 1419734 Number of contigs: 7 23 Length of the longest contig: 1419734 Alignment:99.51% Identical assembly was obtained (to cisa.ctg.fa) after post-assembly! Quast 2.3 Evaluation quast.py -o quast -R NC_003923.fna -G NC_003923.gff raw.abyss.ctg.fa raw.edena.ctg.fa raw.velvet.ctg.fa raw.soap.ctg.fa raw.spades.ctg.fa cisa.ctg.fa 24 Dataset: R. sphaeroides 2.4.1 http://systems.illumina.com/systems/miseq/scientific_data.html The B. cereus ATCC 10987 & R. sphaeroides ATCC BAA-808 samples were sequenced on the MiSeq System using the new MiSeq Reagent Kit v3 at a 2 x 300 bp read length configuration with dual indexing. The total yield was 17.8 Gb with 70% of bases at or above Q30. The average fragment length for this data set is 475 bp. fastqc Rhodo_S2_L001_R2_001.fastq Preprocess.py -read1 Rhodo_S2_L001_R1_001.fastq -read2 Rhodo_S2_L001_R2_001.fastq -g 4600000 This process took about 4 hr! Paired reads were trimmed to 200 bp. AutoRun.py Miseq300 -read1 50X_R1.fastq -read2 50X_R2.fastq 2014-11-14,00:23 ==> 2014-11-14,04:19 n50: soap.ctg.fa: 19291 velvet.ctg.fa: 20714 edena.ctg.fa: 22179 25 abyss.ctg.fa: 22737 spades.ctg.fa: 77772 Ctgs: soap.ctg.fa: 403 velvet.ctg.fa: 373 abyss.ctg.fa: 345 edena.ctg.fa: 293 spades.ctg.fa: 142 The longest ctg's length: soap.ctg.fa: 69823 abyss.ctg.fa: 97096 edena.ctg.fa: 97244 velvet.ctg.fa: 97422 spades.ctg.fa: 320329 WholeGenome: edena.ctg.fa: 4014722 abyss.ctg.fa: 4292653 soap.ctg.fa: 4339440 velvet.ctg.fa: 4369665 spades.ctg.fa: 4429967 Alignment %: raw.edena.ctg.fa: 95.82 raw.abyss.ctg.fa: 97.90 raw.soap.ctg.fa: 98.19 raw.velvet.ctg.fa: 98.21 raw.spades.ctg.fa: 98.24 Keep all assemblies. cisa.ctg.fa Alignment:98.26% whole:4401027 N50: 79366 Number of contigs: 130 Length of the longest contig: 320634 r2cat against the reference R.sphaeroides 2.4.1 (http://sb.nhri.org.tw/CISA/upload/en/2013/8/R_sphaeroides.fna-01021413.gz). 26 Postassemble.py -o ordered.cisa.fa N50.py Bridged.ctg.fa whole:4391111 N50: 87718 Number of contigs: 120 Length of the longest contig: 320634 Alignment:98.26% 27 References Bankevich, A., et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing, Journal of computational biology : a journal of computational molecular cell biology, 19, 455-477. Field, D., et al. (2006) Open software for biologists: from famine to feast, Nature biotechnology, 24, 801-803. Hernandez, D., et al. (2008) De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer, Genome Res, 18, 802-809. Husemann, P. and Stoye, J. (2010) r2cat: synteny plots and comparative assembly, Bioinformatics, 26, 570-571. Li, R., et al. (2009) SOAP2: an improved ultrafast tool for short read alignment, Bioinformatics, 25, 1966-1967. Lin, S.H. and Liao, Y.C. (2013) CISA: contig integrator for sequence assembly of bacterial genomes, PLoS One, 8, e60843. Luo, R., et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler, GigaScience, 1, 18. Milne, I., et al. (2013) Using Tablet for visual exploration of second-generation sequencing data, Brief Bioinform, 14, 193-202. Seemann, T. (2014) Prokka: rapid prokaryotic genome annotation, Bioinformatics. Simpson, J.T., et al. (2009) ABySS: A parallel assembler for short read sequence data, Genome Research, 19, 1117-1123. Tatusova, T., et al. (2014) RefSeq microbial genomes database: new representation and annotation strategy, Nucleic acids research, 42, D553-559. Zerbino, D.R. and Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res, 18, 821-829. 28