Installation of MyPro Virtual Box https://www.virtualbox

Transcription

Installation of MyPro Virtual Box https://www.virtualbox
Installation of MyPro
Download Virtual Box
https://www.virtualbox.org/wiki/Downloads
Open VirtualBox
File -> Import Appliance...
Select the file (MyPro.ova) to import
Or, directly double click on MyPro.ova
Please check the box of "Reinitialize the MAC address of all network cards"
Import
Click on Shared folders to add share (e.g. MyPro on your local computer)
Check Auto-mount
OK -> OK
Start
Devices -> Shared Clipboard --> Bidirectional
1
Open Terminal
Make a new folder and mount to the shared folder
mkdir Data
sudo mount -t vboxsf MyPro Data
(* key in the pw: manager)
2
Description of MyPro
A. Pre-process
This script is used to trim, pair and sub-sample your raw reads. A total of 100X
(paired) reads are generated. This process is strongly recommended, otherwise much
computational time is required for genome assembly.
Command:
Preprocess.py -read1 S.gordonii_G9B_TTAGGC_L001_R1_001.fastq
S.gordonii_G9B_TTAGGC_L001_R2_001.fastq -g 2200000
-read2
B. AutoRun
This script is used to perform Assemble, Integrate and Annotate.
Command:
AutoRun.py G9B -read1 50X_R1.fastq -read2 50X_R2.fastq -evaluate -p '--prefix
G9B --genus Streptococcus --species gordonii --strain G9B --addgenes --locustag
AA01'
3
'-evaluate' is used to turn on read-mapping function, so that the alignment rate can be
obtained for evaluation.
'-p' is used to describe the information for annotation. (please see Prokka for details)
Alternatively, you can perform Assemble, Integrate and Annotate individually
B.1 Assemble
This scrip is to assemble reads with the five assemblers: VelvetOptimisier, Edena,
Abyss, SOAPdenovo, and SPAdes
Command:
mkdir G9B
Assemble.py G9B -read1 50X_R1.fastq -read2 50X_R2.fastq -evaluate -off abyss
'-off abyss' is used to turn off the Abyss assembler if you don't want to execute it.
B.2 Integrate
This script is to conduct CISA for contig integration.
Command:
Integrate.py G9B -i abyss.ctg.fa,edena.ctg.fa,velvet.ctg.fa,soap.ctg.fa,spades.ctg.fa
-evaluate
'-evaluate' is used to turn on read-mapping function, so that the alignment rate can be
obtained for evaluation. Insert size and read information are required (kmer_insert.txt
in the Assemble folder).
B.3 Annotate
This script is to conduct Prokka for prokaryotic genome assembly.
Command:
Annotate.py G9B -i cisa.ctg.fa -p '--prefix G9B --genus Streptococcus --species
gordonii --strain G9B --addgenes --locustag AA01'
Please note that we have provided high quality reference genomes (Tatusova, et al.,
2014) for genus database and updated Swiss-Prot database for Prokka annotation.
4
We have downloaded the 90 reference genomes (66 species) from
http://www.ncbi.nlm.nih.gov/genome/browse/reference/ to built a genus db in MyPro.
prokka-genbank_to_fasta_db speciesA.gbk > speciesA.faa
cd-hit -i speciesA.faa -o SpeciesA -s 0.9 -c 1.0
makeblastdb -dbtype prot -in SpeciesA
Similarly, all RefSeq bacterial genomes (656 species) were used to built a genus db
which can be downloaded from sourceforge.
Besides, we have updated the kingdom database of Prokka with Swiss-Prot data
(546790 entries, Nov. 21, 2014).
prokka-uniprot_to_fasta_db --term=Bacteria uniprot_sprot.dat > Bacteria/sprot
prokka-uniprot_to_fasta_db --term=Archaea uniprot_sprot.dat > Archaea/sprot
prokka-uniprot_to_fasta_db --term=Viruses uniprot_sprot.dat > Viruses/sprot
C. Post-assemble
This script is to (1) merge your ordered contigs if they are overlapped and (2) fill gaps
with the contigs of local assembling.
(Optional) To use r2cat.jar for ordering your contigs against a related reference
genome. You can Export contigs order (FASTA) and Export unmatched contigs
(FASTA) separately.
5
Command:
Postassemble.py
-o
cisa.ordered.fa
-u
unmatched.fa
-read1
../50X_R1.fastq
-read2 ../50X_R2.fastq
No reference genome:
Postassemble.py -u cisa.ctg.fa -read1 ../50X_R1.fastq -read2 ../50X_R2.fastq
E. Explore
You are able to use Tablet to explore the assembly with aligned reads.
Double click on Tablet
to Open Assembly:
Please note that, in the module of Assemble, MyPro removes the contigs with less
6
than 100 reads to form xxx.ctg.fa, so please use raw.xxx.ctg.fa as reference when
using Tablet.
All scripts we made for MyPro (in Python) are placed in the folder of MyTools. Users
are able to modify the scripts for their own purposes. Besides, the scripts of MyPro
are available for download via sourceforge.
7
MyPro includes the following software in a Virtual Box, please make sure you have accepted their agreements if any.
Name
Download
Reference
Checkbox
Abyss 1.5.2
http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/1.5.2
(Simpson, et al., 2009)
□
Bio-Linux 8
http://environmentalomics.org/bio-linux-download/
(Field, et al., 2006)
□
CISA 1.3
http://sb.nhri.org.tw/CISA
(Lin and Liao, 2013)
□
Edena V3.131028
http://www.genomic.ch/edena.php
(Hernandez, et al., 2008)
□
FastQC 0.10.1
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Prokka 1.10
http://www.vicbioinformatics.com/software.prokka.shtml
(Seemann, 2014)
□
r2cat
http://bibiserv2.cebitec.uni-bielefeld.de/cgcat?viewType=download
(Husemann and Stoye, 2010)
□
SOAP2 2.21
http://soap.genomics.org.cn/soapaligner.html
(Li, et al., 2009)
□
SOAPdenovo 2.04
http://sourceforge.net/projects/soapdenovo2/files/SOAPdenovo2/bin/
(Luo, et al., 2012)
□
SPAdes 3.1.1
http://bioinf.spbau.ru/spades
(Bankevich, et al., 2012)
□
Tablet 1.14.04.10
http://ics.hutton.ac.uk/tablet/download-tablet/
(Milne, et al., 2013)
□
VelvetOptimiser 2.2.5
http://bioinformatics.net.au/software.velvetoptimiser.shtml
(Zerbino and Birney, 2008)
□
Please note that Abyss was made with maxk=128, Velvet was made with MAXKERLENGTH=127.
8
□
Validation of MyPro on three bacterial species
Dataset: E. coli MG1655
Paired reads of E. coli MG1655 are available at Illumina website. Mate1 and Mate 2
were downloaded separately.
Mate1:
ftp://webdata:[email protected]/Data/SequencingRuns/MG1655/MiSeq
_Ecoli_MG1655_110721_PF_R1.fastq.gz
Mate2:
ftp://webdata:[email protected]/Data/SequencingRuns/MG1655/MiSeq
_Ecoli_MG1655_110721_PF_R2.fastq.gz
Preprocess.py
-read1
MiSeq_Ecoli_MG1655_110721_PF_R1.fastq
MiSeq_Ecoli_MG1655_110721_PF_R2.fastq -g 4650000
-read2
This process took about 40 min.
AutoRun.py MG1655 -read1 50X_R1.fastq -read2 50X_R2.fastq
This process took about 4 hr for running in a VirtualBox with 16GB RAM @ Dell
Precisions Workstations T1600 Computer Workstation (Quad Core Xeon E3-1245,
3.30 GHz with 32GB RAM)
n50:
soap.ctg.fa: 15329
edena.ctg.fa: 21492
abyss.ctg.fa: 21569
velvet.ctg.fa: 27298
spades.ctg.fa: 87014
Ctgs:
soap.ctg.fa: 526
edena.ctg.fa: 375
abyss.ctg.fa: 364
velvet.ctg.fa: 326
spades.ctg.fa: 121
9
The longest ctg's length:
soap.ctg.fa: 58371
edena.ctg.fa: 68181
velvet.ctg.fa: 85370
abyss.ctg.fa: 85610
spades.ctg.fa: 224038
Alignment %:
raw.soap.ctg.fa: 98.48
raw.velvet.ctg.fa: 98.50
raw.abyss.ctg.fa: 98.67
raw.spades.ctg.fa: 98.81
raw.edena.ctg.fa: 98.96
cisa.ctg.fa Alignment:99.34%
whole:4625288
N50: 88512
Number of contigs: 104
Length of the longest contig: 315566
Use r2cat to align cisa.ctg.fa against a reference (Ecoli DH10B, NC_010473), then
export the ordered assembly.
Click on r2cat.far located on Desktop
File --> Match new
Query: cisa.ctg.fa
Target: NC_010473.fna
Start Matching
Continue
Options --> Sort queries
File --> Export contigs order (FASTA)
Postassemble.py -o cisa.ctg.ordered.fa -read1 ../ 50X_R1.fastq
-read2 ../50X_R2.fastq
Alignment:99.32%
N50.py Bridged.ctg.fa
whole:4608751
N50: 105185
10
Number of contigs: 71
Length of the longest contig: 335840
Dot plot against the reference genome (NC_000913) by r2cat:
Quast 2.3 Evaluation:
quast.py -o quast -R NC_000913.fna -G NC_000913.gff raw.abyss.ctg.fa
raw.edena.ctg.fa raw.velvet.ctg.fa raw.soap.ctg.fa raw.spades.ctg.fa cisa.ctg.fa
Bridged.ctg.fa
NC_000913.fna and NC_000913.gff can be downloaded from here.
To copy Bridged.ctg.fa to the folder of Assemble
cp Bridged.ctg.fa Assemble/
11
Annotate.py MG1655 -i Bridged.ctg.fa -p ' --genus Escherichia --species coli --strain
MG1655 --prefix post'
organism: Escherichia coli MG1655
contigs: 71
bases: 4608751
rRNA: 8
tmRNA: 1
tRNA: 74
CDS: 4309
repeat_region: 2
Detailed log: post.log
Walltime used: 8.00 minutes
Because we have updated the databases for Prokka, the number of hypothetical
proteins was reduced from 668 to 657 (Swiss-Prot update) and to 414 (High quality
genus database). Meanwhile, the running time was minor increased from 13.13 min to
13.73 min (Swiss-Prot update) but greatly decreased to 8 min (using genus db). If
RefSeq bacterial genomes were used as genus database, the number of hypothetical
proteins was reduced to 369 but the running time increased to more than 20 min.
=============================================================
12
AutoRun.py MG1655_raw -read1 MiSeq_Ecoli_MG1655_110721_PF_R1.fastq
-read2 MiSeq_Ecoli_MG1655_110721_PF_R2.fastq
The size of raw reads is about 4GB. This process took around12 hr with a linux server!
IBM X3850 Intel Xeon E7-4820, 2.00GHz with 256 GB of RAM.
n50:
soap.ctg.fa: 35936
edena.ctg.fa: 62035
abyss.ctg.fa: 129004
spades.ctg.fa: 139882
velvet.ctg.fa: 148743
Ctgs:
soap.ctg.fa: 257
edena.ctg.fa: 156
spades.ctg.fa: 90
velvet.ctg.fa: 89
abyss.ctg.fa: 85
The longest ctg's length:
soap.ctg.fa: 124043
edena.ctg.fa: 149834
velvet.ctg.fa: 265585
spades.ctg.fa: 285684
abyss.ctg.fa: 285832
Alignment %:
raw.soap.ctg.fa Alignment:82.53%
raw.velvet.ctg.fa Alignment:83.45%
raw.spades.ctg.fa Alignment:83.60%
raw.abyss.ctg.fa Alignment:83.83%
raw.edena.ctg.fa Alignment:83.95%
cisa.ctg.fa Alignment:83.93%
whole:4640526
N50: 150749
Number of contigs: 59
Length of the longest contig: 315844
Use r2cat to align cisa.ctg.fa against NC_010473, then export the ordered assembly.
13
Postassemble.py -o cisa.ordered.fa
-read1 ../MiSeq_Ecoli_MG1655_110721_PF_R1.fastq
-read2 ../MiSeq_Ecoli_MG1655_110721_PF_R2.fastq
N50.py Bridged.ctg.fa
whole:4619352
N50: 266212
Number of contigs: 35
Length of the longest contig: 624650
Dot plot against the reference genome (NC_000913) by r2cat:
This is the only case presented here by testing MyPro on a linux server, other
examples were run in a Virtual Box (16GB RAM) on a windows computer.
14
Another dataset of E. coli MG1655 was downloaed from:
http://systems.illumina.com/systems/miseq/performance_specifications.html
E. coli MG1655 prepared with TruSeq Nano library prep kit (New dataset - 2 x 300
bp). E. coli strain MG1655 was sequenced on the MiSeq System using the new MiSeq
Reagent Kit v3 and a read length configuration of 2 x 300 bp. The sequence data
analysis showed 83% of bases above Q30.
Preprocess.py -read1 Ecoli_S1_L001_R1_001.fastq -read2
Ecoli_S1_L001_R2_001.fastq -g 4650000
It took about 4 hr for the pre-process!
AutoRun.py MG1655_new -read1 50X_R1.fastq -read2 50X_R2.fastq
2014-10-01,02:14 ==> 2014-10-01,06:09
n50:
soap.ctg.fa: 25999
velvet.ctg.fa: 26577
edena.ctg.fa: 41843
abyss.ctg.fa: 47782
spades.ctg.fa: 117761
Ctgs:
soap.ctg.fa: 315
velvet.ctg.fa: 306
abyss.ctg.fa: 186
edena.ctg.fa: 177
spades.ctg.fa: 87
The longest ctg's length:
velvet.ctg.fa: 86828
soap.ctg.fa: 91293
edena.ctg.fa: 130338
abyss.ctg.fa: 133833
spades.ctg.fa: 274095
Alignment %:
raw.velvet.ctg.fa: 95.08
raw.soap.ctg.fa: 95.71
raw.edena.ctg.fa: 98.56
raw.spades.ctg.fa: 98.80
raw.abyss.ctg.fa: 99.05
15
cisa.ctg.fa Alignment:99.44%
whole:4613259
N50: 119293
Number of contigs: 74
Length of the longest contig: 275305
Use r2cat to align cisa.ctg.fa against NC_010473, then export the ordered assembly.
Postassemble.py -o cisa.ctg.ordered.fa -read1 ../ 50X_R1.fastq
-read2 ../50X_R2.fastq
N50.py Bridged.ctg.fa
whole:4599468
N50: 177141
Number of contigs: 48
Length of the longest contig: 292878
Dot plot against the reference genome (NC_000913) by r2cat:
16
Quast 2.3 Evaluation
quast.py -o quast -R NC_000913.fna -G NC_000913.gff raw.abyss.ctg.fa
raw.edena.ctg.fa raw.velvet.ctg.fa raw.soap.ctg.fa raw.spades.ctg.fa cisa.ctg.fa
Bridged.ctg.fa
17
Dataset: S. aureus
Strain: S. aureus COL
Sequencing reads were downloaded from
http://www.ebi.ac.uk/ena/data/view/ERR580967
Preprocess.py -read1 ERR580967_1.fastq -read2 ERR580967_2.fastq -g 2800000
This process took about 12 min! Paired reads were trimmed to 140 bp.
AutoRun.py SA -read1 50X_R1.fastq -read2 50X_R2.fastq
2014-11-12,06:11 ==> 2014-11-12,08:06
n50:
soap.ctg.fa: 51885
edena.ctg.fa: 67066
abyss.ctg.fa: 69203
velvet.ctg.fa: 98413
spades.ctg.fa: 150416
Ctgs:
soap.ctg.fa: 136
edena.ctg.fa: 108
abyss.ctg.fa: 87
velvet.ctg.fa: 57
spades.ctg.fa: 56
The longest ctg's length:
soap.ctg.fa: 135661
edena.ctg.fa: 156507
abyss.ctg.fa: 163038
spades.ctg.fa: 300973
velvet.ctg.fa: 325720
WholeGenome:
spades.ctg.fa: 2779025
soap.ctg.fa: 2779369
18
edena.ctg.fa: 2779839
velvet.ctg.fa: 2787009
abyss.ctg.fa: 2803411
Alignment %:
raw.soap.ctg.fa: 87.03
raw.velvet.ctg.fa: 98.44
raw.spades.ctg.fa: 99.11
raw.edena.ctg.fa: 99.42
raw.abyss.ctg.fa: 99.55
Integrate.py SA -i abyss.ctg.fa,edena.ctg.fa,velvet.ctg.fa,spades.ctg.fa -gs 3200000
-evaluate
cisa.ctg.fa Alignment:99.76%
whole:2885596
N50: 224823
Number of contigs: 34
Length of the longest contig: 452546
Select a reference and download from
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Staphylococcus_aureus_Newman_uid58839/
Use r2cat to align cisa.ctg.fa against NC_009641, then export the ordered assembly
and the unmatched contig.
19
Postassemble.py -o cisa.ordered.fa -u unmatched.fa -read1 ../50X_R1.fastq
-read2 ../50X_R2.fastq
N50.py Bridged.ctg.fa
whole:2870598
N50: 335868
Number of contigs: 26
Length of the longest contig: 452546
Alignment:99.76%
Dot plot against the reference genome (NC_002951) by r2cat:
Quast 2.3 Evaluation
quast.py -o quast -R NC_002951.fna -G NC_002951.gff raw.abyss.ctg.fa
raw.edena.ctg.fa raw.velvet.ctg.fa raw.soap.ctg.fa raw.spades.ctg.fa cisa.ctg.fa
Bridged.ctg.fa
20
21
Strain: S. aureus MW2
Sequencing reads were downloaded from
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR857/SRR85
7914
Preprocess.py -read1 SRR857914_1.fastq -read2 SRR857914_2.fastq -g 2800000
This process took about 20 min.
AutoRun.py MW2 -read1 50X_R1.fastq -read2 50X_R2.fastq
2014-11-21,02:27 ==> 2014-11-21,04:34
n50:
soap.ctg.fa: 69354
spades.ctg.fa: 147491
edena.ctg.fa: 268019
abyss.ctg.fa: 383680
velvet.ctg.fa: 1413768
Ctgs:
soap.ctg.fa: 77
spades.ctg.fa: 39
abyss.ctg.fa: 18
edena.ctg.fa: 18
velvet.ctg.fa: 8
The longest ctg's length:
soap.ctg.fa: 138695
spades.ctg.fa: 410165
edena.ctg.fa: 539806
abyss.ctg.fa: 940710
velvet.ctg.fa: 1413768
WholeGenome:
soap.ctg.fa: 2783028
spades.ctg.fa: 2805095
velvet.ctg.fa: 2815477
edena.ctg.fa: 2816795
abyss.ctg.fa: 2826349
Alignment %:
raw.soap.ctg.fa: 94.97
raw.abyss.ctg.fa: 95.17
raw.spades.ctg.fa: 99.31
22
raw.velvet.ctg.fa: 99.46
raw.edena.ctg.fa: 99.59
Integrate.py MW2 -i abyss.ctg.fa,edena.ctg.fa
cisa.ctg.fa Alignment:93.86%
whole:2768641
N50: 1419734
Number of contigs: 5
Length of the longest contig: 1419734
To increase genome size for assembly integration:
Integrate.py MW2 -i abyss.ctg.fa,edena.ctg.fa,velvet.ctg.fa,spades.ctg.fa -evaluate
-gs 3000000
cisa.ctg.fa Alignment:99.51%
whole:2830305
N50: 1419734
Number of contigs: 7
Length of the longest contig: 1419734
Postassemble.py -o cisa.ordered.fa -read1 ../../50X_R1.fastq -read2 ../../50X_R2.fastq
-m 423 -s 71
N50.py Bridged.ctg.fa
whole:2830305
N50: 1419734
Number of contigs: 7
23
Length of the longest contig: 1419734
Alignment:99.51%
Identical assembly was obtained (to cisa.ctg.fa) after post-assembly!
Quast 2.3 Evaluation
quast.py -o quast -R NC_003923.fna -G NC_003923.gff raw.abyss.ctg.fa
raw.edena.ctg.fa raw.velvet.ctg.fa raw.soap.ctg.fa raw.spades.ctg.fa cisa.ctg.fa
24
Dataset: R. sphaeroides 2.4.1
http://systems.illumina.com/systems/miseq/scientific_data.html
The B. cereus ATCC 10987 & R. sphaeroides ATCC BAA-808 samples were
sequenced on the MiSeq System using the new MiSeq Reagent Kit v3 at a 2 x 300 bp
read length configuration with dual indexing. The total yield was 17.8 Gb with 70%
of bases at or above Q30. The average fragment length for this data set is 475 bp.
fastqc Rhodo_S2_L001_R2_001.fastq
Preprocess.py -read1 Rhodo_S2_L001_R1_001.fastq -read2
Rhodo_S2_L001_R2_001.fastq -g 4600000
This process took about 4 hr! Paired reads were trimmed to 200 bp.
AutoRun.py Miseq300 -read1 50X_R1.fastq -read2 50X_R2.fastq
2014-11-14,00:23 ==> 2014-11-14,04:19
n50:
soap.ctg.fa: 19291
velvet.ctg.fa: 20714
edena.ctg.fa: 22179
25
abyss.ctg.fa: 22737
spades.ctg.fa: 77772
Ctgs:
soap.ctg.fa: 403
velvet.ctg.fa: 373
abyss.ctg.fa: 345
edena.ctg.fa: 293
spades.ctg.fa: 142
The longest ctg's length:
soap.ctg.fa: 69823
abyss.ctg.fa: 97096
edena.ctg.fa: 97244
velvet.ctg.fa: 97422
spades.ctg.fa: 320329
WholeGenome:
edena.ctg.fa: 4014722
abyss.ctg.fa: 4292653
soap.ctg.fa: 4339440
velvet.ctg.fa: 4369665
spades.ctg.fa: 4429967
Alignment %:
raw.edena.ctg.fa: 95.82
raw.abyss.ctg.fa: 97.90
raw.soap.ctg.fa: 98.19
raw.velvet.ctg.fa: 98.21
raw.spades.ctg.fa: 98.24
Keep all assemblies.
cisa.ctg.fa Alignment:98.26%
whole:4401027
N50: 79366
Number of contigs: 130
Length of the longest contig: 320634
r2cat against the reference R.sphaeroides 2.4.1
(http://sb.nhri.org.tw/CISA/upload/en/2013/8/R_sphaeroides.fna-01021413.gz).
26
Postassemble.py -o ordered.cisa.fa
N50.py Bridged.ctg.fa
whole:4391111
N50: 87718
Number of contigs: 120
Length of the longest contig: 320634
Alignment:98.26%
27
References
Bankevich, A., et al. (2012) SPAdes: a new genome assembly algorithm and its
applications to single-cell sequencing, Journal of computational biology : a journal of
computational molecular cell biology, 19, 455-477.
Field, D., et al. (2006) Open software for biologists: from famine to feast, Nature
biotechnology, 24, 801-803.
Hernandez, D., et al. (2008) De novo bacterial genome sequencing: millions of very
short reads assembled on a desktop computer, Genome Res, 18, 802-809.
Husemann, P. and Stoye, J. (2010) r2cat: synteny plots and comparative assembly,
Bioinformatics, 26, 570-571.
Li, R., et al. (2009) SOAP2: an improved ultrafast tool for short read alignment,
Bioinformatics, 25, 1966-1967.
Lin, S.H. and Liao, Y.C. (2013) CISA: contig integrator for sequence assembly of
bacterial genomes, PLoS One, 8, e60843.
Luo, R., et al. (2012) SOAPdenovo2: an empirically improved memory-efficient
short-read de novo assembler, GigaScience, 1, 18.
Milne, I., et al. (2013) Using Tablet for visual exploration of second-generation
sequencing data, Brief Bioinform, 14, 193-202.
Seemann, T. (2014) Prokka: rapid prokaryotic genome annotation, Bioinformatics.
Simpson, J.T., et al. (2009) ABySS: A parallel assembler for short read sequence data,
Genome Research, 19, 1117-1123.
Tatusova, T., et al. (2014) RefSeq microbial genomes database: new representation
and annotation strategy, Nucleic acids research, 42, D553-559.
Zerbino, D.R. and Birney, E. (2008) Velvet: algorithms for de novo short read
assembly using de Bruijn graphs, Genome Res, 18, 821-829.
28