ChimPipe Documentation Release v0.8.0 Bernardo Rodríguez-Martín, Emilio Palumbo and Sarah Djebali

Transcription

ChimPipe Documentation
Release v0.8.0
Bernardo Rodríguez-Martín, Emilio Palumbo and Sarah Djebali
November 03, 2014
Contents
1
Contents:
1.1 ChimPipe for chimera detection
1.2 Installation . . . . . . . . . . .
1.3 Manual . . . . . . . . . . . . .
1.4 FAQ . . . . . . . . . . . . . . .
1.5 Downloads . . . . . . . . . . .
1.6 Contact . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3
3
4
5
9
12
13
i
ii
ChimPipe Documentation, Release v0.8.0
ChimPipe is a pipeline for the detection of chimeric transcripts from Paired-End RNA-seq data developed at the
Computational Biology of RNA Processing group of the Centre for Genomic Regulation (CRG) in Barcelona, Spain.
ChimPipe’s source code is freely available from our GitHub repository.
Contents
1
2
Contents
CHAPTER 1
Contents:
1.1 ChimPipe for chimera detection
1.1.1 Biological importance of chimeras
Chimeras are transcripts whose sequence is encoded in two or more different genes. The study of these transcripts is
relevant in two different contexts:
• Cancer genomics. It is very well know that the generation of fusion genes through chromosomal rearrangements is a major driver in certain types of cancer. These are hydrid genes formed from two previously separate
genes that encode altered proteins with abnormal activity. Thus, the identification and study of chimeric transcripts in cancer can help detecting such alterations in cancer genomes and eventually lead to new diagnoses
and therapeutic targets.
• Genome biology. Understanding how the information is encoded in the genome is one of the challenges of
current biology. In the last years, an increasing number of chimeric transcripts have been reported in non cancer
samples from different species. Although their function and the mechanism through which they arise are not
well understood, their existence could reveal new dimensions in how the information is encoded in the genome.
One could hypothesize that the combination of sequences from different genes could be a way of increasing the
information content of the genomes. Several transcriptional events have been proposed to explain how this
transcripts could arise:
– Trans-splicing. Splicing between two pre-mRNAs from different genes producing a chimera.
– Readthrough or conjoined transcription. Combined transcription of two consecutive genes on the same
chromosome and strand. Typically, such chimera begin at the promoter of the upstream gene and end at
the termination point of the downstream gene.
– SHS-transcriptional slippage. Pairing of short homologous sequences (SHS) among two different transcripts that leads to a recombination producing a chimeric transcript.
1.1.2 ChimPipe’s principle
Briefly, ChimPipe uses the GEMtools RNA-seq pipeline to contiguously map the read pairs to the genome and the
annotated transcriptome, and segmentally maps them into the genome to find de novo splice junctions in the same
chromosome and strand. Then, in a second mapping step, it further segmentally maps the reads that could not be
mapped this way to find de novo splice junctions allowing connections between different chromosomes and strands
through the GEM RNA mapper. ChimPipe further aggregates the two blocks of segmentally mapped reads whose two
parts map in two different genes into chimeric junctions, and applies several filters such as number of spanning reads
and number of read pairs encompassing the chimeric junction.
3
The ChimPipe’s paper is currently being written.
1.1.3 ChimPipe’s features
• Can handle both unstranded and stranded data
• Provides the precise chimeric junction coordinates
• Outputs many pieces of information for each chimeric junction
• Has a good balance between sensitivity and specificity (according to our benchmark)
1.2 Installation
1.2.1 System’s requirements
Hardware:
• 64-bits CPU
• RAM: ~40G for 100 million human PE reads of 52 bases.
Software:
• 64-bit Linux System
• Bedtools v2.20.1 or higher
• Samtools v0.1.19 or higher
• Blast v2.2.29+ or higher (only if you want to generate your own similarity text files, see Reference between
gene pairs in the Manual section)
1.2.2 Downloading ChimPipe
You can do it from the ChimPipe GitHub repository in two different ways:
1. Press the Download ZIP button which is located at the bottom right of the page to download the most recent
version of the code as a zip archive.
2. Clone the Git repository in case you want the code under the Git version control system (currently, it is not
allowed to contribute to the project, but it will in the future):
#
$
#
$
Move into the folder in which you want to clone the repositoy.
cd /users/rg/brodriguez/bin
Clone it.
git clone https://github.com/Chimera-tools/ChimPipe.git
1.2.3 Setting up ChimPipe’s path
Finally, go to your newly created ChimPipe’s directory and type make to set up the path to ChimPipe in your system.
#
$
#
$
4
Move into ChimPipe.
cd /users/rg/brodriguez/bin/ChimPipe
Type make to set up the path to ChimPipe in your system
make
Chapter 1. Contents:
1.3 Manual
This section describes ChimPipe’ input files, how to generate them, how to run ChimPipe, and how to interpret its
outputs.
1.3.1 Input files
ChimPipe takes as input 4 mandatory files and 1 optional file.
Mandatory:
• Paired-end (PE) RNA-seq reads
• Genome index
• Genome annotation
• Transcriptome index
Optional:
• Gene pair similarity file
Paired-end (PE) RNA-seq reads
ChimPipe has been designed to deal with Illumina paired-end RNA sequencing data. It takes as input two FASTQ
files, one for the first mates and one for the second mates of the read pairs. These files need to be located in the same
directory and named according to this convention:
• Mate 1. “SampleId + [.1|_1] + [.fastq|.txt] (+ .gz if compressed)”
• Mate 2. “SampleId + [.2|_2] + [.fastq|.txt] (+ .gz if compressed)”
5.
(a) BERGER_1.fastq.gz and BERGER_2.fastq.gz would be the compressed FASTQ files for mate 1 and mate
2 in the BERGER sample.
Warning: The read identifier should follow Illumina convention to specify which member of the pair the read
is. See our FAQ section for more information.
Warning: Make sure the 2 FASTQ files are in the same directory and you use the same convention for
both mates. E. g: if mate 1 is BERGER_1.fastq.gz, mate 2 can not be BERGER.2.txt.gz but has to be
BERGER_2.fastq.gz
Genome index
An indexed reference genome in GEM format has to be provided to do the mapping steps.
We provide some pre-generated genome indices for human, mouse and drosophila in the Downloads
section, however if you have a different genome, you just need to run the GEMtools indexer (at
ChimPipe/bin/gemtools-1.7.1-i3/gemtools) with your genome in FASTA format to produce it:
$ gemtools index -i genome.fa
It will produce 3 files in the directory where the genome fasta file is located:
• genome.gem – indexed genome file in GEM format (needed for running ChimPipe).
1.3. Manual
5
• genome.hash – hash table with the genome (not needed).
• genome.log – indexer log file (not needed).
Tip: If your machine has more than one CPU, it is recommended to run the indexer with multiple threads. Use the
option -t <threads>, where threads is the number of available CPUs.
Genome annotation
ChimPipe also takes as input a genome annotation file in GTF format (including annotated exons). It can contain
other features different from exons, i. e. introns or UTR, but they will not be considered by ChimPipe in the chimera
detection process. This annotation needs to contain at least two (tag,value) pairs in the attribute field with the gene_id
and the transcript_id tags. Two optional (tag,value) pairs will be taken into account by ChimPipe if they are provided:
gene_name and gene_type. e.g:
# This is an example of an annotated exon with an appropiate format.
# The attributes are the gene_id, transcript_id (mandatory), the gene type and gene name (optional),
# plus some additional (tag,value) pairs that will not be considered by ChimPipe.
chr1
HAVANA exon
69091
70008
.
+
.
gene_id "ENSG00000186092.4"; transcri
transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number
tag
"CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT0
Note: ChimPipe has been benchmarked with Gencode v10 and UCSC Known Genes human gene annotations. It
displayed a better sensitivity with Gencode v10 but a similar false positive rate with both annotations. Thus, it is
advisable to use Gencode annotation, since it is a richer annotation which increases the sensitivity of the chimera
detection process.
Transcriptome index
An transcriptome index in GEM format has to be provided as input to ChimPipe in the same directory as the genome
annotation GTF file, in order to find reads spanning annotated splice junctions.
We provide some pre-generated transcriptome indices for human, mouse and drosophila annotations in the Downloads section, however if your genome annotation or your genome is different, you will need to to run the GEMtools
transcriptome indexer ((at ChimPipe/bin/gemtools-1.7.1-i3/gemtools)) on your previously generated
GEM indexed genome and your annotation in GTF format, as indicated below.
$ gemtools t-index -i genome.gem -a annotation.gtf
It will produce 5 files in your current working directory:
• annotation.gtf.junctions – annotated splice junctions coordinates (not needed)
• annotation.gtf.junctions.fa – annotated splice junctions sequence (not needed)
• annotation.gtf.junctions.gem – transcriptome index in GEM format (needed)
• annotation.gtf.junctions.keys – keys to convert from transcriptome to genome (needed)
• annotation.gtf.junctions.log – indexer log file (not needed)
Tip: If your machine has more than one CPU it is recommended to run the indexer with multiple threads. Use the
option -t <threads>, where threads is the number of available CPUs.
6
Warning: The transcriptome index has to be placed in the same folder as the genome annotation to be used by
ChimPipe.
Gene pair similarity file (Optional)
One of ChimPipe’s filtering steps to discard actefactual chimeras is to filter out those chimeric junctions connecting
genes that encode transcripts with a high sequence homology. To do this ChimPipe produces a matrix with information
about the sequence similarity between annotated gene pairs. This step is done internally and takes around 1:30h. So,
in case you want to run many samples with the same annotation, it is recommended to provide the matrix with the
option --similarity-gene-pairs <MATRIX TEXT FILE>, to avoid recomputing it for every sample.
You can download our pre-generated matrices por human,
mouse and drosophila annotations from Downloads section or you can produce your own matrix with the script
ChimPipe/src/bash/similarity_bt_gnpairs.sh as follows:
$ bash similarity_bt_gnpairs.sh annot.gtf genome.gem
Please check out our FAQ section for more information about how the script works.
Warning: Make sure you run ChimPipe with a similarity matrix generated from the annotation and genome you
are using.
1.3.2 Executing ChimPipe
1. Setting up the environment
As explained in the Installation section, you need to have BEDtools, SAMtools and Blast installed on your system to
execute ChimPipe. In case you do not have them, you can download and install them from their web pages. Once they
are installed, you have to export the path to their binaries.
Please check out our FAQ section in case you have any problem.
2. Running ChimPipe
Once you have generated the genome and the transcriptome indices, you know the quality offset and the library type
of your PE RNA-seq reads, you can run ChimPipe as follows:
bash ChimPipe.sh -i reads_1.fastq -g genome.gem -a annotation.gtf
All these files and parameters given as input to ChimPipe are mandatory arguments. Please see bellow their descripion:
-i|--input reads_1.fastq - First mate sequencing reads. ChimPipe deals with paired-end data.
Please make sure the second mate file is in the same directory as
the first one, and the files are named according to the same convention.
E.g: the second mate of "reads_1.fastq" should be "reads_2.fastq".
-g|--genome-index genome.gem - Index for the reference genome in GEM format.
-a|--annotation annotation.gtf - Reference genome annotation file in GTF format. The transcriptome
index has to be in the same directory as the annotation.
1.3. Manual
7
Optional arguments. Please do ChimPipe.sh -h or --help to see a short help with the main options. You
can also do ChimPipe.sh --full-help to see the all the possible options.
Tip: If your machine has more than one CPU it is recommended to run ChimPipe with multiple threads (at least 4). It
will speed up the mapping steps a lot. Use the option -t|--threads <threads>, where threads is the number
of CPUs available.
Note: The pipeline is restartable: this means that if ChimPipe fails and you run it again, it will skip the steps that
have been already completed. You just need to make sure that you removed the files generated in the step where the
pipeline failed.
1.3.3 Output
By default, ChimPipe produces 3 output files:
• First mapping BAM file
• Second mapping MAP file
• Chimeric junctions file
Tip: If you want to keep intermediate output files, run ChimPipe with the --no-cleanup option.
First mapping BAM file
BAM file containing the reads mapped in the genome, transcriptome and de novo transcriptome with the GEMtools
RNA-seq pipeline.
This is the standard format for next-generation sequencing, meaning that most analysis tools work with this format.
The bam file produced can therefore be used to do other downstream analyses such as gene and transcript quantification
or differential gene expression analysis.
Second mapping MAP file
MAP file containing reads segmentally mapped in the genome allowing for interchromosomal, different strand and
unexpected genomic order mappings.
Chimeric junction file
Tabular text file containing the detected chimeric junctions in your RNA-seq dataset. It has rows of 19 fields, where
each row corresponds to a chimeric junction and the fields contains information about it. Here is a brief description of
the 19 fields:
1. juncId - Chimeric junction identifier. It is an string encoding the position of the chimeric junction in the genome as follows: chrA”_”breakpointA”_”strandA”:”chrB”_”breakpointB”_”strandB. E. g.,
“chr4_90653092_+:chr17_22023757_+” is a chimeric junction between the position 90653092 of the chromosome 4 in the plus strand, and the position 22023757 of the chromosome chr17 in the plus strand.
2. nbstag - Number of staggered reads supporting the chimera.
3. nbtotal - Total number of reads supporting the chimera.
4. maxbeg - Maximum beginning of the chimeric junction, The starting position at which
8
5. maxEnd - Maximum end of the junction
6. samechr - Flag to specify if the connected gene pairs are in the same cromosome (1) or not (0).
7. samestr - Flag to specify if the connected gene pairs are in the same strand (1) or not (0), NA in case the samechr
field was 0.
8. dist - Distance between the two breakpoints, NA in case the “samestr” field was 0.
9. ss1 - Splice donor site sequence.
10. ss2 - Splice acceptor site sequence.
11. gnlist1 - List of genes overlapping the first part of the chimera.
12. gnlist2 - List of genes overlapping the second part of the chimera.
13. gnname1 - Name of the genes in the field gnlist1, ”.” if unknown.
14. gnname2 - Name of the genes in the field gnlist1, ”.” if unknown.
15. bt1 - Biotype of the genes in the field gnlist1, ”.” if unknown.
16. bt2 - Biotype of the genes in the field gnlist2, ”.” if unknown.
17. PEsupport - Total number of read pairs supporting the chimera, ”.” if not Paired-end support. It is a string
containing information about the number of read pairs supporting the connection between the involved gene
pairs as follows: geneA1-GeneA2:nbReadPairs,geneB1-geneB2:nbReadPairs. E.g.: “1-1:1,3-1:2” means that
the connection between the genes 1, in the gnlist1 and gnlist2 respectively, is supported by 1 read pair; and the
connection between the gene 3 in the gnlist1 and the gene 1 in the gnlist2 is supported by 2 read pairs.
18. maxSim - Maximum percent of similarity in the BLAST alignment between the transcript with the longest
BLAST alignment, ”.” if no blast hit found.
19. maxLgal - Maximum length of the BLAST alignment between all the transcripts of the gene pairs connected
by the chimeric junction, ”.” if no blast hit found.
Example
chr1_121115975_+:chr1_206566046_+ 1 1 121115953 206566073 1 1 85450071 GC AG SRGAP2D, SRGAP2, SRGAP2D, SRGAP2 . . 1-1:2, 99.44 1067
1.4 FAQ
1.4.1 Does ChimPipe considers the reads quality for the mapping step?
The quality score (Q) measures the probability that a base is called incorrectly by the sequencing machine. Within
your FASTQ files, they are represented in the fourth line of each read as an ASCII character string (each character
corresponds to the Q score of a certain base in the sequencing read). The correspondence between each ASCII
character and the Q score is based on some offset. This offset varies depending on the sequencing platform (Illumina
machines from CASAVA v1.8 uses 33, while older ones use 64).
Yes, ChimPipe deals with both 33 and 64 quality encodings. It will automatically detect it from your reads, so you
do not need to specify it as a parameter.
1.4.2 Which sequencing library protocols are supported?
Different protocols can be used to generate a RNA-seq library. Currently, ChimPipe can handle data generated with
the following protocols:
1.4. FAQ
9
• Non strand-specific protocols. (unstranded data). The information about from which strand the transcript is
transcribed is not available.
• Strand-specific protocols (stranded data):
– MATE1_SENSE. Mates 1 are sequenced from the transcript sequence (they will map on the same strand
as the transcript), and mates 2 are sequenced from the reverse complement of the transcript sequence (they
will map on the strand that is the opposite of the transcript strand).
– MATE2_SENSE. Mates 1 are sequenced from the reverse complement of the transcript sequence (they
will map on the strand that is the opposite of the transcript strand), and mates 2 are sequenced from the
transcript sequence (they will map on the same strand as the transcript).
ChimPipe is able to infer the protocol used to produce your data. To do it, it takes a subset of 1M of mapped reads and
compares the mapping strand with the strand of the annotated gene where they map. OptionaLly, you can supply this
information and skip this step with the option -l|--seq-library <library>.
1.4.3 Read identifier format
The FASTQ file uses four lines per sequencing read. You need to check the format of the first line of each read, which
begins with the ‘@’ character and is followed by a read identifier. This identifier should meet one of the two Illumina
standards to specify which member of the pair the read is:
• CASAVA lower than v1.8. The identifier has to be a single string ended in /1 or /2 for mate 1 and mate 2,
respectively. E. g.:
$ # Mate 1
@SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868#0/1
GTAACATATTCACAGACATGTCAGTGTGCACTCAGGAACACTGGTTTCATT
+SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868#0/1
IIIIIIIIIIIIIII>=CII=8=H032/-++D+’.@)2+4/+)1’4.#"*.
$ # Mate 2
@SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868#0/2
CAGATGGCATTGCAGAACTAAAAAAGCAAGCTGAATCAGTGTTAGATCTCC
+SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868#0/2
IIIGIIIIIIIIIFI:>DID<AFH=>78@I41I055549.?+.42-’%**’
• CASAVA v1.8 or higher. The identifier consists on two strings separated by a space with the second one
specifying the mate in the first digit. E. g:
# Mate 1
@seq.1_set1:140:D034KACXX:1:2105:18345:62400 1:N:0:
CCCAGCCTGTTTCCCTGCCTGGGAAACTAGAAAGAGGGCTTCTTTCTCCT
+
IJJJIIIIIGIGIEBHHHGFFFF=CDEEEEEDDDDCCCDDA?55><CBBB
# Mate 2
@seq.1_set1:140:D034KACXX:1:2105:18345:62400 2:N:0:
GCACCCTTCACTCCCTCCCTTGGGCGCCTCCCTCCCGAGGGTAGGGACCC
+
FFHIJJCHIIHBHIIIAHFFFFFCDEDEECDBB;??@CD?CCCCCCC@CC
I case your FASTQ files do not meet this format you should modify the identifier. Awk is a perfect tool for such kind
of problems. E. g: I downloaded a dataset from the NCBI Sequence Read archive:
# Mate 1 read without a proper identifier. It has three strings as identifier and does not end with "
@SRR018259.1 BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868 length=51
10
+SRR018259.1 BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868 length=51
# No worries, I can use awk to fix it.
$ awk ’{if (NR%2==1){print $1"_"$2"_"$3"#0/1"} else {print $0}} dataset_1.fastq
@SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868_length=51#0/1
+SRR018259.1_BI:080831_SL-XAN_0004_30BV1AAXX:5:1:708:1868_length=51#0/1
$ # Finally, I apply the same procedure for the mate 2..
1.4.4 How can I export the path to the dependencies?
To export the path of bedtools, samtools and blast (if needed) binaries you just need to type:
$ export PATH=<BEDTOOLS_BINARIES_PATH>:<SAMTOOLS_BINARIES_PATH><BLAST_BINARIES_PATH>:$PATH
$ # E.g. export bedtools and samtools on my system
$ export PATH=/users/rg/brodriguez/bin/bedtools2-2.20.1/bin:/users/rg/brodriguez/bin/samtools-0.1.19:
1.4.5 How does the script to compute gene pair similarity work?
This script will produce a matrix containing gene pair similarity information through 4 steps:
1. Extract the cDNA sequence of each transcript in the annotation.
2. Make a BLAST database out of the transcript sequences.
3. Run BLAST on all trancript against all transcripts to detect local similarity between transcripts.
4. Produce a 8 fields matrix where each row corresponds to a gene pair and it contains information about the
alignment between the pair of transcripts of this two genes with the maximum alignment similarity and length.
Here is a brief description of the 8 fields:
(a) Gene id A
(b) Gene id B
(c) Transcripts alignment similarity
(d) Transcript alignment length
(e) Transcript name A
(f) Transcript name B
(g) Trancript A exonic length
(h) Transcript B exonic length
Note that is expect BLAST binaries to be in your PATH.
Example
ENSG00000000003.10 ENSG00000003402.15 91.43 70 ENST00000373020.4 ENST00000309955.3 2206 14672
1.4. FAQ
11
1.5 Downloads
1.5.1 GEM genome indices
Homo sapiens
• GRCh37/hg19 (February 2009)
Mus musculus
• NCBI37/mm9 (July 2007)
Drosophila melanogaster
• FlyBase genome v5.46 (July 2012)
1.5.2 GEM transcriptome indices
H. sapiens
• UCSC known genes (July 2007, hg19)
• Gencode v19 (July 2013, hg19) long transcripts.
• Gencode v10 (July 2011, hg19) long transcripts.
Note: The long transcript group excludes ribosomal, micro, transfer, small cytoplasmic, small nucleolar and small
nuclear RNAs. They were removed since they are not relevant for chimera detection and they could be a source of
artefactual chimeras due to sequence similarity between them, specially in ribosomal RNAs.
M. musculus
• Ensembl v65 (December 2011, mm9) long transcripts.
D. melanogaster
• FlyBase annotation v5.46 (July 2012, v5.46)
1.5.3 Similarity between gene pair matrices
H. sapiens
• UCSC known genes similarity matrix February 2009 (GRCh37/hg19).
• Gencode v19 similarity matrix for long transcripts (GRCh37/hg19).
• Gencode v10 similarity matrix for long transcripts (GRCh37/hg19).
12
M. musculus
• Ensembl v65 similarity matrix for long transcripts (GRCh37/mm9).
D. melanogaster
• ‘FlyBase v5.46 similarity matrix‘_
1.6 Contact
Please feel free to join the ChimPipe’s mailing list in case you have a question, you want to report an issue or request
a feature.
You can also directly contact us at: [email protected]
1.6. Contact
13

ChimPipe Documentation Release v0.8.0 Bernardo Rodríguez-Martín, Emilio Palumbo and Sarah Djebali

Transcription

Similar documents

Exploring the ARUCC PCCAT Transcript and Transfer Guide

How to Request A Transcript for the NCAA Clearinghouse

Array-to-Go

View the full schedule - Genome Engineering 4.0 Workshop

Dragonfly genome project

How Journalists Explore (and Sometimes Get Lost In)

SOUTHWESTERN HIGH SCHOOL GUIDANCE DEPARTMENT

Debjit Ray, 5.2

What has changed - Center for Genetics and Society

El Proyecto Genoma Humano