PWM Genome Scan


Max-Planck-Institut
für molekulare Genetik
Comparing Methods for Identifying
Transcription Factor Target Genes
Alena van Bömmel (R 3.3.73)
Matthew Huska (R 3.3.18)
Max Planck Institute for Molecular Genetics
Software Praktikum (practical course), 1.2.2013
Transcriptional Regulation
TF not bound = no gene expression
TF bound = gene expression
Problem: There are many genes and many TFs. How do we identify the targets of a TF?
Methods for Identifying TF Target Genes
• PWM Genome Scan
• Microarray
• ChIP-seq
PWM Genome Scan
• Purely computational method
• Input:
  o Position weight matrix (PWM) for your TF
  o Genomic region(s) of interest
  o Score threshold
• Pros:
  o No need to do wet lab experiments
• Cons:
  o Many false positives; not able to take biological conditions into account
PWM genome scan
1) Download the PWMs of your TF of interest from the database (they might include >1 motif)
2) Define the sequences to analyze (promoter sequences)
3) Run the PWM genome scan (hit-based method or affinity prediction method)
4) Rank the genomic sequences by the affinity signal
Suggested Reading:
• Roider et al. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics (2007).
• Thomas-Chollier et al. Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nature Protocols (2011).
PWM-PSCM
Stat3 PSCM (position-specific count matrix)
Binding motif for the transcription factor Stat3, from a ChIP-seq experiment in mouse (JASPAR ID: MA0144.1)
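A count matrix is usually turned into log-odds weights before scanning. The sketch below shows one common way to do this. The counts are made-up toy values (not the Stat3 matrix), and the uniform background and the pseudocount of 1 are assumptions chosen for illustration.

```python
import math

BASES = "ACGT"


def pscm_to_pwm(counts, background=None, pseudocount=1.0):
    """Convert {base: [count per position]} into log2-odds weights."""
    background = background or {b: 0.25 for b in BASES}
    width = len(counts["A"])
    pwm = {b: [] for b in BASES}
    for i in range(width):
        # Column total including the pseudocount mass.
        total = sum(counts[b][i] for b in BASES) + pseudocount
        for b in BASES:
            # Pseudocount is distributed according to the background.
            freq = (counts[b][i] + pseudocount * background[b]) / total
            pwm[b].append(math.log2(freq / background[b]))
    return pwm


# Toy 3-column count matrix (hypothetical values).
toy_counts = {
    "A": [8, 0, 1],
    "C": [0, 9, 1],
    "G": [1, 0, 7],
    "T": [1, 1, 1],
}
toy_pwm = pscm_to_pwm(toy_counts)
```

Positive weights mark bases that are enriched relative to the background, negative weights mark depleted bases.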
TRAP
1) Convert the PSSM (position-specific scoring matrix) to a PSEM (position-specific energy matrix)
2) Scan the sequences of interest with TRAP
3) Results in 1 score per sequence = binding affinity
4) Doesn't separate the exact TF binding sites (easier for ranking)
5) Sequences must have the same length!
ANNOTATE=/project/gbrowse/Pipeline/ANNOTATE_v3.02/Release
TRAP trap.molgen.mpg.de/cgi-bin/home.cgi
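To make the idea of "one affinity score per sequence" concrete, here is a simplified sketch in the spirit of the biophysical model of Roider et al. It is not the TRAP implementation: the energy definition, `lam`, `r0`, the toy counts, and the two toy promoters are all assumptions for illustration. Real analyses should use TRAP itself.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def energy_matrix(counts, lam=0.7, pseudocount=1.0):
    """Toy PSEM: mismatch energy relative to the best base per position."""
    width = len(counts["A"])
    psem = []
    for i in range(width):
        column = {b: counts[b][i] + pseudocount for b in "ACGT"}
        best = max(column.values())
        psem.append({b: math.log(best / column[b]) / lam for b in "ACGT"})
    return psem


def site_occupancy(site, psem, r0):
    """Expected occupancy of one site (sequence must contain only ACGT)."""
    energy = sum(psem[i][base] for i, base in enumerate(site))
    weight = r0 * math.exp(-energy)
    return weight / (1.0 + weight)


def sequence_affinity(seq, psem, r0=1.0):
    """Sum the occupancy over all windows on both strands."""
    width = len(psem)
    revcomp = seq.translate(COMPLEMENT)[::-1]
    total = 0.0
    for strand in (seq, revcomp):
        for start in range(len(strand) - width + 1):
            total += site_occupancy(strand[start:start + width], psem, r0)
    return total


# Hypothetical counts and equal-length toy promoters.
toy_counts = {"A": [8, 0, 1], "C": [0, 9, 1], "G": [1, 0, 7], "T": [1, 1, 1]}
psem = energy_matrix(toy_counts)
promoters = {"geneA": "TTACGTTTTT", "geneB": "TTTTTTTTTT"}
ranking = sorted(
    promoters,
    key=lambda gene: sequence_affinity(promoters[gene], psem),
    reverse=True,
)
```

Because the score is a sum over all windows, longer sequences accumulate more affinity, which is why the sequences being ranked should have the same length.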
Matrix-scan
1) Uses the PSSM directly
2) Finds all TFBS which exceed a predefined threshold (e.g. p-value)
3) More complicated to create ranked lists of genomic sequences (a sequence can contain several hits)
4) Exact location of the binding site is reported
matrix-scan http://rsat.ulb.ac.be/
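The hit-based approach can be sketched as a sliding-window scan that reports every position above a threshold. This is an illustration only, not how RSAT matrix-scan is implemented: the weights in `toy_pwm` are hypothetical, and the threshold here is a raw score rather than a p-value.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan_hits(seq, pwm, threshold):
    """Return (start, strand, score) for every window scoring >= threshold."""
    width = len(pwm["A"])
    hits = []
    for strand, text in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(len(text) - width + 1):
            window = text[offset:offset + width]
            score = sum(pwm[base][i] for i, base in enumerate(window))
            if score >= threshold:
                # Report minus-strand hits in forward-strand coordinates.
                start = offset if strand == "+" else len(seq) - offset - width
                hits.append((start, strand, round(score, 3)))
    return sorted(hits)


# Hypothetical log-odds weights for a 3 bp motif with consensus ACG.
toy_pwm = {
    "A": [1.5, -2.0, -1.0],
    "C": [-2.0, 1.6, -1.0],
    "G": [-1.0, -2.0, 1.4],
    "T": [-1.0, -1.0, -1.0],
}
hits = scan_hits("TTACGTTCGTAA", toy_pwm, threshold=4.0)
```

A sequence can yield several hits, so ranking sequences requires an extra decision, for example using the best hit or the number of hits per sequence.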
Finding the target genes
• Target genes will be the top-ranked genes (promoters)
• Which are the top-ranked genes? (top 100, 500, 1000...?)
• There's no exact definition of promoters; usually 2000 bp upstream and 500 bp downstream of the TSS
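The promoter window depends on the strand of the gene, which is easy to get wrong. A minimal sketch, assuming 0-based TSS positions and half-open intervals (the function name and coordinate convention are choices made for this example):

```python
def promoter_interval(tss, strand, upstream=2000, downstream=500,
                      chrom_length=None):
    """Return a 0-based, half-open (start, end) window around a TSS.

    The TSS itself counts as the first downstream base.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 0)
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


plus_window = promoter_interval(10_000, "+")
minus_window = promoter_interval(10_000, "-")
```

Both windows are 2500 bp long, which keeps the sequences comparable when they are ranked by affinity.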
Microarrays
→ R/Bioconductor (details later)
Microarrays (2)
• Pros:
  o There is a lot of microarray data already available (might not have to generate the data yourself)
  o Inexpensive and not very difficult to perform
  o Computational workflow is well established
• Cons:
  o Cannot distinguish between indirect regulation and direct regulation
ChIP-seq
Map reads to the genome
Call peaks to determine most likely TF binding locations
ChIP-seq (2)
• Pros:
  o Direct measure of genome-wide protein-DNA interaction(*)
• Cons:
  o Don't know whether binding causes changes in gene expression
  o More complicated experimentally and in terms of computational analysis
  o Most expensive
  o Need an antibody against your protein of interest
  o Biases are not as well understood as with microarrays
ChIP-seq analysis
1) Download the reads from the given source (experiments and controls)
2) Quality control of the reads and statistics (fastqc)
3) Mapping the reads to the reference genome (bwa/Bowtie)
4) Peak calling (MACS)
5) Visualization of the peaks in a genome browser (genome browser, IGV)
6) Finding the closest genes to the peaks (Bioconductor/ChIPpeakAnno)
Figure: Visualised peaks in a genome browser
Suggested Reading:
• Bailey et al. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput Biol (2013).
• Thomas-Chollier et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nature Protocols (2012).
Sequencing data
• Raw data = reads, usually a very large file (a few GB)
• Format: fastq (ENCODE) or SRA (Sequence Read Archive of NCBI)
Analysis
1) Quality control with fastqc
2) Filtering of reads with adapter sequences
3) Mapping of the reads to the reference genome (bwa or Bowtie)
Figure: Example of fastq data file
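A fastq record consists of four lines: an `@` header, the sequence, a `+` separator, and one quality character per base. The sketch below reads such records and computes the mean Phred score per read. It assumes the Phred+33 encoding (older Illumina data may use an offset of 64), and the two records are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file (or a blank line)
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Average Phred score of one read."""
    return sum(ord(char) - offset for char in quality) / len(quality)


# Two toy records: 'I' encodes Phred 40, '!' encodes Phred 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
kept = [
    read_id
    for read_id, sequence, quality in read_fastq(example)
    if mean_phred(quality) > 20
]
```

For real data, fastqc reports these statistics per file; a hand-written filter like this is mainly useful for understanding the format.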
Quality control with fastqc
• per base quality
• sequence quality (avg. > 20)
• sequence length
• sequence duplication level (duplication by PCR)
• overrepresented sequences/k-mers (adapter sequences)
• produces an HTML report
• manual (read it!)
Figure: Example of per base sequence quality scores
Software at the MPI:
FASTQC=/scratch/ngsvin/bin/chip-seq/fastqc/FastQC/fastqc
Mapping with bwa
• mapping the sequencing reads to a reference genome
• manual (read it!)
• map the experiments and the controls
1) reference genome in fasta format (hg19)
2) create an index of the reference file for faster mapping (only if not available)
3) align the reads (specify parameters, e.g. for # of mismatches, read trimming, threads used...)
4) generate alignments in the SAM format (different commands for single-end and paired-end reads!)
software and data at the MPI:
BWA = /scratch/ngsvin/bin/executables/bwa
hg19: /scratch/ngsvin/MappingIndices/hg19.fa
bwa index: /scratch/ngsvin/MappingIndices/BWA/hg19
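The difference between single-end and paired-end mapping can be sketched as a small command builder for the classic `bwa aln` workflow (`samse` for single-end, `sampe` for paired-end). This only assembles the commands and does not run them. The file names and the index prefix `hg19` are hypothetical, and the function name is invented for this example. Check the bwa manual for the parameters that suit your reads.

```python
import subprocess


def bwa_aln_steps(index_prefix, fastqs, out_prefix, bwa="bwa", threads=4):
    """Build (argv, stdout_file) pairs for bwa aln + samse/sampe."""
    if len(fastqs) not in (1, 2):
        raise ValueError("expected one (single-end) or two (paired-end) files")
    steps, sai_files = [], []
    for number, fastq in enumerate(fastqs, start=1):
        sai = f"{out_prefix}_{number}.sai"
        # bwa aln writes the alignment coordinates to stdout.
        steps.append(([bwa, "aln", "-t", str(threads), index_prefix, fastq], sai))
        sai_files.append(sai)
    mode = "samse" if len(fastqs) == 1 else "sampe"
    steps.append(([bwa, mode, index_prefix, *sai_files, *fastqs],
                  f"{out_prefix}.sam"))
    return steps


def run_steps(steps):
    """Execute each step, redirecting stdout to the given file."""
    for argv, stdout_file in steps:
        with open(stdout_file, "w") as out:
            subprocess.run(argv, stdout=out, check=True)


single = bwa_aln_steps("hg19", ["chip.fastq"], "chip")
paired = bwa_aln_steps("hg19", ["r1.fq", "r2.fq"], "x")
```

Building the commands as lists first makes it easy to print and review them before anything is executed, and to apply the same steps to experiments and controls.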
File manipulation with samtools
• utilities that manipulate SAM/BAM files
• manual (read it!)
1) merge the replicates in one file (still separate experiment and control)
2) convert the SAM file into a BAM file (binary version of SAM, smaller)
3) sort and index the BAM file
Now the sequencing files are ready for further analysis.
software at the MPI:
SAMTOOLS = /scratch/ngsvin/bin/executables/samtools
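One workable arrangement of these steps is sketched below as a command builder. It converts and sorts each replicate before merging, because `samtools merge` expects sorted inputs. The syntax assumes samtools 1.3 or newer (older 0.1.x releases use a different `sort` syntax), and all file names are hypothetical.

```python
def samtools_steps(sam_files, merged_bam, samtools="samtools"):
    """Build commands: SAM -> BAM, sort, merge replicates, index."""
    steps, sorted_bams = [], []
    for sam in sam_files:
        stem = sam[:-4] if sam.endswith(".sam") else sam
        bam = f"{stem}.bam"
        sorted_bam = f"{stem}.sorted.bam"
        # Convert SAM to the smaller binary BAM format.
        steps.append([samtools, "view", "-b", "-o", bam, sam])
        # Sort by coordinate so the file can be merged and indexed.
        steps.append([samtools, "sort", "-o", sorted_bam, bam])
        sorted_bams.append(sorted_bam)
    steps.append([samtools, "merge", merged_bam, *sorted_bams])
    steps.append([samtools, "index", merged_bam])
    return steps


chip_steps = samtools_steps(["rep1.sam", "rep2.sam"], "chip.bam")
```

Run the same steps separately for the experiment and the control so that the two never end up in one merged file.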
Peak finding with MACS
• find the peaks, i.e. the regions with a high density of reads, where the studied TF was bound
• manual (read it!)
1) call the peaks using the experiment (treatment) data vs. control
2) set the parameters, e.g. fragment length, treatment of duplicate reads
3) analyse the MACS results (BED file with peaks/summits)
software at the MPI:
MACS = /scratch/ngsvin/bin/executables/macs
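A sketch of the two ends of this step: assembling a MACS 1.4 style command line and reading the resulting peak BED file. The executable name `macs14`, the file names, and the example peak lines are assumptions for illustration; consult the MACS README for the options your installed version supports.

```python
def macs14_command(treatment, control, name, macs="macs14",
                   genome="hs", keep_dup="1", shiftsize=None):
    """Build a MACS 1.4 style peak-calling command (not executed here)."""
    command = [macs, "-t", treatment, "-c", control, "-f", "BAM",
               "-g", genome, "-n", name, "--keep-dup", str(keep_dup)]
    if shiftsize is not None:
        # Skip model building and use a fixed shift instead.
        command += ["--nomodel", "--shiftsize", str(shiftsize)]
    return command


def read_bed_peaks(lines):
    """Parse 5-column BED lines and sort peaks by descending score."""
    peaks = []
    for line in lines:
        if not line.strip() or line.startswith(("track", "#")):
            continue
        chrom, start, end, name, score = line.rstrip("\n").split("\t")[:5]
        peaks.append((chrom, int(start), int(end), name, float(score)))
    return sorted(peaks, key=lambda peak: peak[4], reverse=True)


command = macs14_command("chip.bam", "input.bam", "myc", shiftsize=100)
# Made-up peak lines, only to show the expected layout.
peaks = read_bed_peaks([
    "chr1\t100\t200\tpeak_1\t55.2\n",
    "chr2\t300\t450\tpeak_2\t120.0\n",
])
```

Sorting by score gives a ranked peak list, which is the natural input for deciding which peaks count as significant.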
Finding the target genes
• find the genes which are closest to the (significant) peaks
• how to define the closest distance? (± X kb)
• use ChIPpeakAnno in Bioconductor or bedtools
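The nearest-gene idea can be sketched in a few lines. This is a naive illustration, not a replacement for ChIPpeakAnno or bedtools: distance is measured from the peak summit to the TSS, the 10 kb cutoff is an arbitrary choice for X, and the gene names and coordinates are made up.

```python
def nearest_genes(peaks, genes, max_distance=10_000):
    """Assign each peak to its nearest TSS if within max_distance.

    peaks: list of (chrom, summit); genes: list of (name, chrom, tss).
    Returns {gene_name: smallest peak-to-TSS distance}.
    """
    targets = {}
    for chrom, summit in peaks:
        candidates = [
            (abs(tss - summit), name)
            for name, gene_chrom, tss in genes
            if gene_chrom == chrom
        ]
        if not candidates:
            continue
        distance, name = min(candidates)
        if distance <= max_distance:
            targets[name] = min(distance, targets.get(name, distance))
    return targets


toy_peaks = [("chr1", 1000), ("chr1", 50_000), ("chr2", 500)]
toy_genes = [("geneA", "chr1", 3000), ("geneB", "chr1", 90_000),
             ("geneC", "chr2", 400)]
targets = nearest_genes(toy_peaks, toy_genes)
```

Changing `max_distance` changes the target list, so the chosen cutoff should be reported together with the results.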
Figure: genome browser view (hg18, chr10, around 69,200,000 to 69,350,000; scale bar 100 kb) showing UCSC Genes (DNAJC12, SIRT1, HERC4, KIAA1593), ENCODE TFBS Yale/UCD/Harvard ChIP-seq peaks and signal for c-Myc in GM12878 and K562 cells, and RepeatMasker repeats.
Methods for Identifying TF Target Genes
• PWM Genome Scan
• Microarray
• ChIP-seq
Thresholds
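Once each method has a threshold, its target list becomes a gene set, and the sets can be compared. A minimal sketch of such a comparison, assuming a top-N cutoff and the Jaccard index as the overlap measure (both are choices made for this example, and all gene names and scores are placeholders):

```python
from itertools import combinations


def top_n(scores, n):
    """Return the n highest-scoring genes as a set."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {gene for gene, _ in ranked[:n]}


def pairwise_jaccard(target_sets):
    """Jaccard index for every pair of methods with a non-empty union."""
    result = {}
    for a, b in combinations(sorted(target_sets), 2):
        union = target_sets[a] | target_sets[b]
        if union:
            result[(a, b)] = len(target_sets[a] & target_sets[b]) / len(union)
    return result


# Placeholder target sets for the three methods.
target_sets = {
    "pwm": {"g1", "g2", "g3"},
    "chipseq": {"g2", "g3", "g4"},
    "microarray": {"g3", "g5"},
}
overlaps = pairwise_jaccard(target_sets)
common = set.intersection(*target_sets.values())
```

Repeating the comparison for several thresholds (for example top 100, 500, 1000) shows how sensitive the agreement between methods is to the cutoff.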
Bioinformatics
• Read mapping (Bowtie/bwa)
• Peak calling (MACS/Bioconductor)
• Peak-target analysis (Bioconductor)
• Microarray data analysis (Bioconductor)
• Differential genes (R)
• GSEA
• PWM genome scan (TRAP/MatScan)
• Statistics (R)
• Data integration (R/Python/Perl)
• Statistical analysis (R)
Bioinformatics tools
READ THE MANUALS!
• Bowtie bowtie-bio.sourceforge.net/manual.shtml
• bwa bio-bwa.sourceforge.net/bwa.shtml
• MACS github.com/taoliu/MACS/blob/macs_v1/README.rst
• TRAP trap.molgen.mpg.de/cgi-bin/home.cgi
• matrix-scan http://rsat.ulb.ac.be/
• Bioconductor www.bioconductor.org/ (more info in R course)
Databases
• GEO www.ncbi.nlm.nih.gov/geo/
• ENCODE genome.ucsc.edu/ENCODE/
• SRA www.ncbi.nlm.nih.gov/sra
• JASPAR http://jaspar.genereg.net/
Schedule
• 03.03. Introduction lecture, R course
• 04.03. R & Bioconductor homework submission
• 11.03. Presentation of the detailed plan of each group (which TF, cell line, tools, data, data integration, team work), 10:30am, 11:30am
• Every Tuesday, 10:30am, 11:30am: progress meetings
• 17.04. Final report deadline
• 24.04. (tentative) Presentations
• 28.04. Final meeting, discussion of final reports
GR Group
• Expression and ChIP-seq data: Luca F, Maranville JC, et al., PLoS ONE, 2013
• PWM database: jaspar.genereg.net
c-Myc Group
• Expression data: Cappellen, Schlange, Bauer et al., EMBO reports, 2007
• Musgrove et al., PLoS One, 2008
• ChIP-seq data: ENCODE Project
• PWM database: jaspar.genereg.net
Additional analysis
Binding motifs
• Are the overrepresented motifs in the ChIP-peak regions different?
• Do we find any co-factors?
Recommended tool: RSAT rsat.ulb.ac.be
