Sample exam questions from previous years
1. Next-gen sequencing
(3pts) To confidently call a base when sequencing all of the exons from an individual, one
must get at least 8x coverage. At 13x coverage, what fraction of bases are sequenced at
least 8 times? Assume the size of the exome is 45 million bases.
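One common way to set up the coverage question above is to treat the read depth at each base as Poisson-distributed with mean equal to the average coverage. The sketch below assumes that model (an idealization that ignores capture bias and GC effects); the function names are illustrative and not part of the course material.

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing exactly k reads at a base under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def fraction_at_least(min_depth: int, mean_coverage: float) -> float:
    """Fraction of bases expected to be covered by at least min_depth reads."""
    below = sum(poisson_pmf(k, mean_coverage) for k in range(min_depth))
    return 1.0 - below

# Values taken from the question: 8x threshold, 13x mean coverage, 45 Mb exome.
frac = fraction_at_least(8, 13.0)
exome_size = 45_000_000
print(f"fraction of bases at >= 8x: {frac:.4f}")
print(f"expected bases at >= 8x:    {frac * exome_size:,.0f}")
```

The same function with a threshold of 1 gives the fraction of bases covered at all, which is useful when reasoning about very low average coverage.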
(3pts) At heterozygous positions, one allele is maternal and the other is paternal. If a
position is sequenced 8 times, what is the probability that all 8 reads come from the mother?
What would this mean?
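The probability in the question above can be sketched as a product of independent draws, assuming each read samples the maternal or paternal allele with equal probability and that reads are independent. The function name is hypothetical.

```python
from fractions import Fraction

def prob_all_reads_from_one_allele(n_reads: int,
                                   p_allele: Fraction = Fraction(1, 2)) -> Fraction:
    """Probability that every one of n_reads independent reads samples the same,
    pre-specified allele (for example the maternal one)."""
    return p_allele ** n_reads

p_maternal_only = prob_all_reads_from_one_allele(8)
print(p_maternal_only)          # exact fraction
print(float(p_maternal_only))   # same value as a decimal
```

Exact fractions are used here so the result can be read off without rounding; doubling it gives the chance that all reads come from either single parent.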
(2pts) What are the advantages, if any, of using paired end reads to perform genome
sequencing? Exome sequencing?
(2pts) Your friend, Seeann Vee, wants to map structural variation in cancer. She decides to
use microarrays instead of sequencing because they are not as costly. What classes of
structural variants would she miss with this approach?
(3pts) A company claims to have developed a sequencing technology with perfect
accuracy and 10,000 base pair reads generated from sheared genomic DNA. Because of
this, they claim that they can sequence a human genome at 1x average coverage and get
good results (e.g. high consensus accuracy and good coverage of the genome), instead of
at the normal 30-40x. Assuming the accuracy is indeed perfect, provide two reasons why
they will need more than 1x coverage.
(3pts) You need to sequence a library of S. cerevisiae strains that are tagged by a molecular
barcode. Assume that 1) there are 6000 strains in the library, 2) each strain is uniformly
represented (this is the control population), and 3) each read gives you the sequence of 1
barcode. How many reads do you need to ensure a 95% probability that each barcode is
sequenced at least once?
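The barcode question is a version of the coupon collector problem. A minimal sketch of one way to approach it follows, assuming each read picks one of the N barcodes uniformly at random and approximating the barcodes as independent of each other (this is an approximation, not the exact inclusion-exclusion result). Function names are illustrative.

```python
import math

def prob_all_seen(n_reads: int, n_barcodes: int) -> float:
    """Approximate probability that every barcode is seen at least once,
    treating the 'missed' events of different barcodes as independent."""
    p_missed = (1.0 - 1.0 / n_barcodes) ** n_reads   # one given barcode never sampled
    return (1.0 - p_missed) ** n_barcodes

def reads_needed(n_barcodes: int, target: float) -> int:
    """Smallest read count for which the approximation above reaches target."""
    # Per-barcode miss probability allowed: 1 - target**(1/N), computed stably.
    allowed_miss = -math.expm1(math.log(target) / n_barcodes)
    n = math.log(allowed_miss) / math.log1p(-1.0 / n_barcodes)
    return math.ceil(n)

n = reads_needed(6000, 0.95)
print(n, prob_all_seen(n, 6000))
```

Under this approximation the required depth is roughly an order of magnitude more reads than there are strains, which is the qualitative point of the question.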
(2pts) Oxford Nanopore just announced that they have a sequencing technology that has a
4% error rate and a 100,000 bp read length and works on single DNA molecules. How
might these long reads impact the sequencing of human genomes (name at least one way)?
(2pts) The Oxford technology is a single molecule technology. How would you expect the
accuracy of the method to change with the position in the read? How does it change in
Illumina technology?
2. Synthetic Biology
A) (2 points) Which of the following statements about synthetic genetic circuits is true?
A. Synthetic circuits rival natural circuits in their complexity.
B. Genetic elements exhibit excellent modularity in new contexts.
C. DNA synthesis capabilities are sufficient for synthetic circuits.
D. Tuning of promoter strengths and mRNA stability is rarely needed for circuit elements.
B) (3 points) Which of the following network features were not optimized to generate an
artemisinin biosynthetic pathway in microbes?
A. Ribosome binding site sequences.
B. Enzyme active site configurations.
C. Protein stoichiometry in enzymatic complexes.
D. Codon compatibility with host organism.
C) (3 points) You are planning to initiate a synthetic biology project in which you will create
variants of an antibiotic synthetic pathway in order to produce a diverse library of modified
antibiotic molecules. Suggest a host organism to use and give a justification for your selection.
D) (2 points) Name one technique for generating a diverse library of genes when you have
access to a family of related genes, and a technique for when you only have one gene.
2. You are working on a bioremediation project and are attempting to engineer an enzyme to
degrade the toxic chemical leemealone. You discover in the literature a bacterial protein
(leftalonase) capable of converting leftalone, a molecule related to leemealone, to harmless
products. No other enzymes have yet been reported to carry out similar chemical reactions.
You decide that you will use the directed evolution of bacteria to generate the enzyme you
desire, and you clone the leftalonase gene into a bacterial expression vector.
A. (1 pt) Describe a simple petri-dish based selection for an active leemealonase.
B. (1 pt) Describe one way to generate variation in the leftalonase gene as input for your
selection.
C. (1 pt) How might you go about finding starting material for directed evolution by DNA
shuffling without leaving the comforting glow of your computer screen?
D. (2 pts) Assume you have no access to computers. How might you obtain a diverse supply of
promising genetic material with only the knowledge of the DNA sequence of leftalonase, a
coupon for two free oligonucleotides, and a sample of soil? Where would you obtain the soil
most likely to be of use to you?
3. Homology
A) (4 pts) Your colleague Professor Arthur Loggi has found a mysterious peptide sequence from
a Na’vi sample. The Na’vi genome is only 50% complete. He asks you to look for an ortholog of
this peptide in the human genome because you have a program that can do so by finding
reciprocal best BLAST hits. Please tell Professor Loggi what the key differences between
orthologs and paralogs are, and why your program may or may not work in this case. Make sure
Professor Loggi understands the caveats of using best BLAST hits to find orthologs, and suggest
some alternatives to using best BLAST hits.
B) You run your program for Professor Loggi using the peptide sequence:
RVVNLVPSFWVLDATYKNYAINYNCDVTYKLY
The top three alignments you get by Blasting the sequence to human protein database are:
Yours: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
SEQ 1: QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLF

Yours: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
SEQ 2: QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLF

Yours: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
SEQ 3: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF
(4 pts) Calculate the score of each of these three alignments using the default parameters for
BLAST, which use the BLOSUM62 substitution matrix (shown below), a gap opening penalty of
-11, and a gap extension penalty of -1. For each score you calculate, also compute the P-value
of getting a score greater than or equal to that score, assuming an EVD (extreme value distribution):
$P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$; use $\lambda = 0.693$ and $\mu = 50$. Finally, give Professor Loggi a
recommendation.
     A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A    4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R   -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N   -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D   -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C    0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q   -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E   -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G    0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H   -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I   -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L   -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K   -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M   -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F   -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P   -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S    1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T    0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W   -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y   -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V    0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
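The scoring and P-value steps asked for in this question can be sketched as follows. This is a minimal illustration, not an official solution: the matrix text is transcribed from the BLOSUM62 values given in this question, a gap of length k is assumed to cost gap_open + k * gap_extend (the BLAST convention; other courses charge only the opening penalty for the first gap position), and the function names are hypothetical.

```python
import math

BLOSUM62_TEXT = """
     A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A    4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R   -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N   -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D   -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C    0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q   -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E   -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G    0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H   -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I   -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L   -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K   -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M   -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F   -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P   -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S    1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T    0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W   -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y   -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V    0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

def parse_matrix(text):
    """Turn the whitespace-separated table into a {(row, col): score} dict."""
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0]
    return {(row[0], col): int(val)
            for row in rows[1:] for col, val in zip(columns, row[1:])}

SCORES = parse_matrix(BLOSUM62_TEXT)

def alignment_score(query, subject, scores=SCORES, gap_open=11, gap_extend=1):
    """Score a given gapped alignment; '-' marks a gap in either sequence.
    Assumption: adjacent gap columns count as one gap, even across sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    total, in_gap = 0, False
    for a, b in zip(query, subject):
        if a == "-" or b == "-":
            if not in_gap:
                total -= gap_open      # charged once per gap
            total -= gap_extend        # charged for every gap column
            in_gap = True
        else:
            total += scores[(a, b)]
            in_gap = False
    return total

def evd_pvalue(x, lam=0.693, mu=50.0):
    """P(S >= x) = 1 - exp(-exp(-lam * (x - mu))), computed stably via expm1."""
    return -math.expm1(-math.exp(-lam * (x - mu)))

query = "RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY"
hits = {"SEQ 1": "QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLF",
        "SEQ 2": "QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLF",
        "SEQ 3": "RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF"}
for name, subject in hits.items():
    s = alignment_score(query, subject)
    print(name, s, evd_pvalue(s))
```

Parsing the table from text rather than hard-coding a dictionary keeps the values easy to check against the printed matrix by eye.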
C) (2 pts) In primates, the amino acid Histidine is 8% of the total amino acid content of their
proteomes. In Na’vi-related species, Histidine is 2% of the total proteome. Assume that the
probability of observing two Histidines aligned to each other in alignments of orthologs from
within both groups of species is the same. If you were to construct two BLOSUM matrices, one
from primates and one from Na’vi-related species, in which group would the score for aligning
Histidine to Histidine be higher? Why?
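BLOSUM-style scores are log-odds ratios: the observed frequency of an aligned pair divided by the frequency expected from background composition alone. The sketch below illustrates that relationship for the Histidine question. The observed pair frequency q_hh is a made-up value chosen only for illustration, and the factor of 2 (half-bit units, as in BLOSUM62) is an assumption about scaling.

```python
import math

def self_pair_log_odds(q_pair: float, background: float, scale: float = 2.0) -> float:
    """Log-odds score for aligning a residue with itself:
    scale * log2(observed pair frequency / background**2)."""
    return scale * math.log2(q_pair / (background * background))

q_hh = 0.01  # hypothetical observed H-H pair frequency, identical in both groups
primate_score = self_pair_log_odds(q_hh, 0.08)  # Histidine is 8% of the proteome
navi_score = self_pair_log_odds(q_hh, 0.02)     # Histidine is 2% of the proteome
print(primate_score, navi_score, navi_score - primate_score)
```

Because q_hh cancels out of the difference, the gap between the two scores depends only on the ratio of the background frequencies, whatever the true pair frequency is.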
4. Epigenetics
A) (2pts) Your colleague Professor Eugene Mathew Lateed generated a genome-wide DNA
methylation map for normal colon cells using MRE-seq and MeDIP-seq. In an intergenic region,
he found an interesting locus. This locus is about 20 kb long. On one end of the locus, there is a 2 kb
CpG-rich stretch that has both intermediate MRE-seq and MeDIP-seq signals. The remaining 18 kb
has a high level of MeDIP-seq signal. Based on what you learnt in class, why might you suspect
that this region encodes a novel gene?
(2 pts) You decide to look at histone modification patterns across this region for more evidence.
There are several genome-wide datasets available for this cell type: H3K4me1, H3K4me3,
H3K27me3, H3K9me3, H3K36me3, and H3K9Ac. Which histone mark would you investigate for
this locus and why? Suggest at least one other source of data that may help you, and why you
think it may help?
B) You decide to use bisulfite sequencing to validate the methylation status of the 2kb region
that has both intermediate MRE-seq and MeDIP-seq signal. Bisulfite treatment converts
unmethylated C to T. You do this experiment in both normal colon cells and a colon cancer cell
line. On the next page is the data after aligning reads from bisulfite treated DNA to the region
from both normal colon and colon cancer cell lines. For simplicity, we only consider one strand.
Template:
GATCGTGCACGATCTCGGCAATTCGGGATGCCGGCTCGTCACCGGTCGCT
Reads(normal)
GATTGTGTATGATTTTGGTAATTTGGG
GATTGTGTATGATTTTGGTAATTTGGG
GATTGTGTATGATTTTGGTAATTTGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATTGTGTATGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATTGTGTATGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GTATGATTTTGGTAATTTGGGATGTTGGTTTG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTATGATTTTGGTAATTTGGGATGTTGGTTTG
GTATGATTTTGGTAATTTGGGATGTTGGTTTG
GTATGATTTTGGTAATTTGGGATGTTGGTTTG
GTATGATTTTGGTAATTTGGGATGTTGGTTTG
GTATGATTTTGGTAATTTGGGATGTTGGTTTG
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
TGGGATGTTGGTTTGTTATTGGTTGTT
TGGGATGTTGGTTTGTTATTGGTTGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
TGGGATGTTGGTTTGTTATTGGTTGTT
TGGGATGTTGGTTTGTTATTGGTTGTT
TGGGATGTTGGTTTGTTATTGGTTGTT
Template:
GATCGTGCACGATCTCGGCAATTCGGGATGCCGGCTCGTCACCGGTCGCT
Reads(cancer)
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GATCGTGTACGATTTCGGTAATTCGGG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
GTACGATTTCGGTAATTCGGGATGTCGGTTCG
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
CGGGATGTCGGTTCGTTATCGGTCGTT
(2 pts) Calculate the level of methylation, defined as percentage methylated, of the 8 CpG sites
in both normal cells and in cancer cells.
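A per-site methylation calculation like the one asked for above can be sketched as below. This is a minimal illustration under stated assumptions: reads come from the same strand as the template, a read is placed at the first offset where it matches the template allowing only C-to-T conversions, and a retained C at a CpG is counted as methylated. Function names are hypothetical, and only two of the reads from the normal sample are used as a demonstration.

```python
TEMPLATE = "GATCGTGCACGATCTCGGCAATTCGGGATGCCGGCTCGTCACCGGTCGCT"

def cpg_positions(template):
    """0-based positions of the C in every CG dinucleotide."""
    return [i for i in range(len(template) - 1) if template[i:i + 2] == "CG"]

def fits(template, read, offset):
    """True if the read matches the template at offset, allowing C -> T changes."""
    for k, base in enumerate(read):
        ref = template[offset + k]
        if base != ref and not (ref == "C" and base == "T"):
            return False
    return True

def place_read(template, read):
    """First offset at which the bisulfite-converted read fits, or None."""
    for offset in range(len(template) - len(read) + 1):
        if fits(template, read, offset):
            return offset
    return None

def methylation_percent(template, reads):
    """Percent of covering reads that keep a C at each CpG (None if uncovered)."""
    sites = cpg_positions(template)
    methylated = dict.fromkeys(sites, 0)
    covered = dict.fromkeys(sites, 0)
    for read in reads:
        offset = place_read(template, read)
        if offset is None:
            continue  # read does not fit anywhere under the C -> T rule
        for site in sites:
            k = site - offset
            if 0 <= k < len(read):
                covered[site] += 1
                methylated[site] += read[k] == "C"
    return {s: (100.0 * methylated[s] / covered[s] if covered[s] else None)
            for s in sites}

demo_reads = ["GATTGTGTATGATTTTGGTAATTTGGG",   # fully converted read
              "GATCGTGTACGATTTCGGTAATTCGGG"]   # CpG cytosines retained
print(methylation_percent(TEMPLATE, demo_reads))
```

Passing the full list of normal or cancer reads in place of demo_reads gives the per-site percentages for each sample; reads that keep or lose their CpG cytosines together across a whole read are also what one would inspect when looking for haplotype-level patterns.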
(2 pts) Professor Lateed suspects this locus may be imprinted. Does the bisulfite data support
his idea? In other words is there evidence for an extended methylated or unmethylated
haplotype across this region? Describe how you might obtain additional support for the
imprinted status of this promoter.
(2 pts) If this novel gene plays a role in tumorigenesis, do you think it is a tumor suppressor
gene or an oncogene? Why? Propose at least one mechanism that could lead to the observed
change in the cancer cell line.
5. Expression profiling
A) (2 pts) What are the two kinds of generally accepted RNA editing?
B) (3 pts) What are three kinds of normalization that we discussed? [hint, there were two for
microarray, and one in the context of RNAseq] What kind of biases do they each correct for?
C) (1 pt) Under what circumstances does correlation fail to detect co-regulated genes?
D) (1 pt) What is an advantage of using Gene Ontologies for analyzing gene lists?
E) (1pt) What is a disadvantage?
F) (1 pt) What is the difference between biological variation and technical variation?
G) (1 pt) You have 200 cancer samples and decide to cluster them with a k-means clustering
algorithm. You use k=3. The algorithm returns that 100 of the samples are in a cluster A, 72 in
B, and 28 in C. What can you conclude from this analysis about the different subtypes of
cancer within your sample?
1. (3 pts) Compare and contrast the advantages of microarrays vs. RNAseq (0.5pts per
difference between platforms)
2. (3 pts) When should a Fisher’s exact test be used? Describe two examples of analytical
situations where a Fisher’s exact test could be used.
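For readers who want to see what the test computes, here is a minimal sketch of a one-sided Fisher's exact test on a 2x2 table, built directly from the hypergeometric distribution with the standard library. The function name and the example counts are hypothetical; a statistics package would normally be used in practice.

```python
from math import comb

def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided p-value for the 2x2 table [[a, b], [c, d]]: the probability of
    seeing a count of at least a in the top-left cell with all margins fixed."""
    row1 = a + b          # e.g. genes in the list of interest
    col1 = a + c          # e.g. genes carrying the annotation
    n = a + b + c + d     # all genes considered
    denom = comb(n, col1)
    upper = min(row1, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1)) / denom

# Hypothetical example: 3 of 4 genes in a list carry an annotation,
# versus 1 of 4 genes outside the list.
print(fisher_exact_greater(3, 1, 1, 3))
```

The exact sum over tables is what distinguishes this test from a chi-square approximation, which is why it suits small counts.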
3. (3pts) What does FPKM stand for? What normalization does it provide? Calculate the FPKM
for Your Favorite Gene 1 (YFG1), given that:
YFG1 has three exons of 500 bp each and two introns of 1000 bp, and codes for a protein of
125 a.a. (thus it has substantial UTR regions). YFG1 has no alternative splicing.
YFG1 has a 57% GC content.
Your library has 3 million reads in it, and is a paired-end library.
3000 reads map to YFG1. Pairs are perfectly matched.
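The FPKM arithmetic can be sketched as below. The interpretation is an assumption, not something the question states outright: the transcript length is taken as the summed exon length (introns excluded), and each perfectly matched read pair is counted as one fragment, so 3000 reads become 1500 fragments and 3 million reads become 1.5 million fragments. The function name is illustrative.

```python
def fpkm(fragments_on_gene: float, transcript_length_bp: float,
         total_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_fragments / 1_000_000
    return fragments_on_gene / kilobases / millions

exon_length = 3 * 500            # three exons of 500 bp; introns are not counted
gene_fragments = 3000 / 2        # assumption: two reads per fragment
library_fragments = 3_000_000 / 2
print(fpkm(gene_fragments, exon_length, library_fragments))
```

Counting reads instead of fragments in both the numerator and the library size gives the same ratio, so the result only changes if the two counts are treated inconsistently.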
4. (1 pt) What is your favorite kind of clustering, and why? Mention at least two kinds, and why
you like one more than the other.
6. DNA Protein interaction: motif finding
Consider a peculiar family of modular transcription factors that recognize a single base pair per
module. For two members of this family, TF-1 and TF-2, the base pair frequencies across all
known binding sites have been determined:
Frequency of:   A     C     G     T
TF-1            1/2   1/4   0     1/4
TF-2            0     1/4   1/2   1/4
Orthologs of TF-1 and TF-2 proteins are found in different genomes with wildly different base
frequencies in the sense strands of promoter regions. The base frequencies for two such
genomes, G-1 and G-2, are given here:
Background
Frequency of:   A     C     G     T
G-1             1/4   1/4   1/4   1/4
G-2             1/4   1/2   1/8   1/8

Using the equation $I_{seq} = \sum_{j} \sum_{b} f(b,j) \log_2 \frac{f(b,j)}{p(b)}$, where b is the base at a particular position j
in the binding motif, answer the following questions.
(5 pts) Does the binding module TF-1 have higher information content in one genome or the
other?
(5 pts) Which transcription factor module contains more information in genome 2?
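The information content formula in this section can be evaluated directly. The sketch below assumes each module covers a single position (so the sum over j has one term) and uses the convention that a base with frequency 0 contributes nothing to the sum. The dictionary and function names are illustrative; the frequencies are the ones given in the tables of this question.

```python
import math

TF1 = {"A": 1/2, "C": 1/4, "G": 0.0, "T": 1/4}
TF2 = {"A": 0.0, "C": 1/4, "G": 1/2, "T": 1/4}
G1 = {"A": 1/4, "C": 1/4, "G": 1/4, "T": 1/4}
G2 = {"A": 1/4, "C": 1/2, "G": 1/8, "T": 1/8}

def information_content(site_freqs, background):
    """Sum over bases of f(b) * log2(f(b) / p(b)) for one motif position.
    Terms with f(b) == 0 are skipped (their limit is 0)."""
    return sum(f * math.log2(f / background[base])
               for base, f in site_freqs.items() if f > 0)

for tf_name, tf in (("TF-1", TF1), ("TF-2", TF2)):
    for genome_name, genome in (("G-1", G1), ("G-2", G2)):
        print(tf_name, genome_name, information_content(tf, genome))
```

For a motif with several positions, the same function would be called once per position and the results added, matching the outer sum over j.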