iGenetics A Molecular Approach Peter J. Russell Third


Pearson Education Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England and Associated Companies throughout the world
Visit us on the World Wide Web at: www.pearsoned.co.uk
© Pearson Education Limited 2014
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the
prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom
issued by the Copyright Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.
All trademarks used herein are the property of their respective owners. The use of any trademark
in this text does not vest in the author or publisher any trademark ownership rights in such
trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this
book by such owners.
ISBN 10: 1-292-02633-2
ISBN 13: 978-1-292-02633-6
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Printed in the United States of America
Genomics: The Mapping and Sequencing of Genomes
with chemicals (e.g., alkaline conditions) and/or heat
is critical to many methods used to produce and analyze cloned DNA. Give three examples of methods that
rely on complementary base pairing, and explain what
role complementary base pairing plays in each of these
methods.
3 Restriction endonucleases are naturally found in bacteria. What purposes do they serve?
*4 A new restriction endonuclease is isolated from a bacterium. This enzyme cuts DNA into fragments that average 4,096 base pairs long. Like many other known restriction enzymes, the new one recognizes a sequence in
DNA that has twofold rotational symmetry. From the information given, how many base pairs of DNA constitute
the recognition sequence for the new enzyme?
*5 An endonuclease called AvrII (“a-v-r-two”) cuts
DNA whenever it finds the sequence
5′-CCTAGG-3′
3′-GGATCC-5′
a. About how many cuts would AvrII make in the
human genome, which contains about 3 × 10^9 base
pairs of DNA and in which 40% of the base pairs are
G–C?
b. On average, how far apart (in base pairs) will two
AvrII sites be in the human genome?
c. In the cellular slime mold Dictyostelium discoideum,
about 80% of the base pairs in regions between genes
are A–T. On average, how far apart (in base pairs)
will two AvrII sites be in these regions?
6 About 40% of the base pairs in human DNA are G–C.
On average, how far apart (in base pairs) will the following sequences be?
a. two BamHI sites
b. two EcoRI sites
c. two NotI sites
d. two HaeIII sites
*7 The average size of fragments (in base pairs) observed after genomic DNA from eight different species
was individually cleaved with each of six different restriction enzymes is shown in Table B.
a. Assuming that each genome has equal amounts of A,
T, G, and C, and that on average these bases are
uniformly distributed, what average fragment size is
expected following digestion with each enzyme?
b. How might you explain each of the following?
i. There is a large variation in the average fragment
sizes when different genomes are cut with the
same enzyme.
ii. There is a large variation in the average fragment
sizes when the same genome is cut with different
enzymes that recognize sites having the same
length (e.g., ApaI, HindIII, SacI, and SspI).
iii. Both SrfI and NotI, which each recognize an 8-bp
site, cut the Mycobacterium genome more frequently than SspI and HindIII, which each recognize a 6-bp site.
*8 What features are required in all vectors used to propagate cloned DNA? What different types of cloning vectors
are there, and how do these differ from each other?
9 The plasmid pBluescript II is a plasmid cloning vector
used in E. coli. What features does it have that make it
useful for constructing and cloning recombinant DNA
molecules? Which of these features are particularly useful during the sequencing of a genome?
*10 A colleague has sent you a 2-kb DNA fragment excised from a plasmid cloning vector with the enzyme PstI
(see Table 1 for a description of this enzyme and the
restriction site it recognizes).
a. List the steps you would take to clone the DNA fragment into the plasmid vector pBluescript II (shown in
Figure 4), and explain why each step is necessary.
b. How would you verify that you have cloned the
fragment?
Table B
Enzyme and Recognition Sequence (average fragment size, bp)

Species                      ApaI     HindIII  SacI     SspI     SrfI        NotI
                             GGGCCC   AAGCTT   GAGCTC   AATATT   GCCCGGGC    GCGGCCGC
Escherichia coli             68,000   8,000    31,000   2,000    120,000     200,000
Mycobacterium tuberculosis   2,000    18,000   4,000    32,000   10,000      4,000
Saccharomyces cerevisiae     15,000   3,000    8,000    1,000    570,000     290,000
Arabidopsis thaliana         52,000   2,000    5,000    1,000    no sites    610,000
Caenorhabditis elegans       38,000   3,000    5,000    800      1,110,000   260,000
Drosophila melanogaster      13,000   3,000    6,000    900      170,000     83,000
Mus musculus                 5,000    3,000    3,000    3,000    120,000     120,000
Homo sapiens                 5,000    4,000    5,000    1,000    120,000     260,000

*11 E. coli, like all bacterial cells, has its own restriction
endonucleases that could interfere with the propagation
of foreign DNA in plasmid vectors. For example, wild-type
E. coli has a gene, hsdR, that encodes a restriction endonuclease that cleaves DNA that is not methylated at certain A residues. Why is it important to inactivate this enzyme by mutating the hsdR gene in strains of E. coli that
will be used to propagate plasmids containing recombinant DNA?
12 E. coli is a commonly used host for propagating DNA
sequences cloned into plasmid vectors. Wild-type E. coli
turns out to be an unsuitable host, however: the plasmid
vectors are “engineered,” and so is the host bacterium. For
example, nearly all strains of E. coli used for propagating recombinant DNA molecules carry mutations in the recA
gene. The wild-type recA gene encodes a protein that is central to DNA recombination and DNA repair. Mutations in
recA eliminate general recombination in E. coli and render
E. coli sensitive to UV light. How might a recA mutation
make an E. coli cell a better host for propagating a plasmid
carrying recombinant DNA? (Hint: What type of events involving recombinant plasmids and the E. coli chromosome
will recA mutations prevent?) What additional advantage
might there be to using recA mutants, considering that
some of the E. coli cells harboring a recombinant plasmid
could accidentally be released into the environment?
*13 Genomic libraries are important resources for isolating genes and for studying the functional organization of
chromosomes. List the steps you would use to make a genomic library of yeast in a plasmid vector. In what fundamental way would you modify this procedure if you were
making the library in a BAC vector?
14 Three students are working as a team to construct a
plasmid library from Neurospora genomic DNA. They
want the library to have, on average, about 4-kb inserts.
Each student proposes a different strategy for constructing the library, as follows:
Mike: Cleave the DNA with a restriction enzyme
that recognizes a 6-bp site, which appears about
once every 4,096 bp on average and leaves sticky,
overhanging ends. Ligate this DNA into the plasmid vector cut with the same enzyme, and transform the ligation products into bacterial cells.
Marisol: Partially digest the DNA with a restriction
enzyme that cuts DNA very frequently, say once
every 256 bp, and that leaves sticky overhanging
ends. Select DNA that is about 4 kb in size (e.g.,
purify fragments this size after the products of
the digest are resolved by gel electrophoresis).
Then, ligate this DNA to a plasmid vector cleaved
with a restriction enzyme that leaves the same
sticky overhangs and transform the ligation products into bacterial cells.
Hesham: Irradiate the DNA with ionizing radiation,
which will cause double-stranded breaks in the
DNA. Determine how much irradiation should be
used to generate, on average, 4-kb fragments and
use this dose. Ligate linkers to the ends of the
irradiated DNA, digest the linkers with a restriction enzyme to leave sticky overhanging ends, ligate the DNA to a similarly digested plasmid vector, and then transform the ligation products into
bacterial cells.
Which student’s strategy will ensure that the inserts are
representative of all of the genomic sequences? Why are
the other students’ strategies flawed?
*15 Some restriction enzymes leave sticky ends, while
others leave blunt ends. It is more efficient to clone DNA
fragments with sticky ends than DNA fragments with
blunt ends. What is the best way to efficiently clone a set
of DNA fragments having blunt ends?
*16 The human genome contains about 3 × 10^9 bp of
DNA. How many 200-kb fragments would you have to
clone into a BAC library to have a 90% probability of including a particular sequence?
17 A biochemist studies a protein with antifreeze properties that he found in an Antarctic fish. After determining
part of the protein’s amino acid sequence, he decides he
would like to obtain the DNA sequence of its gene. He has
no experience in genome analysis and mistakenly thinks
he needs to sequence the entire genome of the fish to obtain this information. When he asks a more knowledgeable colleague about how to sequence the fish genome,
she describes the whole-genome shotgun approach and
the need to obtain about 7-fold coverage. The biochemist
decides that this approach provides far more information
than he needs and so embarks on an alternate approach
he thinks will be faster. He decides to sequence individual
clones chosen at random from a library made with genomic DNA from the Antarctic fish. After sequencing the
insert of a clone, he will analyze it to see if it contains an
ORF with the sequence of amino acids he knows are present in the antifreeze protein. If it does, he will have found
what he wants and will not sequence any additional
clones. If it does not, he plans to keep obtaining and analyzing the sequences of individual clones sequentially
until he finds a clone that has the sequence of interest. He
thinks this approach will let him sequence fewer clones
and be faster than the whole-genome shotgun approach.
He must decide which vector to use in building his
genomic library. He can construct a library made in the
pBluescript II vector with inserts that are, on average, 7
kb, a library made in the vector pBeloBAC11 with inserts
that are, on average, 200 kb, and a library made in a YAC
vector with inserts that are, on average, 1 Mb. He assumes that any library he constructs will have an equally
good representation of the 2 × 10^9 base pairs in a haploid
copy of the fish genome, that the antifreeze gene is less
than 2 kb in size, and that (somehow) he can easily obtain the sequence of the DNA inserted into a clone.
a. Given the biochemist’s assumptions, what is the
chance that he will find the antifreeze gene if he
sequences the insert of just one clone from each
library? Based on this information, which library
should he use if he wants to sequence the fewest
number of clones?
b. When he tries to sequence the insert of the first clone
he picks from the library suggested by a colleague
in (a), he realizes that he does not enjoy this
type of lab work. So, he hires a technician with experience in genomics, assigns the project to her, and goes to
Antarctica to catch more fish. He tells her to sequence
the inserts of enough clones to be 95% certain of obtaining at least one insert containing the antifreeze gene
and says he will analyze all of the sequence data for the
presence of the antifreeze gene after he returns. How
many clones should she sequence to satisfy this requirement if he constructed the genomic library in a
plasmid vector? a BAC vector? a YAC vector?
c. What advantages and disadvantages does each of the
different vectors have for constructing libraries with
cloned genome DNA?
d. Suppose the Antarctic fish has a very AT-rich genome
and the biochemist propagated the genomic library
using E. coli. Will the library be representative of all
the sequences in the genome of the fish?
*18 When Celera Genomics sequenced the human
genome, they obtained 13,543,099 reads of plasmids
having an average insert size of 1,951 bp, and
10,894,467 reads of plasmids having an average insert
size of 10,800 bp.
a. Dideoxy sequencing provides only about 500–550 nucleotides of sequence. About how many nucleotides of
sequence did Celera obtain from sequencing these two
plasmid libraries? To what fold coverage does this
amount of sequence information correspond?
b. Why did they sequence plasmids from two libraries
with different-sized inserts?
c. They sequenced only the ends of each insert. How
did they determine the sequence lying between the
sequenced ends?
*19
a. What features of pBluescript II facilitate obtaining the
sequence at the ends of an insert?
b. Devise a strategy to obtain the entire sequence of a
7-kb insert in pBluescript II.
c. Devise a strategy to obtain the entire sequence of a
200-kb insert in pBeloBAC11.
20 Explain how the whole-genome shotgun approach to
sequencing a genome differs from the biochemist’s approach described in Question 17. What information
does it provide that the biochemist’s approach does not?
What does it mean to obtain 7-fold coverage, and why
did his colleague advise him to do this?
*21 In a sequencing reaction using dideoxynucleotides
that are labeled with different fluorescent dyes, the DNA
chains produced by the reaction are separated by size
using capillary gel electrophoresis and then detected by
a laser eye as they exit the capillary. A computer then
converts the differently colored fluorescent peaks into a
pseudocolored trace. Suppose green is used for A, black
for G, red for T, and blue for C. What pattern of peaks do
you expect to see on a sequencing trace if you carry out
a dideoxy sequencing reaction after the primer
5′-CTAGG-3′ is annealed to the following single-stranded DNA fragment?
3′-GATCCAAGTCTACGTATAGGCC-5′
22 How does pyrosequencing differ from dideoxy chain-termination sequencing? What advantages does it have
for large-scale sequencing projects?
23 Do all SNPs lead to an alteration in phenotype? Explain why or why not.
24 Researchers at Perlegen Sciences sought to identify
tag SNPs on human chromosome 21. After determining
the genotypes at 24,047 common SNPs in 20 hybrid cell
lines containing a single, different human chromosome
21, they used computerized algorithms to identify haplotypes containing between 2 and 114 SNPs that cover
the entire chromosome. A total of 2,783 tag SNPs were
selected from SNPs within these blocks.
a. What is a SNP marker?
b. How do haplotypes arise in members of a population?
c. What is a hapmap?
d. What is a tag SNP?
e. What advantages were there for the researchers to use
hybrid cell lines instead of genomic DNA from 20 different individuals?
f. The 20 individuals whose chromosome 21 was used
in this analysis were unrelated and had different ethnic origins. Do you expect the haplotypes and number of tag SNPs to differ if
i. the cell lines were established from blood samples
drawn at a large family reunion?
ii. the cell lines were established from unrelated individuals, but their ancestors originated in the same geographical region?
*25 A set of hybrid cell lines containing a single copy of
the same human chromosome from 10 different individuals was genotyped for 26 SNPs, A through Z. The SNPs
are present on the chromosome in the order A, B, C, . . .
Z. Table C lists the SNP alleles present in each cell line.
State which SNPs can serve as tag SNPs, and which haplotypes they identify. What is the minimum number of
tag SNPs needed to differentiate between the haplotypes
present on this chromosome?
Table C
Cell Line   SNP alleles (A through Z)
1    A1 B1 C3 D4 E1 F2 G3 H1 I3 J2 K1 L2 M1 N2 O1 P2 Q2 R3 S1 T1 U2 V2 W2 X1 Y2 Z1
2    A1 B1 C3 D4 E1 F1 G2 H1 I1 J1 K1 L1 M1 N2 O1 P1 Q2 R1 S2 T1 U1 V2 W3 X2 Y1 Z1
3    A2 B2 C1 D3 E2 F2 G3 H1 I3 J2 K1 L2 M2 N1 O1 P2 Q2 R3 S1 T1 U2 V2 W1 X1 Y4 Z2
4    A3 B3 C2 D2 E2 F2 G3 H1 I3 J2 K1 L2 M1 N2 O1 P1 Q2 R1 S2 T1 U1 V2 W2 X1 Y2 Z1
5    A1 B2 C1 D1 E3 F2 G1 H2 I2 J2 K2 L1 M1 N2 O1 P2 Q2 R3 S1 T1 U2 V2 W1 X3 Y3 Z2
6    A3 B3 C2 D2 E2 F1 G2 H1 I1 J1 K1 L1 M2 N1 O2 P1 Q1 R2 S1 T1 U2 V2 W3 X2 Y1 Z1
7    A2 B2 C1 D3 E2 F2 G1 H2 I2 J2 K2 L1 M2 N1 O1 P1 Q2 R1 S2 T1 U1 V2 W1 X3 Y3 Z2
8    A3 B3 C2 D2 E2 F2 G3 H1 I3 J2 K1 L2 M1 N2 O1 P1 Q2 R1 S2 T1 U1 V2 W1 X1 Y4 Z2
9    A1 B1 C3 D4 E1 F2 G1 H2 I2 J2 K1 L2 M2 N1 O1 P2 Q2 R3 S1 T1 U2 V2 W3 X2 Y1 Z1
10   A2 B2 C1 D3 E2 F2 G3 H1 I3 J2 K1 L2 M1 N2 O2 P1 Q1 R2 S1 T1 U2 V2 W1 X3 Y3 Z2

26 Some features that we commonly associate with racial
identity, such as skin pigmentation, hair shape, and facial
morphology, have a complex genetic basis. However, it
turns out that these features are not representative of the
genetic differences between racial groups—individuals
assigned to different racial categories share many more
DNA polymorphisms than not—supporting the contention that race is a social and not a biological construct.
How could you use DNA chips to quantify the percentage
of SNPs that are shared between individuals assigned to
different racial groups?
*27 Mutations in the dystrophin gene can lead to
Duchenne muscular dystrophy. The dystrophin gene is
among the largest known: it has a primary transcript that
spans 2.5 Mb, and it produces a mature mRNA that is
about 14 kb. Many different mutations in the dystrophin
gene have been identified. What steps would you take if
you wanted to use a DNA microarray to identify the specific dystrophin gene mutation present in a patient with
Duchenne muscular dystrophy?
28 Three of the steps in the analysis of a genome’s sequence
are assembly, finishing, and annotation. What is involved in
each step, and how do they differ from each other?
29 What is a cDNA library, and from what cellular material is it derived? How is a cDNA synthesized, and how
do the steps used to clone a cDNA differ from the steps
used to clone genomic DNA? How are cDNA sequences
used to help annotation of a sequenced genome?
*30 Eukaryotic genomes differ in their repetitive DNA
content. For example, consider the typical euchromatic
50-kb segment of human DNA that contains the human β
T-cell receptor. About 40% of it is composed of various
genome-wide repeats, about 10% encodes three genes
(with introns), and about 8% is taken up by a pseudogene. Compare this to the typical 50-kb segment of yeast
DNA containing the HIS4 gene. There, only about 12% is
composed of a genome-wide repeat, and about 70% encodes genes (without introns). The remaining sequences
in each case are untranscribed and either contain regulatory signals or have no discernible information. Whereas
some repetitive sequences can be interspersed throughout
gene-containing euchromatic regions, others are abundant near centromeres. What problems do these repetitive
sequences pose for sequencing eukaryotic genomes?
When can these problems be overcome, and how?
31 What is the difference between a gene and an ORF?
Explain whether all ORFs correspond to a true gene, and
if they do not, what challenges this poses for genome annotation.
*32 Once a genomic region is sequenced, computerized
algorithms can be used to scan the sequence to identify
potential ORFs.
a. Devise a strategy to identify potential prokaryotic
ORFs by listing features accessible by an algorithm
checking for ORFs.
b. Why does the presence of introns within transcribed
eukaryotic sequences preclude direct application of
this strategy to eukaryotic sequences?
c. The average length of exons in humans is about
100–200 bp, while the length of introns can range
from about 100 to many thousands of base pairs. What
challenges do these findings pose for identifying exons
in uncharacterized regions of the human genome?
d. How might you modify your strategy to overcome
some of the problems posed by the presence of introns in transcribed eukaryotic sequences?
33 Annotation of genomic sequences makes them much
more useful to researchers. What features should be included in an annotation, and in what different ways can
they be depicted? For some examples of current annotations in databases, see the following websites:
http://www.yeastgenome.org/
http://flybase.org (Drosophila)
http://www.tigr.org/tdb/e2k1/ath1/ (Arabidopsis)
http://www.ncbi.nlm.nih.gov/genome/guide/human/
(humans)
http://genome.ucsc.edu/cgi-bin/hgGateway (humans)
http://www.h-invitational.jp/
*34 One powerful approach to annotating genes is to
compare the structures of cDNA copies of mRNAs to the
genomic sequences that encode them. Indeed, a large collaboration involving 68 research teams analyzed 41,118
full-length cDNAs to annotate the structure of 21,037
human genes (see http://www.h-invitational.jp/).
a. What types of information can be obtained by comparing the structures of cDNAs with genomic DNA?
b. During the synthesis of cDNA (see Figure 15), reverse
transcriptase may not always copy the entire length of
the mRNA and so a cDNA that is not full-length can
be generated. Why is it desirable, when possible, to
use full-length cDNAs in these analyses?
c. The research teams characterized the number of loci
per Mb of DNA for each chromosome. Among the autosomes, chromosome 19 had the highest ratio of 19
loci per Mb while chromosome 13 had the lowest
ratio of 3.5 loci per Mb. Among the sex chromosomes,
the X had 4.2 loci per Mb while the Y had only 0.6 loci
per Mb. What does this tell you about the distribution
of genes within the human genome? How can these
data be reconciled with the idea that chromosomes
have gene-rich regions as well as gene deserts?
d. When the research teams completed their initial analysis, they were able to map 40,140 cDNAs to the available human genome sequence. Another 978 cDNAs
could not be mapped. Of these 978 cDNAs, 907
cDNAs could be roughly mapped to the mouse
genome. Why might some (human) cDNAs be unable
to be mapped to the human genome sequence that was
available at the time although they could be mapped to
the mouse genome sequence? (Hint: Consider where
errors and limited information might exist.)
*35 How has genomic analysis provided evidence that
Archaea is a branch of life distinct from Bacteria and
Eukarya?
36 The genomes of many different organisms, including
bacteria, rice, and dogs, have been sequenced. Choose
three phylogenetically diverse organisms. Compare the
rationales for sequencing their genomes, and describe
what we have learned from sequencing each genome.
37 In which type of organisms does gene number appear
to be related to genome size? Explain why this is not the
case in all organisms.
38 The C-value paradox states that there is no obvious
relationship between an organism’s haploid DNA content
and its organizational and structural complexity. Discuss,
citing data from the genome sequencing, whether there is
also a gene-number paradox or a gene-density paradox.
39 In the United States, 3–5% of public funds used to
support the Human Genome Project were devoted to research to address its ethical, legal, social, and policy implications. Some of the results are described in the website
http://www.ornl.gov/sci/techresources/Human_Genome/elsi/elsi.shtml. After exploring this website, answer the
following questions.
a. Summarize the main ethical, legal, social, and policy
issues associated with the Human Genome Project.
b. Why is legislation necessary to protect an individual’s genetic privacy? What such legislation currently
exists?
c. What are the pros and cons of gene testing?
d. Both presymptomatic and symptomatic individuals
are subject to gene testing for an inherited disease.
How are gene tests used in each situation, and how
do the concerns about using gene testing differ in
these situations?
e. Are laboratories that conduct genetic testing regulated by law?
Solutions to Selected Questions and Problems
2
Examples of methods that utilize the hydrogen bonding
in complementary base pairing include: (1) the binding of complementary sticky ends present in a cloning vector and a DNA
fragment prior to their ligation by DNA ligase; (2) the annealing of a labeled nucleic acid to a complementary single-stranded DNA fragment on a microarray; (3) the annealing of
an oligo(dT) primer to a poly(A) tail during the synthesis of
cDNA from mRNA; and (4) the annealing of a primer to a template during a DNA sequencing reaction. In each case, base
pairing allows for nucleotides to interact in a sequence-specific
manner essential for the procedure’s success. For example, the
binding of a primer to a template at the start of a DNA sequencing reaction requires complementary base pairing between the
sequences in the primer and the template, which in turn defines
where the DNA sequencing reaction will start.
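The primer example can be sketched in a few lines of Python. This is only an illustration, not part of the original solution: the function names are hypothetical, the template is assumed to be written 3′ to 5′ and the primer 5′ to 3′, and the sequences are the ones given in Question 21.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Return the base-by-base complement of seq (same left-to-right order)."""
    return seq.upper().translate(COMPLEMENT)


def find_annealing_site(primer_5to3, template_3to5):
    """Return the index in the template (written 3'->5') where the primer
    (written 5'->3') can base-pair along its full length, or -1 if none."""
    # With the two strands written antiparallel, position i of the primer
    # pairs with position i of the matching template stretch, so we simply
    # search the template for the primer's complement.
    return template_3to5.upper().find(complement(primer_5to3))


if __name__ == "__main__":
    primer = "CTAGG"                       # 5'-CTAGG-3'
    template = "GATCCAAGTCTACGTATAGGCC"    # 3'-...-5'
    print(find_annealing_site(primer, template))
```

The index returned marks where the primer sits on the template, and therefore where synthesis of the new strand would begin.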
4
The average length of the fragments produced indicates
how often, on average, the restriction site appears. If the DNA is
composed of equal amounts of A, T, C, and G, the chance of
finding one specific base pair (A–T, T–A, G–C, or C–G) at a
particular site is 1/4. The chance of finding two specific base
pairs at a site is (1/4)^2. In general, the chance of finding n
specific base pairs at a site is (1/4)^n. Here, 1/4,096 = (1/4)^6, so the
enzyme recognizes a 6-bp site.
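The same reasoning can be checked numerically. The sketch below assumes, as the solution does, that all four bases are equally frequent and independently distributed; the function names are illustrative only.

```python
import math


def expected_spacing(site_length_bp, base_prob=0.25):
    """Average distance (bp) between sites of the given length when every
    position matches with probability base_prob."""
    return 1 / (base_prob ** site_length_bp)


def site_length(avg_fragment_bp, base_prob=0.25):
    """Solve base_prob**n = 1/avg_fragment_bp for n, the site length in bp."""
    return math.log(avg_fragment_bp) / math.log(1 / base_prob)


if __name__ == "__main__":
    # Fragments averaging 4,096 bp imply a site of about 6 bp.
    print(round(site_length(4096)))
    # Conversely, a 6-bp site is expected about once every 4**6 bp.
    print(expected_spacing(6))
```

Because the logarithm is computed in floating point, the result of `site_length` is rounded before being reported as a whole number of base pairs.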
5
a. Since 40% of the genome is composed of G–C
pairs, P(G) = P(C) = 0.20 and P(A) = P(T) = 0.30. Therefore,
P(CCTAGG) = (0.20)^4 × (0.30)^2 = 0.000144. A genome
with 3 × 10^9 base pairs will have about 3 × 10^9 different groups
of 6-bp sequences. Thus, the number of sites is (0.000144) ×
(3 × 10^9) = 432,000.
b. 3 × 10^9 bp/432,000 sites = 1/0.000144 = 6,944 bp between sites.
c. With 80% A–T base pairs, P(G) = P(C) = 0.10 and
P(A) = P(T) = 0.40, so P(CCTAGG) = (0.10)^4 × (0.40)^2 = 0.000016. Two
AvrII sites are therefore expected to be about 1/0.000016 = 62,500 bp
apart.
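The calculation for a genome with unequal base composition generalizes to any recognition sequence. The following is a minimal sketch, assuming that bases occur independently and that G and C (and likewise A and T) are equally frequent; the function names and the `gc_fraction` parameter are hypothetical, chosen for illustration.

```python
def site_probability(site, gc_fraction):
    """Probability that a random position starts the given recognition
    sequence, assuming independent bases with the stated G+C fraction."""
    p_gc = gc_fraction / 2          # P(G) = P(C)
    p_at = (1 - gc_fraction) / 2    # P(A) = P(T)
    prob = 1.0
    for base in site.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return prob


def expected_sites(site, gc_fraction, genome_bp):
    """Expected number of sites in a genome of genome_bp base pairs."""
    return site_probability(site, gc_fraction) * genome_bp


def expected_spacing(site, gc_fraction):
    """Average distance in bp between neighboring sites."""
    return 1 / site_probability(site, gc_fraction)


if __name__ == "__main__":
    # AvrII (CCTAGG) in a 3 x 10^9 bp genome that is 40% G-C
    print(round(expected_sites("CCTAGG", 0.40, 3e9)))
    print(round(expected_spacing("CCTAGG", 0.40)))
    # The same site in intergenic DNA that is 80% A-T (20% G-C)
    print(round(expected_spacing("CCTAGG", 0.20)))
```

The same functions could be applied to the enzymes in Question 6 by passing each recognition sequence with a G+C fraction of 0.40.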