Barton, G. J. and Sternberg, (1987)

Transcription

Barton, G. J. and Sternberg, (1987)
Protein Engineeringvol.l no.2 pp.89-94, 1987
Evaluation and improvementsin the automatic alignment of
protein sequences
from chicken alpha and beta haemoglobin,varying the usersuppliedgap-penaltyparameterswill producea numberof difLaboratory of Molecular Biology, Department of Crystallography, Birkbeck
ferent
optimalalignments(Fitch and Smith, 1983).
College, Malet Street, London WCIE 7HX, UK
workershave notedthat alignmentsobtainedby the
Several
'To whom reprint requestsshould
be sent
superpositionof homologoushigh-resolutionprotein structures
The accuracy of protein sequencealignment obtained by
can be quite different to tlose producedby automaticsequence
applyrng a commonly used global sequencecomparison basedmethods(e.g.Dickersonet al., 1976;Chothiaand[.esk,
algorithm is assessed.
Alignmentsbasedon the superposition
1982;Delbaereet al.,1979).In this paper,the qualityof an
of the three-dimensionalstructures are used as a standard
automaticallyobtained sequencealignment is systematically
for testing the automatic, sequence-based
methods. Alignassessed
by referenceto alignmentsbasedon structuresuperments obtained from the global comparisonof five pairs of
position. Furthermore,in order to model more effectivelythe
homologousprotein sequencesstudied gave54Voagreement observedevolutionarypreferencefor insertions/deletions
to ocoverall for residuesin secondarystructures. The inclusionof
cur in the loop regionswhich join secondarystructures(e.g.
information about the secondarystructure of one of the pro.
Perutzet al., 1965),we haveintroduceda secondarystructure
teins in order to limit the number of gapsinserted in regions
function(Q) whichhastheeffectof reducingthepenaldependent
of secondary structure, improved this figure to 68Vo. A
structuralregions.A similar apty for a gap in non-secondary
similarity scoreof greater than six standard deviation units
proachhasbeenindependently
appliedby A.M.Lesk, M.lrvitt
suggeststhat an alignment which is greater than TSVocorand C.Chothia(I*sk et al., 1986)to align proteinswithin the
rect within secondarystrucfural regionscan be obtainedauto
globin and serineproteinasefamilies.
matically for the pair of sequences.
In thispaperfive pain of structurallyhomologous
proteinswere
Key words: alignment/protein/sequence/secondary/structure
consideredand the sequencealignmentstestedover residues
whichareclearlyin homologoussecondary
structures.Theinclusion of Q derived from a knowledgeof the threedimensional
Introduction
structureof one protein in eachpair resultedin an overall improvementin alignmentaccuracy.
Automatic sequencecomparisons(e.g. Needlemanand Wunsch,
GeoffreyJ.Bartonl and Michael J.E.Sternberg
1970; Sellers, 1974) are used to searchfor homology between
a newly determinedsequenceand one in the data bank. Recent
developmentshave aimed at identi$ing local homologies (e.g.
Sellers, 1979: Goad and Kanehisa, 1982; Boswell and
Mclachlan, 1984)or at increasingspeedand reducing memory
requirements(e.g. Gotoh, 1982; Taylor, 1986; Fickett, 1984;
Wilbur and Lipman, 1983). This paper concentrateson the accuracy of the alignment.
From a sequencealignment of two (or more) proteins, conserved regions are identified which may provide pointers to functionally and/or structurally important regions of the molecules.
This sequenceinformation alone can be the basis for studies such
as site-directedmutagenesis(Winter and Fersht, 1984) or the
selectionof peptidesagainstwhich antibodiesare raised (Sutcliffe
et al., 1983). Furthermore, a newly determined sequencemay
show sequencehomology with a protein whose structure has been
obtained crystallographically and this can lead to a threedimensional atomic model by model-building techniques(e.9.
Browne et aL, 1969;Blundell et al., 1983;Traverset al., l9M).
Most sequencecomparison methodsare basedon dynamic programming algorithms such as those originally applied in
molecular biology by Needlemanand Wunsch (1970) and formalized by Sellers (1974) and Waterman et al. (1976). These
methods carry out a global comparison of the sequencesand the
result is an optimal score for the comparisonand an alignment
with that score. Although the score may be valuable for identi$ing homology againsta background of randomized sequences,
the actual sequencealignment may not be correct. Even for two
very short sequences( ( 20 amino acids when translated) taken
@ IRL Press Limited, Oxford, England
Materials and methods
Proteinsaligned
Five pairsof proteinsanddomains,for whichalignments
based
on three-dimensional
structuresareavailable,weretakenastest
data.For eachpair ofproteinsa setofunambiguously
assigned
equivalentresidueswereselectedfrom centralportionsof homologoussecondarystructures(TableI). The percentage
accuracy
of eachautomaticsequencealignmentwas calculatedfrom the
numberof aminoacidpairswithin the definedzoneswhich were
alignedas in the structuralalignment.
Needlemanand Wunschalgoithm
The computerprogram written for this study implementeda
variant of the Needlemanand Wunsch(1970)algorithm.
(i) A matrix of amino acid pair scoresD is chosen:in its
simplestform this may indicatea scoreof 1 for identity and 0
for all otherstates,while moresophisticated
systemsincorporate
informationaboutconservative
substitutions
by usinga weighting
scheme.In this studythe MDM259matrix was used(Dayhoff,
1972, 1978)with a constantof 8 addedto removeall negative
terms. Preliminarystudiesshowedthis pair scorematrix to be
superior to either identity or genetic code types. Alternative
matrices(e.g.Mclachlan,1972;FengetaL, 1985)althoughnot
includedin this studymaybe expectedto performat leastaswell
as the MDM256matrix (Fenger al., 1985).
(ii) The proteinsequences
are definedasA1,^,B1,nwherem
z{, B, respectively.
andn arethe numberof residuesin sequence
(iii) A matrix R.,n is generatedby referenceto D whereeach
89
GJ.Barton and M.J.E.Sternberg
Table I. Proteinpain usedto test alignmentmethods
Protein/domainA
(abbreviation)
(numberof residues)
Protein/domainB
(abbreviation)
(numberof residues)
Immunoglobulinlight chain
variablercgion(FABVL) (103)
Immunoglobulinheavychain
variableregion(FABVH) (l17)
Immunoglobulinheavychain
variableregion(FABVH) (ll7)
Immunoglobulinlight chain
constantregion(FABCL) (105)
Plastocyanin(PLASTO) (99)
Source of structural alignment
Number of
residues
taken for
test
Number of
zones
Cohenet a/. (1981)
4l
Azurin (AZURIN) (128)
Cohener aJ. (1981)
Chothiaand Lesk (1982)
38
48
7
Human alphahaemoglobin
(HAHU) (l4l)
Root noduleleghaemoglobin
(r.EGHE)(156)
Lesk and Chothia(1980)
Trypsin (P2PTN) (223)
(PIEST) (24O)
Tosyl-elastase
M.Zvelebil (personalcommunication)
100
63
ll
7
For eachproEin pah, a numberof zoneswerc identified from the central regionsof homologousbeta-shandsand alpha-helices(FABVL:FABVH,
Tyr8l-Tyr86:Tyr93-Asn98;
Ser64-Ile70:Asn76l-erfi2;
His34-Gln39:Tyr33-Arg38;
Ser58-Ser62:Thr68-Asn72;
lruzt-Pro8:l,eu4{1y8,Thrl9-Gly24:Serl9-Val24;
Thr68-Asn72:Gly5l-Thr55;
Tyr33-Val37:Thr38-Lys43;
Val9-Thr97:VallOGSerl11.FABVH:FABCL, l,eu,l-Gly8:Val8-Prol2;
Serl9-Va124:Thr2,1-Ile29;
Alal3-Ile2l:Glnl,t-Val22;
Tyr93-Asr08:Tyr8zl-Thr89;
Vall0GSerlll:Val95-Val9. PI-ASTO:AZURIN,Asp2-Gly6:Sert-Gln8;
AsnTGl:u82:AIa67-Leu73;
Me92-Asn9:Metl2l-Lys128.
Val$Asp42:Val49-Ser51;Gly67-I*u74:Lys92-Yal99:
Gly78-Cys84:Glul06Cysl12;
Glu25-Asn32:Lys27-Ser34;
Pro77-His89:Asp89-Serl0l;
Thr38-Thr4l:Ata39,Asf,2;Ala53-Val70:Pro6GAla77;
HAHU:LEGH, Pro4-Lys16:Glu54lul7;Ala2l-Ser35:1e22-fle36;
Trp34-Ser37:Trp39:Thr42;
His23-Ile30:Hi98-Ile35;
Glnl5-Asn19:Gln15-Glnl9;
ho95-Hisl12:AsplM-Yall24: holl9Thrl37:Glul3l-Lysl49. P2PTN:P1EST,
Met86-Leu90:Ala95-ku9;Cysl lGGlyl20:Cysl27-Gly13l; Lysl36Alal,l0:Glnlzl6l,eul50; Metl6GAlal63:Met172-Alal75;
Gln63-Val72:Gln7GVal79;
Automaticalignmentswere tesEdby considering
how many
Gly204-Lys208:Thr22l-Arg225).
Prol8GCysl83:Prol9l-CysllX; Lysl8GTrpl93:Ala201-Phe208;
residueswithin thesezoneswere equivalencedas expectedfrom the structur€basedalignment.The secondarystructuredependentfunction @ was derivedby
referenceto the frst sequencein eachpair suchthat Q = Q, within eachzoneand Q = h outsidethe zone.
elementR;u representsthe scorefor z4;versusB;.
(iv) R.,n is actedon to generateS.,n whereeachelementSrj
holdsthe maximumscorefor a comparisonof Ai,mwith 87,2.
(v) Suitablepointersare recordedin (iv) to enablean alignment with the maximum score for 241,.v€rSUS
J|l,n to be
generated.
In order to limit the total numbrof gapsintroduced(residues
in onesequence
alignedwith blanl$), a gap-penaltyis subtracted
during the processof generatingS..n whenevera gap is introduced.We followed the recommendations
of Fitch and Smith
(1983) by using a gap-penaltyfunction having both lengthdependentand length-independent
terms of the form:
P:GtxL+Gz
whereZ is the length of gap and Gl ar\dG2 are userdefined
constants.The programalso allows gapsat the endsof the sequences(terminalgaps)to be optionallyweighted,a featurenot
inherent to the standard Needlemanand Wunsch (1970)
algorithm.
hteruion of gap-pernltyfunaion to includesecondarystructural
infornntion
The secondarystructuredependentfunctionQ modifiesthe gap
penaltyfunction to the form:
P"r:Qx(GlxL+G2)
where 0 = I s I and the suffix ss denotesthe inclusion of
secondarystructuralinformation. In its most generalform, Q
may be derivedfrom a propertyof the sequence
which exhibits
a maximumfor regionslikely to be involvedin secondarystructures or other conservedregions,and a minimum for regions
likely to be subjectto greatervariability. One miglrt therefore
denveQ from a secondarystructurcpredictionprcfile (e.g. Garrier et aI., 1978;Chouand Fasman,1977),a smoothedprofile
(e.g.[.evitt, 1976),or a profileof likebasedon hydrophobicity
ly buriedresidues(e.g. Janin, 1979).For the purposesof this
study,however,a stepfunctionwasderivedthrougha knowledge
of the tertiary structureof one of the pair of proteinssuchthat
structure,andQ :
Q - 1.0(0r) in regionsof clearsecondary
90
612
0
l
2
3
.
!
!
t
3
!
0
0
0
t
1
2
2
3
!
3
9
9
0
o
0
!
3
!
!
t
1
2
a
e
2
3
!
3
a
3
2 !
2 t
2 t
2 t
2 2
t
2
2
z
9
!
3
2
2
2
2
2
9
9
9
qt
!r
!o
!e
!2
ze
ze
5
3t
!0
l!
t2
tr
lr
? ! t t 6 l ! t ! t t ! r
l t 6 1 6 r t t ! l ! l !
9 t 6 t 6 r t l ! l r t a
lo
16
t6
t!
l!
l!
l!
0
0
I
2
3
.
Gs
I
6
?
E
9
t0
!
0 2 1
!t
ll
l r l t
t2
!r
t 2 1 3
!!
!o
ts
!3
! 6 1 6
! 6 ! 6
3
6
!6
2
t
1
6
16
t
a
5
a
|
|
I
t0
t2
29
z9
29
19
21
It
2t
2t
2?
et
2t
It
2t
Zt
It
29
lt
29
tt
tt
t!
tt
r!
a,
29
2'
29
2t
21
tt
tt
tt
lt
t!
29
2t
2e
tt
It
tt
rt
lt
29
2t
2t
tt
la
tl
t!
tr
?
t
t
t0
16
36
ta
!5
36
36
t5
16
!6
!6
36
!a
t6
!6
!6
It
It
Ir
lt
G7
-
6
tol
t2
tt
36
36
!6
3!
36
!6
t6
36
!6
t6
36
36
16
IC
!6
!c
3C
lc
36
36
!6
16
36
rc
!6
!6
rc lbl
!r
aa
36
16
llt
36
16
!6
t6
!c
lrt
16
lrl
16
t6
t6
36
!6
36
t6
36
t6
!c
,a
!6
16
!6
t5
36
!6
16
16
!6
!6
t6
!c
3c
16
t6
!6
l6
!6
!6
t6
t6
29
29
!0
16
36
!t
?2
3!
36
36
Fig, l. Re.sultof varying constantsGt and Gz for a comparisonof FABVL
with FABVH. (a) Standardalgorithm (no penaltyfor end gaps).O) As (s)
but including secondarystructuralinformation (Qt : 0.25, q : 1.0).
Each scorercpresentsthe total numberof correctly equivalencedresidues
within the sevenspecifiedzones.All numben are thereforeout of a
maximumof 41 (seeTable I and Figure 2).
0.25 elsewhere(Q) (seelegendto TableI). The valueof 0.25
wasfoundin a preliminary studyto be optimal for all five pairs
of proteins,by varyingQ2ftom0 to 0.75 in stepsof 0.25.Note
that a valueof Q, : Qt: 1.0 setsPss: P. Q thereforehas
the effect of making the formation of a gap more likely in the
regionslinking secondarystructures.
Automatic dignment of protein scqucmcl
l -----A-----
Z
?
6
V
V
O
L
L
T
E
O
O
i -----A-----
|
P
B
P
O
S
P
B
L
V
O
O
V
A
R
P
P
O
S
O
O
R
T
V
L
I
t -------B----T I S C T
E L T C T
-----I -------B
---------
I ---------C-
I
O
V
|
9
g
8
O
S
S
N
T
I
F
O
B
i
; ; ;t------c-----; ; ; ; ;I ; ; i 3 x iI [ I ; ; v i F i l S 3
l 3 + b
! --------E--------
t----D-----t
L
T
I
P
F
L
H
R
N
S
N
R
A
V
R
T
F g V
1 . LI V
r-----D----! --------F-----------
A
A
D
V
Y
Y
Y
Y
C
C
| -------F------
Z
z
S
v
V
o
A
N
O
-
N
D
D
E
T
O
A
S
R
A
5
C
K
S
C
L
I
R
D
I
I
N
t
P
o
S
P
V
o
S
L
C A P E
s
v n i
O
o
R
r
A
O
P
L
I T L L ' C
E } I t
S
K
|
I --------E-------A
A
T
L
A
I
T
R L s
N o F 6 L
------E---------|
t
O L
O P
P
P
O
O
T
R
A E
A A
D E
D T
i -------c------i
V F O O O T K L T V L R
V A
S
V I . OI O O S L V T
t------c------l
R
C
|
F
v
O L O
A V T
D
A
O
R
R
T
A
Y
I
H V K I . IY
Y Y T I I V
i -------c-------l
.
I
S S A T I - AI
N O F S L R L
|
i --------E--------
L
i ------c-------
R
R
K
T
I
t -----A----L T O P
L E o B
t-----A-----l
V
L
t ------B-----T I S C
B L T c
l------B------l
Y
I
V
T
T
F
F
l
O
v
S
s
Y
S
o
H
H
S N I
6 T F
N
O
T
N
B
O
B
A
D
T
tbl
t -----D----B
V
A
X
S
t l L v n i
r
t
t -----D-----
T
p
A
A
t
I -------F-----D
Y Y
C
O
S
V Y Y C A R N
i
i -------F------
L
S
N
lol
|
Y
D
l ' I
I
O
6
O
s
L
v
O
T
AA AE D D
|
t -- ----o------R
S
L
R
V
F
O
G
O
T
N
L
t ' l OO O S L V T
D V
I
A O C
------o------i
!
Tv
VS SL
R
Itg.2. The effect of includingsecondarysFucturalinformationinto the alignmentof FABVL (top sequence)with FABVH (bottom sequence).The seven
= 3' Q, =
zoies A-G correspondto bela-strandsin dre immunoglobulindomains(Cohenet al., l98l). (a) Standardalgorithm (no penaltyfor endlgaPs,$
=
=
(9"
F
0.25)
the
C
and
1.0,
information
structural
O)'
secondary
Qt
On
including
are
incorrect.
F
strands
and
D
and
of the C
5). The alignments
strandalignmentsare correctedbut D is still displacedby two residues.
Alignrnmtscarried out
runs usingthe following
For eachpair of proteinsequences,
gap
and
weighting,
end
for
values
Qy werc performed:
0,
Run
End gaPs
weighted?
I
2
3
4
Yes
Yes
o
o
N
N
Q,
Qt
I
0.25
I
0.25
G1and G2werevariedfrom 0 to l0
For eachmn, parameters
in integerstepsproviding l2l comparisonsfor analysis.A
preliminarystudyindicatedthat little moreinformationwasgained by explorng G/G2 spaceto Gr : Gz: 20. In addition,
werecarhomologytestsfor eachpair of sequences
conventional
ried out by randomizingthe sequences100times, obtainingan
andexpressalignmentscorefor eachrandompair of sequences
in standard
ing the scoreobtainedfrom the original sequences
comparisons.
randomized
of
the
the
mean
from
units
deviation
Resuttsard discussion
of the standnrdNeedlemanand Wunschalign,4n assesstnent
mmt ncthod
andWunschalgorithm
Thesensitivityof the standardNeedleman
to changesin G1 and G2 is illustratedby Figure la, whilst
Figure2a illustratesone possiblealignment.Figure la shows
the resultof 121alignmentscarriedout on two immunoglobulin
variabledomains(FABVL, FABVH) with Q : QL: 1.0 and
no penaltyfor end gaps.Even the best alignmenthasonly 32
whilst the
out of the41 selectedresiduescorrectlyequivalenced
worst scores16/41.Although the valuesof G1 and G2 corto thesealignmentsaredistant,thescoresof 16border
responding
of 31, 30, 32 md 18/41.A changeof oneunit in
on-regions
G1 and/or G2 canthereforelead to a great reductionin alignmentquality. Furthermore,this patternof scoresis not common
to all five protein pairs examinedindicating that a universal
recommendationfor gap-penaltyvalues is not possible. The
choiceof G1and G2 availablemay be, however,considerably
studiedhere,
reducedsincefor every pair of protein sequences
at leastoneexampleof thebestalignmentobainablefor the pair
may resultwith a valueof G1 : 0 (e'g. seeFigure la, val99s
of lZt+t). This findingdiffersfrom thatof FirchandSmith(1983)
who showedthat the expectedalignmentof two shortsequences
from chickenalphaandbetahaemoglobincould only be obtained by including a length dependentgap-term(G1).
Figure3 includesthe resultof runs for Q1= 1.0, with and
andWunsch
withoutend-gapweighting.The standardNeedleman
algorithm doesnot explicitly include gap weightsfor terminal
gaps.However,it is naturalto weightterminalgapswhenaligfng
proteinsequences
sincegenerallythereareequivalent
homologous
the end gapsleadsto an imWeighting
the
ends.
near
residues
9l
G.J.Barton and M.J.E.Sternberg
l-
Sso
g
40
uJ
o
= 3 0
ur
E
H20
0 ! : 1 0 0 ' 2 5I 0 0 ? 5
ENO6APS:
W
FABVl
FAB\IX
UW
'l 0 06
10 025
w u
FAB\IX
FABGl
w
,t.00.25 ,100.25
1.0 0.25 1.0 025
w u w
PLASTO
AZURIN
w u
HAHU
LEGH
1.0 0.25 .'0 025
w
w
uw
P2PTN
P TEST
Fig. 3. Summaryof alignmentaccuracyfor five pairsof proteins/domains.
Blocksrepres€nt
the meanof l2l alignmentsfor Gs = 0- 10, G2 = 0- l0 in
integersteps.Small circles and small squaresindicatethe best and worst alignmentobtainedin each l2l comparisons,respectively.Hatchedareasare the
resultsof the standardalgorithm(Q, = Qt : 1.0), plain areasthe resultof includingsecondary
structuralinformation(0" : 1.0, Qt = 0.25). W = endgapspenalized,UW = end-gaps
not penalized.
provementin meanalignmentquality for all five pairs of se(54-62%).In addition,thebestalignmentobtainedfor
quences
eachpairimprovedfor P2PTNversusPIESTandPLASTOversus AZURIN, whilst it stayedthe samefor FABVL versus
FABVH, HAHU versusLEGH and FABVH versusFABCL.
In all five pairs,theworstalignmentobainedimproved(32-51%
overall). Furthermorefor FABVH versusFABCL alignments
in whichno residueswerecorrectlyequivalenced
wereremoved.
TheproteinpairsFABVL versusFABVH, HAHU andLEGH
and P2PTN versus PIEST score much better overall (90 as
against39% with weightedend gaps)than FABVH versus
FABCL and PLASTOversusAZURIN. In order to align the
lattertwo pairscorrectly,a long gapmustbe introducedand/or
the algorithmmustignoreregionsof poor similarity. A long gap
will only beallowedby a Needleman
andWunschtypealgorithm
if the sequencesimilarity betweeninsertedresidues,and the
residuesborderingthe potentialgap is very weak. Normally,
severalshortergapswill be introducedinsteadin order to match
regionsof theinsertionwith segments
of theothersequence.
This
smearingof the alignmentis inherentto the global alignment
method,althoughadoptionof an algorithmwhich weightsstrings
of insertionsor deletionslower thanmultiple isolatedsingleinsertionsand deletionsmay overcomethis deficiency(Krushkal
and Sankoff,1983).
Given any hvo sequencesto align for which the threedimensionalstructuresareunknown.onewould like to discover
92
Table II. Resultof similarity tests
Comparison
Gl
G2
Percent
correctly
aligned
Significance
score
(SD unirs)
P2PTNversusPIEST
HAHU versusLEGHE
FABVL versusFABVH
PLSTO versusAZURIN
FABVH versusFABCL
0
0
0
0
0
5
9
6
98
92
78
50
26
17.5
6.5
6.1
5.1
1.7
2
'Percentcorrectly aligned' refers to the percentage
of residuesaligned
within the selectedzonesas expectedfrom the structurebasedalignment.
Values for Gt ^nd Gz were taken to give the best alignmentpossible(see
Figure 3). The significancescorewas calculatedas follows: a value (V) was
calculatedfor the comparisonof the two sequences;valuesfor the comparisonof 100 randomizedsequences
of the samecompositionand length
were then obtainedand the mean(M) and standarddeviation(SD)
calculated;the significancescorc quotedis equal to (V - M)ISD.
how successful the alignment is likely to be. The scores obtain-
(asdescribedin Materials
ed by a conventional
testfor relatedness
andmettrods)shouldfollow thesamefrendsasthestmchrre-based
testusedhereincludingany deficienciesin thetreatmentofgaps
asdiscussedabove.Tabletr illustratesthe resultsof suchhomology testsand showsa correlationbetweenthe percentageof
residueswithin secondarystructurescorrectly aligned,and the
similarity score(which is basedon the whole sequence).This
Automaticalignmentof proteinsequences
suggeststhat an alignment is likely to be good if a score greater
than 6 SD can be obtained. However, preliminary trials (data
not included), have shown that the significancescore obtained
for a comparisoncan vary by * I SD dependingon the particular
valuesof gap-penaltiesused.We suggestthat a signihcancescore
greater than 7 SD for a comparison means that most of the
residueswithin secondarystructureswill be aligned correctly.
A score of below 5 SD indicatesthat the alignment is poor,
Inclusion of secondarystructural information into the alignment
when one X-rq) structure is known
The effect ofintroducing secondarystructural information from
one sequenceof known X-ray structure is illustrated in Figure
lb. Not only has the maximum alignment score increasedto
36/41, but the methodhasconvergedon this valuewith increasing G1 and G2. This pattern is observed for FABVL versus
FABVH, HAHU versusAZURIN and P2PTN versusPIEST,
with and without end-gapweightingand indicatesthat the observed dependenceupon G1 and G2 is reducedwhen reliable secondary structural information is included from one sequence.Figure
2b illustratesone alignmentfor FABVL versus FABVH with
36/41 residuescorrectly equivalenced,this correspondsto the
correct alignmentof A, B, C, E, F and G strands.However,
the standardmethod using the same values of G1 and G2 misalignedthe C-, D- and F-strandsgiving a scoreof only 29141
(Figure 2a).
The result of including secondarystructural information into
the alignmentof all five protein pairs is presentedin Figure 3.
There is an improvement in mean alignment accuracy on including Qt = 0.25 from 54 to 68% overall (unweightedend
gaps)and 62 to 70% (weightedend gaps).The best alignment
obtainedfor eachpair also increases,except for FABVH versus
FABCL wherethe valuesstay the same.No alignmentshaving
zero residuescorrectly equivalencedwere obtainedwhen secondary structural information was included.
The improvementsin alignment within secondarystructural
regionsare of value in providing a more reliable automaticstarting point for protein model-building studies. They are also
valuablewhen aligning whole families of proteins since less
manual intervention is required. Such multiple alignmentshelp
to increaseunderstandingof the structuraland functional importanceof particular residuepositions. However, Needlemanand
Wunsch type algorithms cannot easily be extended to the
due to prosimultaneousalignmentof more than three sequences
hibitive memory and computer-timerequirements(Murata e/ a/.,
1985).Taylor (1986)has developedan algorithmwhich aligns
'templates' correspondingto conserved
on the basisof predefined
regions (e.g. beta-strands,alpha-helices)in the proteins. This
methodalthoughnot as inherentlyflexible as the Needlemanand
Wunsch approach, allows multiple alignments which conserve
structural motifs to be carried out in a reasonabletime.
Conclusions
In this study we have uged sequencealignments based on the
superpositionof known three-dimensionalprotein structuresas
a standardagainstwhich to test: (i) The Needlemanand Wunsch
(1970) global sequencecomparisonmethod, and (ii) an extended Needlemanand Wunsch algorithm which includes information basedon known secondarystructures. The following are the
main findings.
(i) If the user defined length-dependentand length-independent
gap-penaltiesare varied, a number of different alignmentscan
be obtained, many of which are only partly correct, or sometimes
completely different to the alignment expectedfrom the superposition of three-dimensionalstructures.
(ii) The best alignment using the standard Needleman and
Wunsch alignment for each pair of proteins studiedmay be ob:
tained without including a length dependentgap-penalty.
(iii) Protein sequencecomparisonswhere one sequencehas a
long insertion, or non-structurally homologousregion are not well
aligned by this method (38% overall). Treating the sequences
in sectionsbordered by known key residuesmay help the overall
alignment for these protein pairs (Smith et al., l98l).
(iv) The standarddeviation used in Needlemanand Wunsch
homology testscorrelatesvery well with the quality of alignment
as indicatedby referenceto alignmentsbasedon structuresuperposition. For those proteins tested, a score greater than 6 SD
suggeststtrat the alignment(s) obtained for the two sequenceswill
be good (>75% correct within secondarystructures).
(v) The modification of the gappenalty function to include information about the positions of secondarystructuresfrom one
of the proteins being aligned, and thus limit the number of gaps
introducedin helix/sheetregions,improvesthe overall alignment
quality and reducesthe sensitivity to changesin length-dependent
and length-independentgap-penaltyconstants.
The problems of aligning sequenceswhere there are regions
of poor structuralhomology,or large insertions(e.g. immunogobulin variableversusconstantdomains),is unlikely to be overcome by global methodsusing a simple gap-penaltyfunction as
describedhere. The inclusion of secondarystructural information from a knowledgeof the three-dimensionalstructureof one
ofthe proteins to be aligned by using a variable gap-penaltycan
provide, however, improvementsin alignmentaccuracy.In particular, for protein sequenceswhich though distantly related in
evolution do not exhibit large insertions relative to each other
(e.g. human alpha-haemoglobinversus root nodule leghaemoglobin) a useful improvement in alignment accuracycan be obtained. The inclusion of secondarystructural information into
the alignment is therefore to be recommendedwheneverpossible.
Acknowledgements
We would hke to thank Dr J.Thornton for her comments on the manuscript,
Mark6ta Zvelebil for her alignment of trypsin with elastaseand Professor
T.L.Blundell for his continuedsupport. This work was supportedby the Science
and Engineering ResearchCouncil.
References
L. and Pearl,L.H. (1983) Naure, 3M, 273 -27 5.
Blundell,T.L., Sibanda,B.
Boswell,D.R. and Mcl-achlan,A.D. (1984)Nucleic Acids Res., 12, 457-46/..
Browne,W.J., North,A.C.T., Phillips,D.C., Brew,K., Vanaman,T.C. and
Hill,R.C. (1969) J. Mol. Biol., 120, 97 -120.
Chothia,C. and [.esk,A.M. (1982\ J. Mol. Biol.,tffi,3@-323.
C h o u , P . Y . a n d F a s m a n , G . D (. 1 9 7 7 )J . M o l . B i o l . , l l 5 , 1 3 5 - 1 7 5 .
Cohen,F.E.,Novotny,J.,Sternberg,M.J.E.,Campbell,D.G.and Williams,A.F.
(1981\ Biochem. J., f95, 3l -40.
Dayhoff,M.O. (1972) ln Dayhoff,M.O. (d.), Atlas of Protein kqunce and Strucare. National Biomedical Research Foundation Washington, DC, Vol. 5,
pp.89-110.
and StrucDayhoff,M.O. (1978)In Dayhoff,M.O. (d.), Atlas of Protein Sequence
rure. National Biomedica.lResearchFoundation Washington, DC, Vol. 5, suppl.
3., pp. 34s-3s8.
N.G. (1979)Nature,279, 165- 168.
D. andJames,M.
T.J., Brayer.G.
Delbaere,L.
Timkovich,R.and Almassy,R.l.(1976)J. Mol. Biol., 100,
Dickerson,R.E.,
4'13-491.
andDoolittle,R.F.(1985)J. Mol. Ewl., 21, I 12- 125.
Feng,D.F., Johnson,M.S.
(1984)NucleicAcidsRes.,12,175-l'19.
Fickett,J.W.
Fitch,W.M.andSmith.T.F.(1983)Proc.Natl.Acad.Sci.UM, 80, 1382- 1386.
(1978)./.Mol. Biol.,l20,9'l-12O.
andRobson,B.
Gamier,J.,
Osguthoqpe,D.J.
(1982)NucleicAcidsRes.,10,247-263.
Goad,W.B.andKanehisa,M.I.
Gotoh,O.(1982)J. Mol. Biol., 162, 705-708.
93
G.J.Barton and M.J.E.Stcrnbery
Janin,J.(1979)Nuurc, 2Tl, 491-4E2'
Krushkal,J.B.ard SankotrD. (1983)In SankotrD. andKrushkal'J.B'(eds)'Tine
WarpsStringHits and Macronnlecales:the Theoryand Praaice of SeEtence
Addison-Wesley,p. 296-299.
C.omparison.
(1980)J. Mol. Biol.,136'225-270Lesk.A.M.andChothia,C.
Irsk,A.M., lrvitt,M. and Chothia,C.(1986)Prot. hg.' 1,71-78Levitt,M.(1976)J. MoL Biol.,104, 59-107.
. (1972\J. Mol. Biol., &, 417-437 .
Mct achlan,A.D
NdL Acad. Sri. USA,
ad $rssnran,J.L.(1985)Pr?oc.
Murata,M.,Richardson,J.S.
n,3037-3s77.
andWunsch,C.D.(1970)J' Mol. 8io1.,8,43-453.
Needleman,S.B.
59- lO7'
Perutr,M.F.,IGndrew,J.C.ard WaEon,H.C.(1965)J. Mol. Biol., 1114,
-193Muh.,zlt,787
Appl.
J.
Sellers,P.Onq SIAM
Sellen,P.(199) Proc. Nul. Acad. Sci' USA,76'3MI'
andFirch,W.M.(1981)J.MoL Ewl.' lt,38-'16'
Smith,T.F.,Wat€nnan,M.S.
219,
M., Grcen,N.andlrrner,R.A. (1983)Science,
Sutcliffe,J.G.,Shinnick,T.
ffi-ffi.
Taylor,W.R.(19E6).L MoL Biol., lBE, 233-258.
andBodmer,W.F'(1984)Naure,
Travers,P.,Blundell,T.L.,Stemberg,M.J.E.
310.235-238.
Smith,T.F.andBeyer,W.A.On6, Mv. Mdh., m, 367-387 .
Waterman,M.S.,
Wilbur,W.J.andLipman,D.J.(1983)Prw. Natl-Acad.Sci-USA,n,726-1n.
Winter.G.andFersht,A.R.(l9V) TrendsBiotech.,2,ll5-119.
Receivedon August27, 1986
94