Original Paper
Hum Hered 2000;50:205–210
Received: April 27, 1998
Revision received: February 17, 1999
Accepted: March 1, 1999
Linkage Analysis for Diseases with
Variable Age of Onset
Kimberly D. Siegmund a, Alexandre A. Todorov b
a Department of Preventive Medicine, University of Southern California, Los Angeles, Calif., and
b Department of Psychiatry, Washington University School of Medicine, St. Louis, Mo., USA
Key Words
Linkage analysis · Sib pair methods · Censored age of onset · Cox model
We present a method for the multivariate linkage analysis of the age of onset of a disease. The approach allows the incorporation of covariates for the study of gene by environment interactions. It is applicable to general pedigrees. The likelihood of the data is expressed as a function of the number of alleles identical by descent at a marker, the censored ages of onset and disease status, and environmental exposures. In a simulation study, we compare the power to detect linkage under different sampling schemes for either a dominant or recessive trait when approximately 10% of individuals are gene carriers. The majority of the linkage information from a sample of randomly selected sib pairs was retained when the analyses were limited to sibships with one sibling having early-onset disease (< 59 years old). Incorporating parental phenotypes could improve the power to detect the gene. When the sample consists of affected sib pairs (ASPs) having variable age of onset, the likelihood ratio (LR) test had higher power than the means (t2) test for detecting a locus with a large genetic relative risk (Rg = 20). However, the power of the two tests was similar when ASPs are selected so that the proband has an early onset of disease. Lastly, the LR test had more power than the t2 test to detect linkage in the presence of gene by environment interactions.
Copyright © 2000 S. Karger AG, Basel
Robust sib pair methods are often used in the linkage
analysis of diseases with unknown modes of inheritance.
Typically, these methods compare the distribution of the
numbers of alleles identical by descent (ibd) at a test
marker locus to that expected under the null hypothesis of
no linkage [1–3]. Unfortunately, the power of these robust
‘model-free’ methods will drop dramatically when sibling
resemblance is inflated by factors other than a gene near
the marker, such as shared environmental influences [4],
or another disease locus [3, 5]. In addition, traditional
model-free linkage analysis often ignores many aspects of
the complex diseases it was meant to analyze. In particular, many such diseases are defined not only by the degree
of expression (e.g., affected vs. mildly affected) but also by
the age at which expression occurs (e.g., breast cancer,
Alzheimer’s disease). Further, these methods fail to incorporate parental phenotypic data, except in an ad-hoc
manner [6]. The cost of incorporating these complexities
is the loss of the model-free nature of the analysis, with a
move toward ‘marker-based’ segregation analysis.
Kimberly D. Siegmund, PhD
Department of Preventive Medicine, University of Southern California
1540 Alcazar Street, Suite 220
Los Angeles, CA 90033-9987 (USA)
Tel. +1 323 442 1310, Fax +1 323 442 2349
Multivariate survival models have recently been applied to the analysis of family data allowing the joint estimation of the strength of genetic, environmental, and possible gene by environment interaction effects upon age of
onset [7–9]. These papers provide extensions to the traditional Cox model for the purpose of genetic analysis of
family data. These genetic frailty models regard the unobserved genotype as a random effect which acts multiplicatively on the baseline risk of disease. Gauderman et al.
[10] and Siegmund et al. [11, 12] illustrate their use in the
segregation analysis of lung cancer and coronary heart disease. In the present paper, we extend these methods for the purpose of linkage analysis. Although we focus on nuclear
families, the method can be applied to general pedigree
data. We compare its power to detect linkage to that of the
means (t2) statistic [1]. This comparison is made under
different sampling schemes, where families are incorporated in the analysis depending upon the age at which a
proband expressed the disease.
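For orientation, the means test used as the comparison can be sketched as follows. This is a minimal illustration of one common form of the statistic, assuming fully informative markers and the null distribution of ibd counts for sib pairs (mean 1, variance 1/2); the exact form of t2 in [1] may differ, and the function name and example counts are hypothetical.

```python
import math
from statistics import NormalDist

def means_test(ibd_counts):
    """One-sided means test on the number of alleles shared ibd (0, 1 or 2) per sib pair.

    Assumes the null (no linkage) distribution P(0, 1, 2) = (1/4, 1/2, 1/4),
    i.e. mean 1 and variance 1/2 for each pair.
    """
    n = len(ibd_counts)
    mean_ibd = sum(ibd_counts) / n
    t2 = (mean_ibd - 1.0) / math.sqrt(0.5 / n)   # standardized excess sharing
    p_value = 1.0 - NormalDist().cdf(t2)         # large-sample normal approximation
    return t2, p_value

# Hypothetical sample of 100 affected sib pairs with excess allele sharing
ibd = [0] * 15 + [1] * 50 + [2] * 35
t2, p = means_test(ibd)
```

Because the statistic uses only the ibd counts, it ignores age of onset, covariates, and parental phenotypes, which is the contrast the likelihood-based approach is meant to address.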
Frailty Model
The genetic frailty model was developed to analyze dependent
censored age-of-onset data in families, expressing the hazard of disease as a function of shared but unobserved genes and measured
environment [7–9]. The conditional hazard of disease is the instantaneous probability that given the measured covariate and unmeasured
genotype data, an individual, unaffected at age a, will develop the
disease. Across family members, the conditional hazard functions are
We make the following assumptions regarding the mode of transmission of the disease. First, there is a single di-allelic disease locus
and a finite number of measured environmental factors (e.g., gender,
race). We let D denote the disease allele, N the wild-type allele, and q,
the population frequency of D. Second, the population is panmictic
so that the gene frequencies are at their Hardy-Weinberg equilibrium
values and the probabilities of the genotypes DD, DN, and NN are $q^2$,
$2q(1 - q)$, and $(1 - q)^2$. Third, the genetic relative risk, the ratio of the
hazard of disease for a susceptibility gene carrier relative to a noncarrier, is constant across all ages.
Let $\lambda_0(a)$ denote the baseline hazard of disease, as a function of age, for unexposed individuals. Under the genetic frailty model, the conditional hazard for individual j given a vector of measured covariates, $z_j$, and a genotype $g_j$ (= 0, 1, 2) is:
$$\lambda_j(a \mid z_j, g_j) = \lambda_0(a)\, R_{g_j}\, e^{z_j(\alpha + \gamma_{g_j})}.$$
The parameters $R_g$, $\alpha$, and $\gamma_g$ denote, respectively, the genetic relative risk for genotype g in the absence of gene by environment interaction, and vectors of regression coefficients measuring the effects of covariates and the effects of interactions between gene and covariates. By definition, $R_0 = 1$. When the gene is dominant, $R_2 = R_1 > R_0$, and when it is recessive, $R_2 > R_1 = R_0$.
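The hazard model and the Hardy-Weinberg assumption above can be illustrated with a short sketch. The function names, the Weibull-type baseline hazard, and the convention that the interaction coefficient is zero for noncarriers are assumptions made for illustration only; the allele frequencies (0.05 and 0.31) and the parameter setting (10, ln 2, ln 2) are taken from the paper.

```python
import math

def genotype_probs(q):
    """Hardy-Weinberg probabilities for g = 0, 1, 2 copies of the disease allele D."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]

def relative_risks(rg, mode):
    """(R0, R1, R2) for a dominant or recessive gene with genetic relative risk rg."""
    if mode == "dominant":
        return (1.0, rg, rg)
    if mode == "recessive":
        return (1.0, 1.0, rg)
    raise ValueError("mode must be 'dominant' or 'recessive'")

def conditional_hazard(a, g, z, baseline_hazard, R, alpha, gamma):
    """lambda_0(a) * R_g * exp(z . (alpha + gamma_g)) for covariate vector z."""
    lin = sum(zk * (ak + gk) for zk, ak, gk in zip(z, alpha, gamma[g]))
    return baseline_hazard(a) * R[g] * math.exp(lin)

# Carrier frequencies implied by the two transmission models
carriers_dominant = 1 - genotype_probs(0.05)[0]   # DD or DN
carriers_recessive = genotype_probs(0.31)[2]      # DD only

# Hypothetical baseline hazard (Weibull form; scale and shape are not from the paper)
baseline = lambda a: (5.0 / 95.0) * (a / 95.0) ** 4

# Gender (z = 1 for male) as risk factor and effect modifier: (Rg, alpha, gamma_g) = (10, ln 2, ln 2)
R = relative_risks(10.0, "dominant")
alpha = [math.log(2)]
gamma = {0: [0.0], 1: [math.log(2)], 2: [math.log(2)]}  # assumed: no interaction term in noncarriers

h_ref = conditional_hazard(60, 0, [0], baseline, R, alpha, gamma)           # female noncarrier
h_male_carrier = conditional_hazard(60, 1, [1], baseline, R, alpha, gamma)  # male carrier
ratio = h_male_carrier / h_ref   # 10 * 2 * 2 = 40 under these settings
```

Both transmission models give a carrier frequency close to 10%, consistent with the simulation settings described later in the paper.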
The likelihood for censored age-of-onset data is a function of the conditional hazard of disease. Let $y_j = (a_j, \delta_j)$ denote the observed age at diagnosis and disease status for individual j, and $z_j$ the covariates. For an affected individual, $\delta_j = 1$ and $a_j$ is the age of onset. For an unaffected individual, $\delta_j = 0$ and $a_j$ is either the age at last contact or the age of death. Then, omitting the probability of censoring, the partial likelihood of the data is
$$P(y_j \mid g_j, z_j; \omega) = \lambda_0(a_j \mid z_j, g_j)^{\delta_j}\, e^{-\Lambda_0(a_j \mid z_j, g_j)},$$
where $\omega = (q, R_g, \alpha, \gamma_g, \Lambda_0(\cdot))$ summarizes the complete set of segregation parameters and $\Lambda_0(a \mid z, g) = \int_0^a \lambda_0(t \mid z, g)\,dt$. Let c denote the current estimate of the recombination fraction between the disease locus and the marker. The partial likelihood for a single family is
$$L(c, \omega \mid y, z, \pi) = \sum_{g_m}\sum_{g_f} P(y_m \mid z_m, g_m; \omega)\,P(g_m; q)\,P(y_f \mid z_f, g_f; \omega)\,P(g_f; q) \times \sum_{g_o} P(y_o \mid z_o, g_o; \omega)\,P(g_o \mid \pi, g_m, g_f; c),$$
where $\pi$ denotes the number of alleles that the two offspring share ibd
at a marker locus, and (m, f, o = (o1, o2)) index the mother, father, and
two offspring respectively. This is the form of the likelihood first proposed by Elston and Stewart [13]. The partial likelihood of the data
for a sample is obtained by taking the product over all families of the
above quantity. Estimating the segregation parameters from this likelihood requires that, conditional on the unobserved genotypes, censoring is independent and not a function of the underlying genotypes.
Under the additional assumption that the censoring distribution is
noninformative of the segregation parameters, this would be a full-likelihood approach.
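The family likelihood can be made concrete with a brute-force enumeration for one nuclear family with two offspring. This is only a sketch under stated assumptions, not the authors' implementation: there are no covariates, the baseline hazard is a hypothetical Weibull, the marker is fully informative, and the offspring genotype distribution given marker ibd is built from transmitted alleles using the probability c² + (1 − c)² that the ibd state from a given parent is the same at the marker and the disease locus. All function names are hypothetical.

```python
import math
from itertools import product

def weibull_haz(a, scale=95.0, shape=5.0):
    """Hypothetical baseline hazard lambda_0(a)."""
    return (shape / scale) * (a / scale) ** (shape - 1)

def weibull_cumhaz(a, scale=95.0, shape=5.0):
    """Matching cumulative baseline hazard Lambda_0(a)."""
    return (a / scale) ** shape

def pheno_prob(age, affected, g, R):
    """P(y | g) = hazard^delta * exp(-cumulative hazard), without covariates."""
    return (weibull_haz(age) * R[g]) ** affected * math.exp(-weibull_cumhaz(age) * R[g])

def family_likelihood(c, pi, mother, father, sibs, q, R):
    """Likelihood for parents and two sibs who share `pi` marker alleles ibd.

    Each person is a tuple (age, status); genotypes count copies of allele D.
    """
    psi = c * c + (1 - c) * (1 - c)          # same ibd state at marker and disease locus
    allele_p = {0: 1 - q, 1: q}              # 1 = D, 0 = N
    # (ibd from mother, ibd from father) at the marker, compatible with pi
    marker_states = {0: [(0, 0)], 1: [(1, 0), (0, 1)], 2: [(1, 1)]}[pi]
    total = 0.0
    for m_al, f_al in product(product((0, 1), repeat=2), repeat=2):  # ordered parental alleles
        p_par = 1.0
        for alleles, y in ((m_al, mother), (f_al, father)):
            p_par *= allele_p[alleles[0]] * allele_p[alleles[1]]
            p_par *= pheno_prob(y[0], y[1], sum(alleles), R)
        off = 0.0
        for im, i_f in marker_states:
            for jm, jf in product((0, 1), repeat=2):     # ibd indicators at the disease locus
                w = 1.0 / len(marker_states)
                w *= psi if jm == im else 1 - psi
                w *= psi if jf == i_f else 1 - psi
                for tm, tf in product((0, 1), repeat=2):  # alleles transmitted to sib 1
                    tm2 = tm if jm else 1 - tm            # sib 2 gets the same allele iff ibd
                    tf2 = tf if jf else 1 - tf
                    g1 = m_al[tm] + f_al[tf]
                    g2 = m_al[tm2] + f_al[tf2]
                    off += (0.25 * w
                            * pheno_prob(sibs[0][0], sibs[0][1], g1, R)
                            * pheno_prob(sibs[1][0], sibs[1][1], g2, R))
        total += p_par * off
    return total

# Hypothetical family: unaffected mother, affected father, two sibs affected at age 50
mother, father, sibs = (80, 0), (70, 1), [(50, 1), (50, 1)]
R = (1.0, 20.0, 20.0)                        # dominant gene, Rg = 20
lik_null = family_likelihood(0.5, 2, mother, father, sibs, 0.05, R)
lr_share2 = family_likelihood(0.0, 2, mother, father, sibs, 0.05, R) / lik_null
```

At c = 1/2 the likelihood no longer depends on the ibd count, so `lik_null` plays the role of the denominator of a likelihood ratio; the sample likelihood is the product of such terms over families.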
Power Computations
We focus the present analysis on estimating the power to reject
H0: c = 1/2 at the true disease locus (c = 0) under various sampling
schemes. To that purpose, we use a likelihood ratio (LR) test, fixing
the segregation parameters at their true values, and maximizing the
likelihood over $0 < c < 1/2$. This approach provides a valid test when the segregation parameter estimates converge asymptotically to constants [14, 15].
Our test statistic is then $Z(\hat{c}) = 2 \ln \mathrm{LR}$, where
$$\mathrm{LR} = \frac{L(\hat{c}, \omega \mid y, \pi)}{L(\omega \mid y)},$$
$\omega = \omega_{\mathrm{true}}$; $\pi$ is the ibd status of the siblings at the marker linked with the disease locus; $\hat{c}$ is the maximum likelihood estimate of the recombination fraction. Under the null hypothesis, c = 1/2 and the distribution of $Z(\hat{c})$ is a 50:50 mixture of point mass at zero and $\chi^2$ on 1 d.f. [15–17]. Under that condition, the power to detect a locus given $\omega$ but not c is calculated as the proportion of times in our sample of simulated datasets that $Z(\hat{c}) > Z^2_{1-2\alpha} = 9.55$, for a significance level $\alpha$ = 0.001. We estimate the false-positive rate by simulation, assigning
random ibd configurations r to the siblings (c = 1/2), re-estimating c
for this realization of r, and proceeding as described above.
Sampling Designs
This paper was motivated in part by the need for efficient linkage analysis of data from the National Heart, Lung, and Blood Institute Family Heart Study (NHLBI FHS) [18]. Phase I of that study collected detailed family histories on coronary heart disease from four population-based genetic epidemiology studies. Two subsamples were obtained in this study in order to collect families which would be informative for estimating the allele frequencies and genetic effect sizes as well as for detecting linkage. The first was a random sample of families selected from the parent studies and the second was a sample of families enriched with coronary heart disease. Currently, in phase III, genotyping is underway for sibships selected through either affected sib pairs (ASPs) or unaffected individuals predicted to be at high risk for coronary heart disease.

Table 1. Distribution of the number of affected individuals in nuclear families when probands are sampled at age 58 (n = 10,000 simulated families)

                        Affected carriers   Affected noncarriers   Total affected
Dominant model (q = 0.05)
  Offspring, Rg = 10    847 (45.1)          1,163 (6.4)            2,037 (10.2)
  Offspring, Rg = 20    1,403 (72.4)        1,163 (6.4)            2,566 (12.8)
  Parents, Rg = 10      1,786 (88.5)        4,832 (26.9)           6,618 (33.1)
  Parents, Rg = 20      1,919 (95.1)        4,832 (26.9)           6,751 (33.8)
Recessive model (q = 0.31)
  Offspring, Rg = 10    909 (47.8)          1,159 (6.4)            2,068 (10.3)
  Offspring, Rg = 20    1,358 (71.5)        1,159 (6.4)            2,517 (12.6)
  Parents, Rg = 10      1,673 (88.4)        4,873 (26.9)           6,546 (32.7)
  Parents, Rg = 20      1,806 (95.5)        4,873 (26.9)           6,679 (33.4)
q = Susceptibility allele frequency. Figures in parentheses are percentages.
In this paper, we consider three selection schemes: choosing families (a) at random, (b) through one sibling affected at an early age, or
(c) through ASPs, with possible restrictions on age of onset. In the
first design, the least realistic, 1,500 two-offspring families are sampled at random and all 6,000 individuals are genotyped. A random
sample is not efficient for linkage analysis, so the results are only
provided as a benchmark for comparison. Under the second study
design, families are selected for genotyping when one sibling is
affected before the age of 59 (early onset for coronary heart disease),
an 80% reduction in genotyping. This method may be preferable
when sampling ASPs yields unduly small sample sizes. The sizes used
here would be those obtained in the NHLBI FHS when pooling ASPs
and high-risk individuals (275 sibships, 957 individuals). The third
design collects ASPs only. The sample sizes considered (n = 100 or
150) would not be feasible from the random sample alone, but might
be if supplemented with ASPs collected through an additional
enriched sample. In the NHLBI FHS, 82 sibships had two or more
affected sibs (175 individuals).
We study the effect of age of onset in the ASP study by computing
power for samples where all probands are sampled at a given age.
These designs restrict the age of onset in the probands to be less than
the age at which they are sampled. Finally, we study the effect of
gender as an important risk factor which acts either independently or
as an effect modifier.
We simulated genetic and phenotypic data for nuclear families consisting of two sibs and their parents. The phenotypic data consisted of the ages of diagnosis, disease status and gender of the four family members. We generated two large populations (n = 100,000), one containing random families and the second, families with an ASP. The simulations themselves were conducted by sampling with replacement from these two databases. This approach has two advantages. First, it is fast once the families have been generated. Second, it is more realistic given that, in practice, researchers would be sampling from a limited population. Given the large size of each population, we are virtually assured that all meaningful combinations of variables are represented.
The disease status and age of diagnosis data were simulated to be compatible with the crude rates of coronary heart disease observed in the NHLBI FHS [12] (see Appendix). The distributions of affected individuals by gene carrier status for the first 10% of the families in our database are given in table 1. The offspring and parents have follow-up data of approximately 58 and 82 years, respectively. Approximately 10% of the family members carry the disease allele for each model. The larger genetic relative risk yields a greater number of affected subjects among gene carriers only. The number of affected noncarriers remains constant under both genetic effect sizes. The sibling relative risks are slightly higher for a genetic relative risk of 20 compared to 10 in both the dominant and recessive models (2.07 vs. 1.69 and 1.79 vs. 1.58, respectively). For the genetic relative risk of 20, the proportion of pairs with both sibs affected is 3.4 and 2.9% for the respective transmission models. The proportion of ASPs is 1.7% for both models under the genetic relative risk of 10.
Nearly all the linkage information in the large random
sample is retained by the subsample of families in which
the proband expressed the disease before age 59 (table 2).
The false-positive rate is slightly low under the genetic relative risk of 10 (range 0.02–0.08%). Under the genetic relative risk of 20, it ranges from 0.06–0.15%. In general, the
power is as great or greater for the recessive as compared
to the dominant model. For a genetic relative risk of 20,
we see a substantial gain in power when parental phenotypes are included. For the genetic relative risk of 10, an effect size which is more likely for complex traits, power is extremely low all around. For this small genetic effect, the power is higher in samples of 100 ASPs than in these samples of approximately 300 sib pairs where the second sib could be unaffected (table 3 vs. table 2).

Table 2. Power comparisons of the likelihood ratio test when probands are sampled at age 58 (10,000 simulated data sets; $\alpha$ = 0.001). Parameter settings ($R_g$, $\alpha$, $\gamma_g$): (10, 0, 0) and (20, 0, 0). Footnote 1: For early onset: average number of families drawn from a simulated random sample of size 1,500 with 1+ sib showing onset prior to age 59.

Table 3. Power comparisons when probands are sampled at different ages (10,000 simulated data sets; n = 100 families with affected sib pairs; $\alpha$ = 0.001). Parameter settings ($R_g$, $\alpha$, $\gamma_g$): (10, 0, 0) and (20, 0, 0). Footnotes: 0% false-positive; 0.01% false-positive.

Table 4. Power comparisons when probands are sampled at age 58 (10,000 simulated data sets; n = 150 families with affected sib pairs; $\alpha$ = 0.001). Parameter settings ($R_g$, $\alpha$, $\gamma_g$): (10, 0, 0), (10, ln(2), 0), and (10, ln(2), ln(2)).
The power of the t2 and LR tests for samples of 100
ASPs under different restrictions on the proband’s age of
onset is given in table 3. For both genetic relative risks,
the power to detect linkage is greater under the recessive
compared to the dominant model and the power of the
LR increases when parental phenotypes are incorporated
into the likelihood equations. Under the larger genetic relative risk of 20, the LR for both dominant and recessive
models can have substantially larger power than the t2 test
when the proband’s age of onset is not restricted to being
less than age 59. The greater the restriction on age, the
more similar the power of the two tests. For the smaller
genetic relative risk, we see little difference in power for
the t2 test and LR using sib pair data only. As with the
larger genetic relative risk, the power of both tests is
increased when the proband’s age of onset is restricted to
a young age. In general, the rate of false positives in table 3
ranged from 0.06 to 0.17%. However, for the small genetic relative risk, the false-positive rate of the LR was low
under the dominant model.
Table 4 shows the power to detect linkage in samples of
150 ASPs when gender is a risk factor for disease. The
power to detect linkage in the presence of the additional
risk factor is smaller for both test statistics than it is in the
gene only model. When being male increases the genetic
relative risk in carriers, modeling the interaction in the
LR increases the power to detect linkage. This increase in
power is not observed for the nonparametric t2 test. This
suggests different strategies for selecting samples based on
additional risk factors which we will address in the discussion. Overall, the false-positive rate in table 4 ranged from
0.06 to 0.14%.
We examined the performance of multivariate survival models in the linkage analysis of traits with variable age
of onset under various sampling schemes. These models
have an advantage over traditional model-free linkage
methods in that they allow specific modeling of covariates
and gene by covariate interaction effects. We found that
the power of the LR test is substantially increased in a
joint analysis of offspring and parental data. If both sibs
are affected however, the increase in power is less striking.
For studies where the number of ASPs represents only a
portion of the data available for analysis, the LR showed
there was additional information to be gained by including data from pairs with only one affected individual.
Thus for the NHLBI FHS, the LR test given in this paper
would allow us to combine data from families ascertained
through either affected sib pairs or high-risk individuals
who may themselves be unaffected.
For large shifts in the age of onset distribution caused
by a gene (Rg = 20), the LR test had greater power than the
means test to detect linkage. When the variability in age of
onset was restricted through the sampling scheme (e.g.,
only probands with onset prior to age 59 are ascertained),
the two tests had similar power. The power of the two tests
was also similar for the small genetic relative risk (Rg =
10). For the smaller effect, different age restrictions on the
proband yielded the same results. This reflects the smaller
shift in age of disease onset attributed to the gene. For
these small effects, nearly all the linkage information is
contained in the disease status and nothing is gained by
using age of onset.
We studied the effect of disease heterogeneity on our
linkage results using male gender as either an independent
risk factor or an effect modifier. When being male was an
independent risk factor for disease, power decreased for
both test statistics. This is likely to be a consequence of
ascertaining more male pairs that were affected due to
gender and not their underlying genotype. Thus simply
adjusting for independent risk factors in the LR will not
recover the power lost due to the smaller number of pairs
informative for linkage. When male gender modified the
genetic effect and the genetic relative risk was larger in
men than in women, the power of the LR was larger than
when there was no environmental effect at all. At the same
time, the power of the means test remained similar. This
suggests that power can be gained by correctly incorporating interaction effects into the LR test for linkage. Although the means test did not increase in power, it was
reassuring to know that it did not lose much power over
that from the gene-only model.
The results under disease heterogeneity suggest the
optimal sampling design to detect linkage will be based on
exposure status to observed environmental risk factors.
When environmental exposure increases the genetic relative risk, we may consider sampling exposed individuals
in whom the genetic effect is larger. However, when there
is no interaction and the environmental factor is itself a
risk factor for disease, sampling affected pairs from exposed individuals will add noise to the data and individuals who are unexposed to the environmental risk factors
should be sought. There is great importance in finding
designs which would be the least sensitive to unknown
gene by environment effects.
In order to make valid comparisons of the information
for linkage under different sampling designs we fix the
segregation parameters at their true values in each model.
Liang et al. [15] show that this approach provides an
appropriate pseudo-likelihood test for linkage. In practice, this is feasible when parameter estimates may be
obtained a priori from a segregation analysis conducted
on a large random sample. This would be possible for
instance in the NHLBI FHS. That being said however, the
power to detect linkage in the NHLBI FHS may be lower
than suggested in this paper since coronary heart disease
is a complex trait with many important risk factors.
Overall, our method presents a single test of linkage
using all genotyped offspring regardless of whether they
are affected or not and phenotypes on their relatives.
Nonaffected individuals contribute to the likelihood
through the probability of their surviving disease-free until their
age at last contact. Although we have assumed the segregation parameters are obtained from a separate analysis, our
methods may be extended to perform a joint segregation-linkage analysis. For the more general model, the age-of-onset distribution may be estimated nonparametrically.
We thank Duncan C. Thomas for his many helpful suggestions.
K.D.S. was supported in part under grants HL-56567, GM-28719
and CA-52852. A.A.T. was supported in part by grant AA-07728.
Birth years, age at death, and age of onset are created for each
individual as well as gender. Offspring are created with equal probability of being male or female. Siblings are created with a median age
difference of 3.3 years (range: 1–12.2 years). Parents are 24 years
older on average than their oldest child (range: 14–43 years).
Approximately 10% of the individuals in our samples carry a susceptibility genotype. We consider both dominant gene and recessive
gene models with allele frequencies of 0.05 and 0.31, respectively.
Additionally, we consider the baseline risk of heart disease to be
increasing with age. Individuals with a genetic susceptibility have a
10-fold increase in risk for all ages over nonsusceptible individuals.
For comparisons of results, a genetic relative risk of 20 is also considered.
We use a Weibull distribution to model gender-specific distributions for age at
death. In a random sample of 20,000 individuals, the survival probability estimates at ages 65 and 85 are S(65) = 0.82 and S(85) = 0.26 in
males and S(65) = 0.90 and S(85) = 0.47 in females. We also use a
Weibull distribution to simulate the baseline disease incidence rate.
For non-genetically susceptible individuals the median age of onset is
91 years (R0 = 1). Under genetic relative risks of 10 and 20, the
median ages of onset for genetically susceptible individuals are 58
and 51.5, respectively.
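The way a Weibull baseline combines with a constant genetic relative risk can be sketched with inverse-transform sampling. The baseline median of 91 years is taken from the text; the Weibull shape is a hypothetical value chosen for illustration, so the carrier medians produced here will not reproduce the reported 58 and 51.5 years exactly. Function names are likewise assumptions.

```python
import math
import random

BASELINE_MEDIAN = 91.0   # median age of onset for noncarriers, from the text
SHAPE = 5.0              # hypothetical Weibull shape, not given in the paper

# Scale chosen so that the baseline cumulative hazard equals ln 2 at age 91
SCALE = BASELINE_MEDIAN / math.log(2.0) ** (1.0 / SHAPE)

def median_onset(rel_risk):
    """Median age of onset when the cumulative hazard is rel_risk * (a / SCALE) ** SHAPE."""
    return SCALE * (math.log(2.0) / rel_risk) ** (1.0 / SHAPE)

def draw_onset_age(rel_risk, rng):
    """Inverse-transform draw: solve rel_risk * (a / SCALE) ** SHAPE = -ln(U)."""
    u = 1.0 - rng.random()   # uniform on (0, 1]
    return SCALE * (-math.log(u) / rel_risk) ** (1.0 / SHAPE)

def observe(onset_age, censor_age):
    """Return (age, status): onset if it occurs by the censoring age, else censored."""
    if onset_age <= censor_age:
        return onset_age, 1
    return censor_age, 0

rng = random.Random(0)
carrier_onsets = [draw_onset_age(10.0, rng) for _ in range(5)]
records = [observe(a, 58.0) for a in carrier_onsets]   # censoring age 58 as for probands' sibs
```

Here the censoring age stands in for the age at death or the age at the end of study, whichever comes first.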
The observed data consist of the disease status and age of each
individual and the number of alleles shared ibd between each sib
pair. For affected individuals, age denotes the age of disease onset;
for unaffected individuals, it is their age at death if they are dead or
their current age otherwise. Age is calculated for all individuals at the end of
study. For the families sampled at random, the study ends on December 31, 1993. At this time, the probands are all age 58, the median age
of the probands in the NHLBI FHS. To create the different age distributions under the ASP sampling design, we change the end of study
date to: December 31, 1983, 1993, 2003, 2013. When determining
the age of diagnosis and disease status data, we assume that we have
complete health records on all probands and their family members so
that the probands need not have survived until that age in order to be
1 Blackwelder WC, Elston RC: A comparison of
sib pair linkage tests for disease susceptibility
loci. Genet Epidemiol 1985;2:85–97.
2 Kruglyak L, Daly MJ, Reeve-Daly MP, Lander
ES: Parametric and nonparametric linkage
analysis: A unified multipoint approach. Am J
Hum Genet 1996;58:1347–1363.
3 Risch N: Linkage strategies for genetically complex traits. II. The power of affected relative
pairs. Am J Hum Genet 1990;46:229–241.
4 Todorov AA, Siegmund KD, Genin E, Rao
DC: Power of the affected sibpair method in
the presence of environmental factors. Genet
Epidemiol 1997;14:541.
5 Todorov AA, Borecki IB, Rao DC: Linkage
analysis of complex traits using affected sibpairs: Effects of single-locus approximations on
estimates of the required sample size. Genet
Epidemiol 1997;14:389–401.
6 Olson JM, Elston RC: Using family history
information to distinguish true and false positive model-free linkage results. Genet Epidemiol 1997;14:535.
7 Gauderman WJ, Thomas DC: Censored survival models for genetic epidemiology: A Gibbs
sampling approach. Genet Epidemiol 1994;11:
8 Li H, Thompson EA: Semiparametric estimation of major gene and family-specific random
effects for age of onset. Biometrics 1997;53:
9 Siegmund K, McKnight B: Modeling hazard
functions in families. Genet Epidemiol 1998;
10 Gauderman WJ, Morrison JL, Carpenter CL,
Thomas DC: Analysis of gene-smoking interaction in lung cancer. Genet Epidemiol 1997;14:
11 Siegmund KD, Province MA, Higgins M, Williams RR, Keller J, Todorov AA: Modeling disease incidence rates in families. Epidemiology
12 Siegmund KD, Todorov AA, Province MA: A
frailty approach for modelling diseases with
variable age of onset in families: The NHLBI
Family Heart Study. Stat Med, in press.
13 Elston RC, Stewart J: A general model for the
genetic analysis of pedigrees. Hum Hered
14 Liang K-Y, Self SG: On the asymptotic behavior of the pseudo-likelihood ratio test statistic. J
R Stat Soc Ser B 1996;59:785–796.
15 Liang K-Y, Rathouz PJ, Beaty TH: Determining linkage and mode of inheritance: Mod
scores and other methods. Genet Epidemiol
16 Self SG, Liang K-Y: Large sample properties of
maximum likelihood estimator and the likelihood ratio test on the boundary of the parameter space. J Am Stat Assoc 1987;82:605–610.
17 Ginsburg Ekh, Axenovich TI, Goodman DW:
On estimation of linkage test power. Genet Epidemiol 1996;13:355–365.
18 Higgins M, Province M, Heiss G, Eckfeldt J,
Ellison RC, Folsom AR, Rao DC, Sprafka JM,
Williams R: NHLBI Family Heart Study: Objectives and design. Am J Epidemiol 1996;143:
