Asthma - copsac



Translating inter-individual genetic variation to
biological function in complex phenotypes
Rachita Yadav
14th April, 2014
“A grain in the balance will determine which individual shall live
and which shall die - which variety or species shall increase in
number, and which shall decrease, or finally become extinct.”
Charles Darwin, The Origin of Species
Preface
This thesis was prepared at the Center for Biological Sequence Analysis
(CBS), Department of Systems Biology, at the Technical University of Denmark (DTU), under the supervision of Associate Professor Ramneek Gupta.
This thesis is a partial fulfilment of the requirements for acquiring the Ph.D.
degree. The Ph.D. was funded by the Danish Council for Strategic Research
and DTU.
This thesis is based on work carried out at CBS in collaboration with the
Copenhagen Prospective Studies on Asthma in Childhood (COPSAC), The Faculty
of Health Sciences, University of Copenhagen; the Sino-Danish Breast Cancer
Research Centre at the Faculty of Life Sciences, University of Copenhagen; the
Department of Biology, University of Copenhagen; and the UCSF Diabetes Center
and Department of Cell and Tissue Biology, University of California, San
Francisco. This thesis presents five main projects and one auxiliary project
based on the common theme of understanding variations in biological data.
Due to the lack of space in the co-author statements, my contributions to
the multidisciplinary projects are further explained in the introductions to
chapter 2 and chapter 8.
Preface
Contents
Abstract
Dansk resumé
Acknowledgements
Papers included in the thesis
Papers not included in the thesis
Abbreviations
I Introduction
1 Tools, Techniques and Data Analysis
1.1 Genomics
Past, Present and Future
The Pioneer: Microarrays
The Exciting Present: Next Generation Sequencing
The Promising Future
Applications of Sequencing
1.2 Processing of Sequencing Data
1.3 Genome Variation Analysis
Variant Calling
Genome Wide Association Study
Imputation
Targeted Sequencing
1.4 Gene Expression Profiling
Microarray Based Expression Profiling
Sequencing Based Expression Profiling
1.5 Epigenetics
1.6 Proteomics
1.7 Machine Learning
1.8 Translating High Throughput Variation Data to Function
Effects of Genomic Variations
Enrichment Analysis
Pathway Analysis
Integrative Analysis
Pathway Based Prediction Tool
Challenges of Next Generation Sequencing
Complex Phenotypes
II Childhood Asthma
Asthma Aetiology
2 Paper I - Genome-wide association analysis of childhood
3 Childhood asthma candidate gene study
4 Paper II - Machine learning based prediction of childhood
III Obesity
Obesity Aetiology
5 Paper III - Brown to white adipose tissue transition
6 Paper IV - Epigenetic changes in obesity
IV Genotype to Phenotype
7 Discovering phenotypes
7.1 Danish Pan-genome
7.2 Ancient Genome
V Epilogue
Summary and perspectives
VI Appendix
8 Paper V - Role of TIMP-1 in chemotherapy resistant breast cancer
Abstract
The key objectives of this thesis are to decipher and prioritise the
variations observed among different phenotypes. Advances in high-throughput
technology have led to a surge in biological data, and it is imperative
to analyse and interpret this information. Consequently, this thesis
examines epigenetic, genetic, transcriptomic and proteomic variations within
different multifactorial diseases; this information is then annotated and
associated with its corresponding phenotype. Childhood asthma and
obesity are the two main phenotypic themes of this thesis.
In the first section, chapter 1 provides an introduction to the various
methodologies utilised in this thesis. Subsequently, chapters 2, 3
and 4 in the second section address the identification of causal variations
in childhood asthma. Chapter 2 focuses on a genome wide association study
(GWAS) performed on an asthma exacerbation case cohort. This study reports a
new susceptibility locus within the gene CDHR3 for the exacerbation phenotype
of childhood asthma. Chapter 3 presents a pilot study that aims to design a
candidate gene panel for childhood asthma to identify the causal variants in
known asthma genes. Chapter 4 describes an artificial neural network (ANN)
based methodology for selecting genetic and clinical features with predictive
power for childhood asthma. The goal of these studies is to understand the
complex genetics of childhood asthma.
The third part of this thesis (chapters 5 and 6) focuses on various
mechanisms involved in adipose depots; adipose tissue is a major tissue
implicated in obesity. Chapter 5 sheds light on different mechanisms that
result in the replacement of metabolically efficient brown fat with
storage-type white fat in large mammals (including humans), especially within
the first few months following birth. The project discussed in chapter 6 aims
to understand the underlying differences in obesity responses in
fat cells from different white adipose tissue depots under diet-induced and
genetic obesity by decoding global epigenetic modifications.
The fourth section of this thesis (chapter 7) comprises two
studies aimed at genotype to phenotype mapping. The
first section of chapter 7 details the use of variations from the Danish
pan-genome pilot project to comprehend the common phenotypes of the
population and to attempt to establish its kinship with European populations.
The second portion of this chapter describes a personalised genome
study of an ancient genome, conducted by calculating genetic
risk scores to unravel phenotypes.
The appendix (chapter 8) comprises an integrative functional analysis of the
changing proteome and phosphoproteome in chemotherapy resistant breast cancer
cell lines with high TIMP-1 gene expression.
In summary, this thesis demonstrates applications of various
“omic” variations at different levels of complexity and their integration using
systems biology based methodologies to associate them with multifactorial
phenotypes. These studies help to reveal pivotal mechanistic details
concerning the phenotypes, which can be further utilised in drug design
and disease management.
Dansk resumé (Danish summary)
The main aim of this thesis is to decode and prioritise the variations
observed among different phenotypes. Considerable advances in high-throughput
technologies in recent years have led to an explosion in the amount of
biological data generated from many different sources. To decode the
biological phenotype from the molecular data, it is important to combine data
from different sources in the analyses. This thesis therefore deals with how
genetic, epigenetic, transcriptomic and proteomic variations affect
multifactorial diseases. These variations are annotated and associated with
different biological phenotypes. The thesis focuses primarily on the
phenotypes childhood asthma and obesity.
Chapter 1 of the thesis gives a general introduction to the different
methods used in this thesis. Chapters 2-4 in the second part of the thesis
concern the identification of causal variations in childhood asthma.
Chapter 2 focuses on analyses of genome-wide association study (GWAS) data
from a cohort of children with asthma exacerbations. This study identified a
new high-risk locus in the gene CDHR3 that increased the risk of asthma
exacerbations. Chapter 3 presents a pilot study that aims to design a panel
of genes that can identify the causal variants among genes known to cause
asthma. Chapter 4 describes an artificial neural network (ANN)-based method
for selecting genetic and clinical factors that can predict the course of
childhood asthma. These studies are designed to increase understanding of the
mechanisms behind the course of childhood asthma, which can lead to improved
prognostic tools and to better treatment of the disease.
The third part of the thesis, chapters 5-6, focuses on the different
mechanisms involved in the distribution of fat depots, which has a great
influence on obesity. Chapter 5 sheds light on how different mechanisms
in large mammals, including humans, can result in the metabolically
efficient brown fat being replaced by white storage fat, particularly within
the first few months after birth. Chapter 6 deals with the underlying
differences in the obesity response of fat cells from different white adipose
tissue depots, in both diet-induced and genetic obesity, by decoding global
epigenetic changes.
The fourth part of the thesis (chapter 7) consists of two studies aimed
at genotype-to-phenotype mapping. The first part of chapter 7 details how
variations mined from the Danish pan-genome project can be used to understand
the common phenotypes in the population and to investigate how the Danish
population is related to the populations of other European nations. The
second part of this chapter describes a personal genome study of an ancient
genome, carried out by calculating genetic risk scores.
The appendix, chapter 8, is an integrative analysis of how the proteome and
phosphoproteome change in chemotherapy-resistant breast cancer cell lines
with high expression of the TIMP-1 gene.
This thesis describes different methods for working with “omics” data on a
large scale and at different degrees of complexity, and how the different
data types can be integrated using systems biology methods to associate them
with multifactorial phenotypes. These studies help to reveal central
mechanisms that are important for the development or progression of different
disease phenotypes, which may be of great importance for the future
development of new types of medicine and for disease treatment in general.
Acknowledgements
I take this opportunity to express my sincere thanks to all the people who
have directly or indirectly inspired and helped me during my PhD. I would
like to express my gratitude to my supervisor Ramneek Gupta for being
supportive and encouraging, and for giving me freedom of thought and work. It
has been a great learning journey.
I have been very fortunate to collaborate with many different groups, namely
the Copenhagen Prospective Studies on Asthma in Childhood, Sino-Danish
Breast Cancer Research and the Department of Biology, University of
Copenhagen. The work presented in this thesis was possible because of your
expertise in the field and your critical assessments. I would like to express
my special gratitude to Hans Bisgaard, Klaus Bønnelykke, Eskil Kreiner-Møller,
Karsten Kristiansen, Jacob B. Hansen and Si Brask Sonne. It has been an
extreme pleasure to work with all of you.
It has been a pleasure to be surrounded by many helpful people from CBS
who always engaged in scientific discussions and provided me with many
helpful insights and guidance. A special thanks to Thomas Nordahl Petersen,
Thomas Sicheritz-Ponten, Simon Rasmussen, Aron Eklund and Nicolai Juul
Birkbak. A special thanks to the DTU Multi-Assay Core (DMAC) and especially
to Marlene Damsgaard for all the experimental work, that too in tight
The CBS system administration team has always been forthcoming with
technical support. Thanks to John Damm Sørensen, Peter Wad Sackett,
Kristoffer Rapacki and Hans Henrik Stærfeldt. The CBS administration
never hesitated in helping with any official work. Thank you for your help
- Lone Boesen, Dorthe Kjærsgaard, Martin Lund, Marlene Beck, Annette
Vibeke Uldall and Karina Sreseli.
Special thanks to the members and guest members of the Functional Human
Variation group. I have enjoyed all our scientific discussions as well as the
team-building events. I would also like to thank all the people who gave
invaluable comments on my thesis or its parts, especially Tammi Vest, Kirstine
Belling and Henrik M. Geertz-Hansen.
It has been a pleasure to share the office space with Dave Ussery and his
group. I would like to thank my PhD colleagues, particularly Asli, Kalliopi,
Bent, Ali, Ida, Agata and Dhany, and my late-lunch companions Arcadio,
David, Khoa and Grace for all the laughs and gossip. Thanks to all other former
and present colleagues for contributing to the friendly working environment
and great parties.
Finally, I would like to thank all my friends, especially a few old ones, Bhanu
and Rounak, for keeping me company though miles apart. This thesis would
not have been possible without the support of my mamma and daddy, who
always had faith in me and supported me. Special thanks to Mohita, who
is best at the art of infusing positive enthusiasm during difficult times. A
special thanks to the special person of my life, Sudhir, for all the support,
encouragement and patience, and also for copy-editing the thesis.
Papers included in the thesis
• Klaus Bønnelykke∗, Patrick Sleiman∗, Kasper Nielsen∗, Eskil Kreiner-Møller,
Josep M Mercader, Danielle Belgrave, Herman T den
Dekker, Anders Husby, Astrid Sevelsted, Grissel Faura-Tellez, Li
Juel Mortensen, Lavinia Paternoster, Richard Flaaten, Anne Mølgaard, David E Smart, Philip F Thomsen, Morten A Rasmussen,
Silvia Bonàs-Guarch, Claus Holst, Ellen A Nohr, Rachita Yadav,
Michael E March, Thomas Blicher, Peter M Lackie, Vincent W V
Jaddoe, Angela Simpson, John W Holloway, Liesbeth Duijts, Adnan
Custovic, Donna E Davies, David Torrents, Ramneek Gupta, Mads V
Hollegaard, David M Hougaard, Hakon Hakonarson, Hans Bisgaard.
A genome-wide association study identifies CDHR3 as a susceptibility
locus for early childhood asthma with severe exacerbations. Nat Genet.
2014 Jan;46(1):51-5.
• Rachita Yadav, Thomas Nordahl Petersen, Eskil Kreiner-Møller,
Hans Bisgaard, Klaus Bønnelykke, Ramneek Gupta. Ranking genetic
and clinical features for prediction of asthma at age 7. Manuscript in
• Astrid L. Basse∗, Karen Dixen∗, Rachita Yadav∗, Malin P. Tygesen, Klaus Qvortrup, Karsten Kristiansen, Bjørn Quistorff, Ramneek
Gupta, Jun Wang, Jacob B. Hansen. Global gene expression profiling
of brown to white adipose tissue transformation in sheep reveals novel
transcriptional components linked to adipose remodeling. Manuscript
• Rachita Yadav, Si Brask Sonne, Yin Guangliang, Ramneek Gupta,
Jun Wang, Karsten Kristiansen, Shingo Kajimura. Adipose-depot specific gene regulation by DNA-methylation in obesity. Manuscript in
• Omid Hekmat∗, Stephanie Munk∗, Louise Fogh∗, Rachita Yadav,
Chiara Francavilla, Heiko Horn, Sidse Ørnbjerg Würtz, Anne-Sofie
Schrohl, Britt Damsgaard, Maria Unni Rømer, Kirstine C. Belling,
Niels Frank Jensen, Irina Gromova, Dorte B. Bekker-Jensen, José M.
Moreira, Lars J. Jensen, Ramneek Gupta, Ulrik Lademann, Nils Brünner, Jesper V. Olsen, Jan Stenvang. TIMP-1 Increases Expression and
Phosphorylation of Proteins Associated with Drug Resistance in Breast
Cancer Cells. J. Proteome Res., 2013, 12 (9), pp 4136–4151.
∗ These authors contributed equally.
Papers not included in the thesis
• Christina Bjerre, Lena Vinther, Kirstine C. Belling, Sidse Ø. Würtz,
Rachita Yadav, Ulrik Lademann, Olga Rigina, Khoa Nguyen
Do, Henrik J. Ditzel, Anne E. Lykkesfeldt, Jun Wang, Henrik Bjørn
Nielsen, Nils Brünner, Ramneek Gupta, Anne-Sofie Schrohl, Jan Stenvang. TIMP1 overexpression mediates resistance of MCF-7 human
breast cancer cells to fulvestrant and down-regulates progesterone receptor expression. Tumor Biology, December 2013, Volume 34, Issue 6,
pp 3839-3851.
• Morten Rasmussen, Sarah L. Anzick, Michael R. Waters, Pontus
Skoglund, Michael DeGiorgio, Thomas W. Stafford Jr, Simon Rasmussen, Ida Moltke, Anders Albrechtsen, Shane M. Doyle, G. David
Poznik, Valborg Gudmundsdottir, Rachita Yadav, Anna-Sapfo
Malaspinas, Samuel Stockton White V, Morten E. Allentoft, Omar
E. Cornejo, Kristiina Tambets, Anders Eriksson, Peter D. Heintzman,
Monika Karmin, Thorfinn Sand Korneliussen, David J. Meltzer, Tracey
L. Pierre, Jesper Stenderup, Lauri Saag, Vera M. Warmuth, Margarida
C. Lopes, Ripan S. Malhi, Søren Brunak, Thomas Sicheritz-Ponten,
Ian Barnes, Matthew Collins, Ludovic Orlando, Francois Balloux, Andrea Manica, Ramneek Gupta, Mait Metspalu, Carlos D. Bustamante,
Mattias Jakobsson, Rasmus Nielsen, Eske Willerslev. The genome of a
Late Pleistocene human from a Clovis burial site in western Montana.
Nature 506, 225–229 (13 February 2014).
• The Genome Denmark Consortium. Deep whole-genome sequencing of
Danish parent-offspring trios determines private variation, de novo mutation rates and allows population wide de novo assembly. Manuscript
in preparation.
• Qin Hao, Rachita Yadav, Sidsel Petersen, Si B. Sonne, Simon
Rasmussen, Qianhua Zhu, Zhike Lu, Jun Wang, Karine Audouse,
Ramneek Gupta, Lise Madsen, Karsten Kristiansen and Jacob B.
Hansen. Transcriptome profiling of brown and white adipose tissues
during cold exposure provides evidence for extensive regulation of glucose metabolism in brown adipocytes. Manuscript in preparation.
Abbreviations
Adjusted p-value
Allele frequency
Artificial neural network
Adenosine triphosphate
Cadherin-related family member 3
Complementary DNA
Chromatin immunoprecipitation
Deoxyribonucleic acid
Encyclopedia of DNA Elements
Expressed sequence tag
Formaldehyde-assisted isolation of regulatory elements
False discovery rate
Gene ontology
Genetic risk scores
Genome wide association study
High fat diet
HUGO (Human Genome Organisation) Gene Nomenclature Committee
Insertions and deletions
Kyoto Encyclopedia of Genes and Genomes
Linkage disequilibrium
Mapping quality
Matthews correlation coefficient
Methylated DNA immunoprecipitation sequencing
messenger RNA
Next generation sequencing
Odds ratio
Principal component analysis
Pearson's correlation coefficient
Polymerase chain reaction
Protein-protein interactions
Post-translational modification
Regular diet
Ribonucleic acid
Ribonucleic acid sequencing
RNA interference
Single nucleotide polymorphism
Single nucleotide variation
Type 2 diabetes
Tag sequencing
Transcription factor
White blood cell
Whole genome amplified
Part I
Introduction
Chapter 1
Tools, Techniques and Data Analysis
All living organisms are made of smaller units called cells, which are governed
by a central rule called “the central dogma of molecular biology”. The central
dogma was first articulated by Francis Crick in 1958 [1] and restated in an
article published in Nature in 1970 [2]. According to the originally proposed
central dogma, information in biological systems flows only from DNA to
RNA to proteins. However, later developments showed that RNA can also be
converted back to DNA (Figure 1.1). The central dogma provides a framework
for understanding biological information and the relationships between
different biological components and mechanisms. A living cell is a
heterogeneous mixture of polymers, which are sequential arrangements of
repeating monomer units. The three most important biological polymers
that govern and regulate all cellular mechanisms are deoxyribonucleic acid
(DNA), ribonucleic acid (RNA) and proteins. Nucleotides are the monomers of
DNA and RNA, and amino acids are the monomers of proteins. In biology,
information is stored and transferred in the form of these three sequential
molecules. The central dogma defines the transfer of information between
these sequential polymers, and thus they are responsible for the existence
of life. A lot has been discovered about these polymers and their role in
the development, growth and sustainability of organisms. As these polymers
regulate all biological mechanisms, any variation in them from the
steady state leads to changes in the vital status of the organism. These
changes are reflected in the phenotype, and some crucial differences result
in disease. The total DNA, RNA and protein content of a cell makes up the
genome, the transcriptome and the proteome, respectively.
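The information flow described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not a method used in this thesis: the input is taken to be the coding strand of DNA, so transcription is modelled as replacing T with U, the standard genetic code is assumed, and the function names (`transcribe`, `reverse_transcribe`, `translate`) are hypothetical.

```python
# Standard genetic code, written compactly: codons are enumerated with the
# bases in the order U, C, A, G at each of the three positions.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """DNA -> RNA (assumes the coding strand is given, so T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def reverse_transcribe(rna):
    """RNA -> DNA, the reverse flow added to the original dogma."""
    return rna.upper().replace("U", "T")


def translate(mrna):
    """RNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCTTGGTAA")
print(mrna)             # AUGGCUUGGUAA
print(translate(mrna))  # MAW
```

Real transcription and translation involve strand choice, splicing and start-codon selection, none of which is modelled here.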
Figure 1.1. Adapted from Crick’s version of the “central dogma” of
biology [2].
Almost 150 years ago, Gregor Johann Mendel discovered the basis of
genetic heredity, which could explain the genetic basis of many diseases running
in families. He described these genetic discoveries in a set of three laws
known as Mendel’s laws. Although these laws were sufficient to explain
diseases caused by a single gene, they fell short in explaining phenotypes
that result either from the accumulation of multiple
genetic defects or from genetic changes occurring in response to external
stimuli. The changes in DNA causing these defects are called genetic
variations. These defects can be as small as a change of a single nucleotide
in DNA, called a single nucleotide polymorphism (SNP), or as large as
insertions or deletions of bigger chunks, called chromosomal aberrations.
Interactions between the genetics of the organism and the environment lead to
complex phenotypes. All the molecules within cells, and the cells themselves,
are interacting complex systems, and it is hard to predict the properties of
the individual systems separately. To understand them, it is necessary to
measure quantitatively the behaviour of these groups and their interacting
partners. The systematic measurement technologies that measure these
individual components of the cell are called
genomics, transcriptomics and proteomics. Based on these measurements,
systems biology methods use mathematical and computational models to
imitate the cell components and their interactions using computers. This
includes interactions of genes with each other, of genes with proteins, of RNA
with DNA, of RNA with proteins, and the interactions between the cellular
components and the environment. Accordingly, combining genetic information
with transcriptome and proteome information will lead to a
deeper understanding of basic biological mechanisms and also provide
new insights into disease states.
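As a small illustration of the single-nucleotide changes mentioned above, the following Python sketch lists the positions at which a sample sequence differs from a reference. It assumes the two sequences are already aligned and of equal length (so insertions and deletions are not handled), and the function name is hypothetical. A single difference between two sequences is strictly a single nucleotide variation; calling it a polymorphism would additionally require its frequency in a population.

```python
def single_nucleotide_differences(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumes both sequences are pre-aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (position + 1, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


print(single_nucleotide_differences("ACGTACGT", "ACGTTCGT"))  # [(5, 'A', 'T')]
```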
1.1 Genomics
Past, Present and Future
In order to understand the complex genetic mechanisms that result in or
regulate complex phenotypes, a branch of genetics referred to as genomics
evolved in the late 1980s. The term genomics was coined by Dr. Tom
Roderick and it describes the comprehensive study of the entire genetic
material of an organism [3]. Genomics also provides a new scientific basis
for studying complex diseases, which may result in new possibilities for
therapies and treatments for some diseases, as well as new diagnostic methods.
Genomics is a relatively new field of science that traces its origins to the
description of the structure of the DNA helix
by James D. Watson and Francis H. C. Crick in 1953 [4]. It was
also discovered that the sequences of the two strands define the structure of
the DNA molecule and also its function. Later technological advances led to the
development of DNA sequencing and the polymerase chain reaction, which were
extended to other molecules such as RNA.
DNA, RNA and proteins harbour a sophisticated and unique code in
their sequences, which facilitates accurate deciphering and transformation of
the coded information. This in turn allows them to control and administer
the different activities of a cell. Therefore, to understand how these
molecules control cellular activities, it is necessary to unravel their
actual primary sequence, and this process is called “sequencing”.
Once the genetic sequence of an organism is decoded, it can be compared
with that of individuals from the same species or from different species. The
process of comparing the genetic code of different species or individuals to
determine genetic variants is called genotyping. Finding the genotype reveals
the specific alleles inherited by an individual, which is particularly useful
when more than one genotypic combination drives the clinical manifestations in
The Pioneer: Microarrays
A microarray is a solid platform used to assay the molecular contents of
biological samples. A microarray, also called a chip, has numerous wells, each
acting as an experiment in itself. In a microarray, a fluorescently tagged
nucleic acid sample (target) is hybridised (annealing of two complementary
sequences) to the probes, which are attached to a solid surface. The
fluorescence generated by this hybridisation is used to determine variations
in, or the expression of, genes. The first microarrays were introduced in 1995
[5] to compare the messenger RNA (mRNA) content of cells in order to find
differences in gene expression of two
Genotyping arrays
Genotyping arrays are DNA microarrays used to detect genetic variations in
an organism. DNA microarrays can be used to identify genotypic differences
between individuals or between normal and diseased states. These differences
can be assessed by several means, one of the most informative being
Single Nucleotide Polymorphism (SNP) microarrays. These arrays
carry probes that bind to the different alleles of a SNP, and the
hybridisation of these probes to the genome gives the allele counts for the
SNPs. They are designed to capture genome wide polymorphisms, assuming a
uniform distribution of variations throughout all chromosomes. To examine
the functional regions of the genome, exome arrays are designed to capture
the variation in the coding sections of genes. Chips can also be custom
designed to capture low frequency variation (minor allele frequency (MAF) of
0.5-5%), variation known in a pathway, or SNPs previously known to be
involved in either a disease or drug metabolism. SNP arrays can be applied to
detect very small variations between individuals, which can be further used
to determine disease susceptibility and to assess genetic variation linked to
the efficacy and toxicity of drugs. In chapters 2 and 4 of this thesis, we
have used genotyping data from asthma cohorts to ascertain childhood asthma
risks.
Although SNP chips offer this flexibility of customisation, they are still
unable to capture rare (MAF < 0.5%) or novel variation.
Next Generation Sequencing (NGS), which emerged in the last
decade, has therefore been successfully applied to assess SNPs in large
population studies [6, 7, 8].
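The frequency classes used above can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than part of the thesis workflow: genotypes are given as two-character strings for one biallelic SNP, the thresholds are the ones quoted in the text (rare below 0.5%, low frequency from 0.5% to 5%), and the function names are hypothetical.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor allele, MAF) for one biallelic SNP.

    genotypes: list of two-character strings such as "AA", "AG" or "GG".
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) > 2:
        raise ValueError("more than two alleles observed; SNP is not biallelic")
    if len(counts) < 2:
        return None, 0.0  # monomorphic in this sample
    total = sum(counts.values())
    minor_allele, minor_count = min(counts.items(), key=lambda item: item[1])
    return minor_allele, minor_count / total


def frequency_class(maf):
    """Classify a MAF using the thresholds quoted in the text."""
    if maf < 0.005:
        return "rare"
    if maf <= 0.05:
        return "low frequency"
    return "common"


# 100 individuals: 96 homozygous AA and 4 heterozygous AG -> 4 G alleles of 200
allele, maf = minor_allele_frequency(["AA"] * 96 + ["AG"] * 4)
print(allele, maf, frequency_class(maf))  # G 0.02 low frequency
```

In a real study the allele counts would come from the array intensities after genotype calling and quality control, which this sketch does not cover.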
The Exciting Present: Next Generation Sequencing
Sequencing to decode the order of bases in DNA molecules was first developed
in 1977 by Frederick Sanger and colleagues [9]. It works on the
principle of terminating synthesis at each possible base. All possible
DNA fragments are synthesised by selective incorporation of modified
chain-terminating dideoxynucleotides [10]. These fragments are sorted by size
by running them on a gel, and the sequence is decoded by reading the
terminating base in ascending order of fragment size.
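The read-out principle of chain termination can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: it works directly on the newly synthesised strand (ignoring that this strand is complementary to the template), represents each terminated fragment only by its length and terminating base, and uses hypothetical function names.

```python
import random


def chain_terminated_fragments(synthesised_strand):
    """One fragment per position: (fragment length, terminating base)."""
    return [
        (length, synthesised_strand[length - 1])
        for length in range(1, len(synthesised_strand) + 1)
    ]


def read_gel(fragments):
    """Sort fragments by size and read the terminating bases in ascending order."""
    return "".join(base for _, base in sorted(fragments))


strand = "GATTACAGATTACA"
fragments = chain_terminated_fragments(strand)
random.Random(0).shuffle(fragments)  # fragments come out of the reaction unordered
print(read_gel(fragments) == strand)  # True
```

The gel step corresponds to the sort: because every fragment length occurs exactly once, ordering by size recovers the base order.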
The revolution in sequencing started with the advent of NGS, in which
tens of thousands of molecules can be sequenced in parallel. The method
starts with enrichment of the molecule of interest (DNA or RNA) from samples,
which is fragmented and concentrated either in solution
or on an array. These fragments are amplified by polymerase chain reaction
(PCR) to increase the number of individual events being sequenced.
The amplified molecules are attached to a solid surface
called the flow cell and subjected to sequencing (Figure 1.2).
Figure 1.2. Next generation sequencing (NGS) workflow. Adapted from [11].
There are multiple methods for sequencing these amplified molecules:
1. Sequencing by Synthesis: Sequencing by synthesis uses a DNA polymerase
to extend many DNA strands in parallel
by incorporating fluorescently labelled modified nucleotides. Once
incorporated, these modified nucleotides do not allow further extension and
thus serve as terminators of polymerisation [12]. The fluorescent
dye is then imaged to identify the added base. The terminating group and
dye are then cleaved, which allows incorporation of the next nucleotide,
and this process is repeated until the end of the sequence. Base calls are
made directly from the signal intensities measured during each cycle, which
greatly reduces raw error rates compared to other technologies. Illumina
sequencers such as the HiSeq and MiSeq use this technology.
2. Pyrosequencing: The single-stranded sequencing library fragments are
captured onto beads, and these beads are immobilised on a solid support.
The immobilised DNA is flushed with one type of nucleotide at a time, and
the incorporation of a nucleotide into the DNA by DNA polymerase results in
the release of a pyrophosphate [13]. This pyrophosphate is converted into
light by the enzymes ATP sulfurylase and luciferase; the light is captured
by a camera, and the signal strength is
proportional to the number of nucleotides incorporated. Pyrosequencing is
used in the Roche 454 GS FLX machine.
3. Sequencing by Ligation (SBL): Instead of using DNA polymerase,
the SBL technology uses DNA ligase to decode the sequence of the fragment
of interest, with four fluorescent dyes encoding all 16 possible
two-base combinations [14]. The amplified library undergoes multiple cycles
of probe hybridisation, ligation, imaging and analysis. The use
of oligonucleotide probes increases the accuracy of sequencing, but since the
data are produced in offset steps, interpretation of the raw data is
complicated. The Applied Biosystems SOLiD sequencer uses this method, and
the vendor provides the LifeScope software for data analysis.
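How four dyes can encode all 16 two-base combinations in sequencing by ligation can be sketched as follows. This Python example assumes the XOR-style mapping commonly used to describe two-base ("colour space") encoding; the numeric base codes, the colour numbers 0 to 3 and the function names are illustrative assumptions and are not taken from this thesis or from the LifeScope software.

```python
# Illustrative two-bit codes for the bases; a colour is the XOR of two codes.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def encode_colors(sequence):
    """Encode each overlapping pair of bases as one of four colours (0-3)."""
    codes = [BASE_TO_CODE[base] for base in sequence.upper()]
    return [first ^ second for first, second in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Recover the bases from the colours, given the known first base."""
    code = BASE_TO_CODE[first_base.upper()]
    bases = [CODE_TO_BASE[code]]
    for color in colors:
        code ^= color  # each colour determines the next base from the previous one
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors)                      # [3, 1, 0, 3]
print(decode_colors("A", colors))  # ATGGC
```

Under this scheme each colour stands for four of the 16 dinucleotides, so a colour alone is ambiguous and decoding depends on the preceding base. This is one reason the interpretation of raw data is more involved than for base-by-base methods.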
The Promising Future
All these methods differ in the PCR amplification applied to the library,
the read lengths they produce, the time required for sequencing and their raw
accuracies. Ongoing research in the field is producing new technologies that
aim to solve some of the problems of second generation sequencing,
such as short reads, the requirement for library amplification and cost.
These are termed third generation sequencing. Multiple methods are at
different stages of development, and a few have already been launched
commercially. Ion Torrent™ technology directly translates chemically encoded
information (A, C, G, T) into digital information (0, 1) on a semiconductor
chip. The Single Molecule Real Time (SMRT™) sequencing technology from Pacific
Biosciences enables faster results and longer read lengths, and thus easier
alignment. Nanopore DNA sequencing uses an exonuclease to cleave nucleotides
from DNA.
Applications of Sequencing
In recent years, high-throughput technologies that produce
millions of short sequence reads have been routinely applied to
genomes, transcriptomes and epigenomes. In this thesis, three
different types of sequencing data have been used (Figure 1.3).
1. DNA sequencing (DNA-seq)
2. RNA sequencing (RNA-seq)
3. Epigenetic mark sequencing (MeDIP-seq)
Figure 1.3. Various applications of sequencing technologies used
in different projects in this thesis. Recoverable panel labels:
differentially methylated regions; transcript quantification.
DNA Sequencing
DNA sequencing is the process of determining the nucleotide order of a given
DNA fragment. The first method of DNA sequencing was developed in 1977
using the chain termination method [10]. As technology advances, the cost of
DNA sequencing is falling, and sequencing is likely to become
an integral part of routine clinical diagnosis and treatment in the near
future. In this thesis, whole genome DNA sequencing has been applied in the
pan-genome and ancient genome projects.
RNA Sequencing
RNA sequencing is performed to assess the presence and quantity of all RNAs in a given cell. This technology was proposed by Nagalakshmi et al. in 2008, who used it to define the transcriptional landscape of the yeast genome [15]. RNA-seq quantitatively determines steady-state RNA in a sample by generating cDNA and subjecting it to massively parallel sequencing.
Epigenome Sequencing
Similarly, sequencing can also be used to find variations in epigenetic marks such as DNA methylation (methylated DNA immunoprecipitation sequencing (MeDIP-seq), reduced representation bisulfite sequencing (RRBSeq)), transcription factor binding sites (chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq)) and chromatin structure (DNase I hypersensitive sites sequencing (DNase-seq), Formaldehyde-Assisted Isolation of Regulatory Elements sequencing (FAIRE-Seq)).
MeDIP-seq [16] uses an antibody specific to 5-methylcytosine and retrieves the methylated regions of the genome for sequencing, whereas RRBSeq relies on bisulfite treatment, which chemically converts unmethylated cytosines, applied to a restriction-enzyme-reduced fraction of the genome [17]. ChIP-seq technology enables researchers to identify protein binding sites across the entire genome [18]. DNase-seq is used to identify the location of open chromatin regions based on the activity of the DNase I enzyme [19]. Similarly, FAIRE-Seq is used for determining open genomic regions [20].
RNA-seq and MeDIP-seq are discussed in more detail in sections 1.4 and 1.5 of this thesis, respectively.
1.2 Processing of Sequencing Data
Over the past few years, there has been a huge increase in NGS and, with the price falling further, more and more sequencing data will be generated.
All NGS platforms generate millions of short nucleotide sequences called reads. These reads are then either mapped individually to a reference genome or assembled into continuous sequences to understand the variations in the target genome or
Quality Control
The sequencers are not perfectly accurate and random errors occur during sequencing. The accuracy of the downstream data analysis depends on the quality of the reads. Thus, the first step in NGS data analysis is quality control. Each base called by the sequencer is assigned a score that reflects its quality. Phred quality values [21], developed at the University of Washington in the 1990s, express the probability of error at each base on a logarithmic scale.
$Q_{\mathrm{PHRED}} = -10 \log_{10} P(\mathrm{error})$
These scores are used to filter out low quality reads and as quality checks in further analysis. During sequencing, the sequencing adapters and primers are also sequenced, and these need to be removed from the real read. If the quality check reveals that the quality scores of bases towards the end of the reads are below the accepted threshold, it is recommended to remove these low quality bases by trimming the reads. Standard data analysis protocols suggest investigating the per base quality, k-mer representation and GC content to assess the overall quality of the data (Figure 1.4). These quality controls keep a check on sample contamination and prevent alignment problems.
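The relation between Phred scores and error probabilities, and the trimming of low quality bases from the end of a read, can be illustrated with a minimal sketch. The function names, the quality offset of 33 and the threshold of 20 are assumptions made for illustration only; they are not taken from the pipeline used in this thesis.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(qual_string, offset=33):
    """Decode a FASTQ quality string (ASCII offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in qual_string]


def trim_3prime(seq, qual_string, min_q=20, offset=33):
    """Remove trailing bases whose quality is below min_q (hypothetical threshold)."""
    quals = decode_quality(qual_string, offset)
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], qual_string[:end]
```

For instance, a Phred score of 20 corresponds to one expected error in 100 base calls, and a score of 30 to one in 1000.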
The quality controlled reads are mapped back to the reference genome using specialised sequence alignment tools such as Bowtie [22] or BWA [23] (Figure 1.4). Most genomes contain repetitive regions, so some reads will map to multiple places in the genome. Hence, it is advisable to fine-tune the multiple-mapping parameters of the read aligner. Another source of false positive mappings is PCR duplicates, which are artefacts of amplification during library preparation. Most sequencing pipelines recommend removing or marking such reads, and removal is preferable when the sequencing analysis is quantitative. Picard’s “MarkDuplicates” or samtools “rmdup” [24] can be used to either mark or delete the PCR duplicates. The mapping tools calculate the probability that an alignment is correct overall, which is reported as the mapping quality (MAPQ). Misaligned regions are either realigned or marked with per-base alignment qualities called Base Alignment Quality (BAQ). These two scores are used in subsequent steps such as variant, indel and copy number calling and expression analysis.
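The idea behind duplicate marking and mapping quality filtering can be sketched as follows. This is a deliberately simplified illustration and not how Picard or samtools are implemented: reads are treated as duplicates when they share chromosome, start position and strand, the copy with the highest MAPQ is kept, and the record type, function names and MAPQ threshold are all hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal alignment record, used only for illustration.
Alignment = namedtuple("Alignment", "name chrom pos strand mapq")


def mark_duplicates(alignments):
    """Return the names of reads flagged as duplicates.

    Reads sharing chromosome, position and strand are treated as PCR
    duplicates; the one with the highest MAPQ is kept (simplifying assumption).
    """
    best = {}
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand)
        if key not in best or aln.mapq > best[key].mapq:
            best[key] = aln
    kept = {aln.name for aln in best.values()}
    return {aln.name for aln in alignments if aln.name not in kept}


def filter_alignments(alignments, min_mapq=20):
    """Drop duplicates and alignments below a (hypothetical) MAPQ threshold."""
    duplicates = mark_duplicates(alignments)
    return [a for a in alignments
            if a.name not in duplicates and a.mapq >= min_mapq]
```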
Figure 1.4. Flowchart of data analysis steps for sequencing data: input data from sequencing (raw reads in FASTQ file); filtered and trimmed reads in FASTQ file; assembly or alignment with BWA/Bowtie (reads in BAM/SAM files); filtering of reads based on mapping quality with Samtools/PICARD (BAM/SAM files); SNP and structural variation calling, or gene expression analysis with DESeq.
1.3 Genome Variation Analysis
Variant Calling
After quality control, the filtered reads are aligned and the aligned reads are then used for calling genotypic variants. Variant calling means finding mismatches in the alignments with respect to the reference genome. Most variant callers find variants by comparing the probabilities of the bases observed at a position in the mapped reads with the probabilities of bases in the reference genome. When analysing a single genome, genotyping and variant calling are more or less the same, with non-reference homozygous or heterozygous calls being treated as SNPs [25]. However, when there are multiple genomes, joint posterior probabilities or likelihood ratio tests are used for SNP calling. The variant callers assume diploid individuals and take into account Hardy-Weinberg equilibrium and linkage disequilibrium (LD) information, as well as prior information about the SNPs present in the species and their allele frequencies. SAMtools [24] and GATK [26] are commonly used SNP callers (Figure 1.4).
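As a rough illustration of likelihood-based genotyping at a single biallelic site in a diploid individual, the following sketch compares the three possible genotypes given the counts of reference and alternative bases. It assumes a flat prior over genotypes and a single, constant base error rate; both are simplifications, and the sketch is not the model used by SAMtools or GATK, which additionally use base qualities, allele frequencies and other prior information as described above.

```python
import math


def genotype_log_likelihoods(n_ref, n_alt, error=0.01):
    """Log-likelihood of each diploid genotype given allele counts.

    For a genotype carrying g alternative alleles (0, 1 or 2), the chance
    of observing an alternative base in a read is
    (g / 2) * (1 - error) + (1 - g / 2) * error.
    """
    result = {}
    for g, label in enumerate(("0/0", "0/1", "1/1")):
        p_alt = (g / 2) * (1 - error) + (1 - g / 2) * error
        result[label] = n_alt * math.log(p_alt) + n_ref * math.log(1 - p_alt)
    return result


def call_genotype(n_ref, n_alt, error=0.01):
    """Return the maximum-likelihood genotype (flat prior assumed)."""
    ll = genotype_log_likelihoods(n_ref, n_alt, error)
    return max(ll, key=ll.get)
```

A site with 10 reference and 9 alternative reads is called heterozygous, while a site with only alternative reads is called homozygous non-reference.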
When bases are deleted from or added to the sequence, the variations are called insertions and deletions (indels). Large indels that disrupt functional protein domains or regulatory regions are called structural variations. There is no clear boundary between indels and structural variations. Due to several technical and analytical artefacts, all variations need to be filtered to avoid false positives. Variant calling artefacts are minimised by checking the quality score of the variation or the sequencing depth of the region. All variant calling tools generate results in the variant call format (VCF), where the information available about each variant is presented on a single line. VCFtools is widely used to manipulate these files, e.g. for merging them or extracting regions or selected SNPs [27]. Even after extensive filtering, the number of variants called from sequencing data is overwhelming, and thus automated annotation is required. Details of assigning function to SNPs and of downstream analysis are described in section 1.8 of this thesis.
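Filtering on variant quality and sequencing depth can be sketched directly on the tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The thresholds and function names below are hypothetical, the sketch assumes that depth is reported under the INFO key `DP`, and it ignores header lines and per-sample genotype columns.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info_dict[key] = value
        elif item and item != ".":
            info_dict[item] = True  # flag without a value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


def passes_filters(variant, min_qual=30.0, min_depth=10):
    """Keep variants with sufficient quality and depth (thresholds are assumptions)."""
    depth = int(variant["info"].get("DP", 0))
    qual = variant["qual"]
    return qual is not None and qual >= min_qual and depth >= min_depth
```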
Genome Wide Association Study
The completion of the Human Genome Project in 2003 and a pilot project genotyping healthy individuals, called the HapMap project, in 2005 gave researchers an opportunity to find combinations of genetic markers that can define and segregate individuals from each other. It is of special interest to find differences in the genetic constitution of healthy and diseased individuals. To find such discriminating genetic traits, a comprehensive genome wide association study (GWAS) is required. A GWAS is an approach that involves scanning many markers across the genomes of many individuals to find genetic variations associated with a phenotype. In principle, the studies are designed to associate variations with disease by comparing the allele frequencies between case and control groups. Such studies help to determine which loci differ significantly between these two groups and which allele is significantly associated with the phenotype.
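The comparison of allele frequencies between cases and controls can be illustrated with a chi-square test on a 2x2 table of allele counts. This is a minimal sketch with hypothetical function and argument names; it is not the PLINK implementation and does not adjust for covariates.

```python
import math


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 allele count table.

    Returns the chi-square statistic, its p-value and the odds ratio of
    the alternative allele in cases relative to controls.
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio
```

With 60 alternative and 40 reference alleles in cases against 40 and 60 in controls, the statistic is 8.0 and the odds ratio is 2.25.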
Different models, for example additive, multiplicative, recessive and dominant, can be used to determine the risk-related alleles. To minimise spurious SNP-phenotype associations, it is imperative to apply robust statistical methods. For case-control studies, association testing is done using logistic regression or the contingency table method. The contingency table method tests for deviation from independence, whereas logistic regression predicts the probability of case status given a genotype class. Other factors such as sex, age, race, ethnicity and disease severity influence the SNP-phenotype association, so the GWAS scores need to be adjusted for them. Since all SNPs detected in a GWAS experiment are tested for association with the phenotype, the calculated test statistic, generally a “p-value”, needs to be corrected for multiple testing [28]. The multiple testing corrections applied in GWAS include the Bonferroni correction and the Benjamini and Hochberg procedure, which controls the false discovery rate (FDR). Several software packages have been developed to perform GWAS; frequently used ones include PLINK [29], TATES [30] and SNPtest [31]. PLINK has been used in this thesis to perform the GWAS in the asthma exacerbation study described in chapter 2.
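The two multiple testing corrections mentioned above can be sketched in a few lines. The function names are hypothetical; the Bonferroni correction multiplies each p-value by the number of tests, while the Benjamini and Hochberg procedure scales each p-value by the number of tests divided by its rank and enforces monotonicity from the largest p-value downwards.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The Bonferroni correction is the more conservative of the two, which matters in a GWAS where the number of tested SNPs is very large.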
Current SNP arrays assay a dense set of markers, which need to be evenly distributed to cover the whole genome. Since most of the genome is non-protein coding, GWAS tend to find associations between a trait and SNPs that lie within these regions, and it is therefore difficult to assign a function to these SNPs. To overcome the problem of non-functional SNPs, exome chips are designed to account for the functional aspect of the detected SNPs. However, they still fail to detect the causative variant and miss SNPs located in non-coding regions of the genome. By Mendel’s law of independent assortment, all sites on a chromosome would be expected to recombine and segregate independently, but that is not entirely true. Chromosomes are mosaics and different loci have different recombination rates. SNPs are not independent: there is an association between closely located SNPs, leading to the coinheritance of certain alleles more often than would be expected by chance. This phenomenon is called linkage disequilibrium (LD). LD gradually declines with distance, so the farther apart two SNPs are, the less likely they are to depend on each other. Based on this principle, a statistical method called imputation was designed [32]. This method learns the combinatorial patterns of variation from known datasets and can accurately estimate the genotype of an unobserved SNP from the neighbouring SNPs present on the array. Prediction of genotypes by imputation is fairly accurate, provides a detailed view of the associated region to facilitate follow-up studies, and also allows correction and validation of the genotyped data [31]. Many software packages are available for imputing data, e.g. PHASE [33], IMPUTE [34] and BEAGLE [35]. Imputation of genotypes while combining different datasets leads to the identification of susceptibility loci but requires rigorous quality checks at the pre- and post-analysis stages [36]. In chapter 2, regional imputation was carried out for the regions with top hits in the initial GWAS. This made it possible to check whether any variations that were missed in the genotyping data could be associated with the phenotype.
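Linkage disequilibrium between two biallelic SNPs can be quantified from phased haplotype counts. The sketch below computes the standard measures D, r² and |D'|; the function name and the assumption that haplotypes are already phased are illustrative choices, and imputation tools use far more elaborate models than this pairwise summary.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r^2, |D'|) from counts of the four two-SNP haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B  # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    r2 = d * d / denom
    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, r2, d_prime
```

Two SNPs whose alleles always travel together give r² = 1, whereas independent SNPs give r² = 0.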
Targeted Sequencing
GWAS in past years have suggested that common variants explain only a modest percentage of the total heritability of diseases. The remaining heritability may be explained by rare or novel variants. However, genome wide capture of these variants is still not in routine use because it requires large cohorts. Also, since multiple studies generally find different variations in the same gene associated with the same phenotype, it remains to be explored whether one of them is a real causal variant or a proxy for the real variant. Thus, a study can be designed to sequence only selected regions of the genome and detect phenotype-associated variations in these target regions. There are multiple methods of target capture; however, PCR-based procedures have been the most widely used [37]. The limitation that PCR requires individual primers for each selected target led to the development of other methods, such as hybridisation to microarrays using capture probes [38, 39, 40] or the use of de novo synthesised microfluidic DNA chips [41]. The latter methods are cost effective, flexible and specific. Recent advances have reduced the amount of starting DNA required. At the same time, multiple samples can be sequenced simultaneously by multiplexing. To do so, the sequencing libraries from different samples are tagged with barcodes, so that they can be recognised and separated after sequencing [42]. Chapter 3 of this thesis describes a pilot study to target sequence samples from asthma cases, using small quantities of DNA and multiplexing using custom designed
1.4 Gene Expression Profiling
The “transcriptome” consists of the complete set of transcripts, which includes both coding and non-coding RNAs. The coding RNAs that are translated into proteins are called mRNAs, while there are various types of non-coding RNAs. Quantification of RNAs in a biological sample is called transcriptomics or expression profiling. The abundance of a given RNA depends on the balance between transcription of the gene and RNA degradation. As only the coding RNAs are translated into proteins, the expression of coding RNAs should be proportional to the protein content of the cell. However, multiple factors govern the amount of protein translated from the mRNA, and there is also a constant opposing process of protein degradation. Thus, mRNA quantification is only an indication of the protein content of the cell. The quantity of different transcripts in a cell also varies between developmental stages and physiological conditions. Estimating the abundance of the transcriptome aids understanding of the functional elements of the genome and reveals the molecular state of the cell. Expression profiling at different stages of cell life or under different conditions, such as control and disease, can be used to study regulatory gene defects in diseases, cellular responses to the environment, variations in the cell cycle etc. There are various methods for expression profiling. The two major high-throughput approaches are microarrays and sequencing.
Microarray Based Expression Profiling
The microarray methods involve incubating fluorescently labelled complementary DNA (cDNA) with custom-made microarrays or commercially available high-density oligo microarrays. This technique was developed in 1995 and has been under constant development since [43]. Various commercial technologies and microarray platforms are available, which differ in probe design, implementation of probes, probe density, RNA isolation and labelling [44]. Agilent arrays were used in this thesis and are therefore discussed in detail in the following section.
Microarray Experiment
Agilent arrays have long probes (60-mers) [45], which provide high hybridisation potential and are more tolerant to mismatches [46]. However, this reduces the space available on the array, and thus Agilent arrays have a lower probe density than other platforms [45]. Microarrays are used to detect the quantity of transcripts in cell lysates. In the first step, total mRNA is extracted from the cell lysate, using the poly-A tail as the marker for mRNA in eukaryotes [44]. Oligo (dT) primers are employed for cDNA synthesis [47]. The captured mRNA is amplified by reverse transcription polymerase chain reaction (RT-PCR) and PCR. This generates a library of cDNAs, which are labelled so that the reader can recognise them when they are hybridised to the complementary probes on the array.
Microarray Data Analysis
Analysis of microarray data compares the signal intensities from multiple arrays, for example cases versus controls, different states of the cell or different time points. The multiple processes involved in a microarray experiment can introduce noise into the data. Thus, the intensity values from a microarray experiment need to be corrected for background noise and normalised within and between arrays, so that genes within an array and across arrays are comparable. We used the Limma package [48] in R for microarray data analysis. For each chip, negative probe correction was used for within-chip background correction. Negative probe correction subtracts the negative background intensities from the foreground intensities. This is done prior to background correction, which in turn is based on an exponential model fitted to the intensities of the negative probes on the array. To normalise arrays across samples, non-linear quantile normalisation was applied between the arrays.
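The principle of quantile normalisation between arrays can be sketched as follows: each array is sorted, the mean intensity at each rank is computed across arrays, and every value is replaced by the mean for its rank. This is a minimal illustration with a hypothetical function name; it breaks ties arbitrarily and is not the Limma implementation used in this thesis.

```python
def quantile_normalise(arrays):
    """Quantile-normalise a list of equally long intensity lists (one per array)."""
    n_probes = len(arrays[0])
    n_arrays = len(arrays)
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean intensity at each rank across all arrays.
    rank_means = [
        sum(s[r] for s in sorted_arrays) / n_arrays for r in range(n_probes)
    ]
    normalised = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalised.append(out)
    return normalised
```

After normalisation all arrays share the same distribution of values, while the ordering of probes within each array is preserved.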
Statistical Testing of Differential Expression
Limma fits multiple linear models by generalised or weighted least squares. The coefficients of the fitted models describe the differences in hybridisation of the probes between two experimental conditions. The result of the linear model fit is the log fold change of each gene between conditions, together with a moderated t-statistic [48] based on the standard error. The p-value is obtained from the t-statistic and adjusted for multiple testing. The most common form of adjustment is “FDR”, which is Benjamini and Hochberg’s method to control the false discovery rate.
In the study presented in chapter 4, an Agilent whole genome mouse microarray was used to find gene expression differences in the inguinal and epididymal tissues between the regular diet (RD) and high fat diet (HF) fed
Sequencing Based Expression Profiling
In contrast to microarray-based methods, sequence-based approaches directly determine the cDNA sequence. During the developmental phase of sequencing-based expression analysis, cDNA or EST libraries [49] were subjected to Sanger sequencing. However, this had the drawbacks of being low throughput and non-quantitative. A more recent method of determining gene expression is high throughput sequencing of RNA using NGS technologies. Nagalakshmi et al. described this technology for the first time in a landmark article published in 2008 [15]. RNA-seq quantitatively determines steady-state RNA in a sample by generating cDNA and then subjecting it to massively parallel sequencing to generate short reads. The sequencing, quality control and mapping of the generated reads follow the same principles as for DNA sequencing, which are described in section 1.2. The reads are either mapped to an annotated reference genome or assembled de novo. After successful read alignment, expression per gene is quantified by counting the number of reads mapped to each gene.
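Counting the reads that map to each gene can be sketched as an interval overlap problem. The data layout (1-based inclusive coordinates), the function name and the decision to skip reads that overlap more than one gene are assumptions made for illustration; dedicated counting tools handle strand, exons and ambiguity in more detail.

```python
def count_reads_per_gene(genes, reads):
    """Count reads overlapping each gene.

    genes: dict mapping gene name to (chrom, start, end), 1-based inclusive.
    reads: iterable of (chrom, start, end) aligned read intervals.
    Reads overlapping no gene or more than one gene are not counted.
    """
    counts = {name: 0 for name in genes}
    for chrom, r_start, r_end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and r_start <= g_end and r_end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts
```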
Tag-seq Based Expression Profiling
The methods midway between microarrays and RNA-seq are the tag-based quantitative assessments. These include serial analysis of gene expression (SAGE) [50], cap analysis of gene expression (CAGE) [51] and massively parallel signature sequencing (MPSS) [52]. These methods are collectively called digital gene expression (DGE). In the SAGE methods, the short sequences to be sequenced were concatenated into long clones for sequencing, which led to high cost, low sequencing throughput and complications related to the cloning step. Tag-seq is a tag-based variant of LongSAGE in which only 17 bases, called tags, are sequenced from each transcript; however, tag-seq does not require tag concatenation and cloning as SAGE does [53, 54, 55, 56]. Tag-seq has been used in chapter 5 of this thesis and is therefore discussed in the next section.
Tag-seq Experiment
Total RNA is extracted from the sample tissue and mRNA is isolated by capturing mRNA poly(A) tail using a magnetic oligo (dT) bead. The captured
mRNA is subjected to restriction enzyme digestion, resulting in 17 nucleotide
long tags. The 17 nucleotide long tags are PCR amplified and are subjected
to high throughput sequencing (Figure 1.5).
Tag-seq Data Analysis
The data analysis of tag-seq follows the same principles of quality control, adapter removal, trimming and mapping as other sequencing methods. The reads lack the 4 nucleotides of the restriction enzyme recognition site, so the 4 bases “CATG” are added to the 17-nucleotide reads, giving a total of 21 nucleotides and helping specific mapping to the reference genome [58]. The number of reads mapping to each gene is counted using HTSeq [59] or CuffDiff [60], and these counts can be used in different DGE packages in R [61] for finding differentially expressed genes.
In chapter 5 of this thesis, we have used DESeq [62] for identification of differentially expressed genes. In brief, DESeq estimates the variance in count data from high-throughput sequencing assays and tests for differential expression based on the negative binomial distribution. Comparing the negative binomial fits of the genes in the two conditions in question yields a set of differentially expressed genes, which can be subjected to further downstream analysis.
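The reconstruction of 21-nucleotide tags from the 17-nucleotide reads can be sketched as below. It is assumed here that the recognition site “CATG” is placed in front of the read; the function names are hypothetical and the sketch stops before mapping.

```python
from collections import Counter


def restore_tag(read, site="CATG", tag_length=17):
    """Prepend the restriction site to a 17-nucleotide tag read."""
    if len(read) != tag_length:
        raise ValueError("unexpected tag length: %d" % len(read))
    return site + read


def count_tags(reads, site="CATG", tag_length=17):
    """Count identical restored tags; reads of the wrong length are skipped."""
    return Counter(
        restore_tag(r, site, tag_length) for r in reads if len(r) == tag_length
    )
```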
Figure 1.5. Procedure of tag-seq after the RNA is extracted and bound to the beads (from total RNA from cell lysate to data analysis). The last product from the bead-attached processing steps undergoes sequencing using the sequencing primers annealed to it [57].
Differential Gene Expression of Sequencing
One of the most useful applications of transcriptomics studies, using either microarrays or high throughput RNA-seq, is the comparison of transcript expression levels between conditions or over different time points to identify differentially expressed genes. Differential gene expression (DGE) under normal physiological conditions drives a number of biological mechanisms that define basic cellular functions, e.g. differentiation, growth, migration and cell death, and is thus important for normal development. However, divergence of gene expression from the normal state can lead to disease. The regulation of gene expression is essential for normal functioning and is controlled by many factors. Differential gene expression is not necessarily caused by loss or gain of genetic material, but more often by differential regulation of transcription, which is mediated by transcription factors, cofactors, genome accessibility and epigenetic regulators. Several programs have been developed for the analysis of such high throughput transcriptomics data and are widely used. For differential gene expression analysis, tools such as edgeR [63], DESeq [62] and CuffDiff [60] have been designed, which are all based on the common principles of normalisation, background correction and significance calculation. However, they differ in their acceptable input data, experimental design and underlying statistics.
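As one example of the normalisation step, the median-of-ratios idea associated with DESeq can be sketched as follows: each sample's counts are compared with the per-gene geometric mean across samples, and the median of these ratios is taken as the size factor of the sample. This is a simplified illustration with a hypothetical function name, not the package's actual code; genes with a zero count in any sample are skipped.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if all(c > 0 for c in row):
            # Log of the geometric mean of this gene across samples.
            log_geo_mean = sum(math.log(c) for c in row) / n_samples
            for j, c in enumerate(row):
                log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]
```

Dividing the counts of each sample by its size factor makes samples sequenced to different depths comparable.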
Concluding Remarks for DGE
Studies have found high concordance between microarray data and high throughput sequencing data; thus microarray technology still holds good when the study is aimed at known genes and transcripts [64]. However, sequencing-based transcriptome analysis has a few advantages, such as absolute quantification, low background noise, a larger dynamic range, suitability for non-model organisms and high sensitivity. Tag-seq has a major drawback in its short reads, which result in unspecific and multiple mapping. The difficulties that RNA-seq poses include library preparation for different types of RNA, fragment length limitations and coverage of the transcriptome. There are also some limitations common to gene expression profiling methods. Most of them require PCR amplification, which is found to be the major source of noise in the data. High abundance of some transcripts in RNA-seq may lead to skewed results. In all gene expression experiments, replicates are vital because they allow statistical significance to be assessed.
In chapter 5 of this thesis, tag-seq has been employed to study differential gene expression between the seven adipose tissue samples from sheep. Since sheep is not a model organism, a sequencing-based method was well suited for gene expression profiling. Tag-seq was used, which substantially lowered the cost compared to RNA-seq without compromising too much on biological information.
1.5 Epigenetics
Factors affecting the genome of the cell other than the nucleotide sequence, which act above (“epi”) genetics, are collectively termed epigenetic factors. Epigenetics is therefore the study of the changes that occur above the genome and of the factors influencing them. Epigenetics is involved in normal cellular processes like cell differentiation, proliferation and maintenance of steady state. Many epigenetic factors control the expression of genes by altering DNA folding and compactness. Examples of such epigenetic factors are methylation of DNA; acetylation, methylation and phosphorylation of histones; RNA-induced silencing; and nucleosome positioning.
Figure 1.6. Different epigenetic events that occur in nucleus of a
cell [65].
In most cells, epigenetic marks are established at the time of differentiation and maintained throughout the life of the cell. Under certain conditions the epigenetic marks become dynamic and reversible. They are also influenced by environmental factors, which might lead to the development of an abnormal phenotype. In higher organisms, it is mostly cytosine residues in DNA that are modified, to 5-methylcytosine (Figure 1.6). Global hypomethylation has been observed in multiple cancers [66], while site-specific hypermethylation occurs in CpG islands in gene regions [67]. One of the environmental factors with epigenetic effects is diet. An investigation by Wolff et al. revealed that maternal diet could alter the coat colour of the offspring in mice [68]. Disruption of epigenetic mechanisms causes several pathologies, including cancer, mental retardation, obesity and diabetes [69]. In chapter 6 of this thesis, we have studied DNA methylation changes between tissues from lean and obese mice. This epigenetic mark is therefore discussed in detail in the following section.
DNA methylation is a complex process in terms of regulation: it depends on time, tissue, DNA sequence, region of the genome and a concoction of regulatory enzymes and proteins. DNA methylation is one of the most highly studied epigenetic marks and controls gene activity, specifically during development and differentiation [70]. The extent of DNA methylation changes in an orchestrated way during mammalian development, starting with a wave of demethylation during cleavage, followed by genome-wide de novo methylation after implantation [71]. Different DNA methyltransferases (DNMTs) are responsible for the methylation of DNA, each with a specific function [72].
Multiple methods are available to map DNA methylation on a genome scale. These methods combine methylation analysis of DNA with either microarrays (methylation chips) or sequencing. Chip-based methods work on the same hybridisation principles as expression microarrays, but use two probes, one capturing the methylated and the other the unmethylated form of a region. The ratio of the signals from these two probes indicates whether a base is methylated. The Infinium HumanMethylation BeadChip from Illumina is a widely used array platform for humans [73]. The different methods available for preparing enriched libraries for methylation sequencing are:
•MeDIP-seq - uses an antibody specific to 5-methylcytosine and retrieves the methylated regions of the genome for sequencing [16]
•MethylCap-seq - uses methyl-binding domain proteins for capturing the methylated regions [74]
•MRE-Seq - uses methylation-sensitive restriction enzymes to enrich the DNA for sequencing [75]
•MethylC-seq or BS-seq - uses the bisulfite chemical reaction to convert unmethylated cytosines into uracils, thus introducing methylation-specific single nucleotide polymorphisms, which can be differentiated from methylated CpGs [76]
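The principle behind bisulfite-based methods in the list above can be illustrated in silico: unmethylated cytosines are converted and read as T after amplification, while methylated cytosines remain C, so comparing a read with the reference reveals the methylated positions. The sketch considers only the forward strand, uses 0-based positions and assumes complete conversion; the function names are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite conversion of one strand.

    Unmethylated C is read as T; C at a methylated (0-based) position stays C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(sequence)
    )


def call_methylated_positions(reference, converted_read):
    """Positions where the reference has C and the converted read still has C."""
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, converted_read))
        if ref_base == "C" and read_base == "C"
    ]
```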
The enriched DNA from these methods is subjected to sequencing and data analysis as described in section 1.2. The methods differ in accuracy, coverage and resolution. Bisulfite methods have higher accuracy than any of the enrichment methods and are free of CpG bias [77]. Reduced representation bisulfite sequencing (RRBSeq) is a type of bisulfite sequencing that gives more coverage with less sequencing [17]. MethylCap-seq and MeDIP-seq give higher coverage of the genome. All of them are equally efficient at detecting differentially methylated regions. Thus, good practice would be to use an enrichment or chip-based method to find associations and to validate the findings with bisulfite methods.
Epigenetic marks and their transgenerational inheritance are also seen as an answer to the missing causality in complex traits. A study examining the short- and long-term effects of dietary supplements on genetically identical mice suggests that diet supplements induce small but widespread epigenetic changes in exposed mice [78]. Comparing the DNA methylation patterns of high and low responders to a hypo-caloric diet has identified novel potential epigenetic biomarkers for weight loss [79]. Based on these findings, a study was designed to elucidate how DNA methylation affects weight gain in diet-induced obese mice and how this differs from genetic obesity. We used MeDIP-seq to find the methylation changes between lean and obese mice. MeDIP-seq provides high-quality whole genome methylation status at a resolution of typically 100 to 300 bp, and the cost is comparable to other capture-based
As obesity studies have seen the effect of environmental factors like diet across generations [80], epigenetic marks could support risk prediction as well as interventions for obesity. Established and potential psychopharmacological drugs are already known to influence epigenetic mechanisms [81]. Although the field is at an early stage of understanding the complex epigenetic regulatory mechanisms, preliminary evidence suggests possibilities for the development of epigenetic therapy for some disorders.
1.6 Proteomics
The proteome is the set of proteins expressed by a genome, and the study of the structures and functions of the proteins expressed at a given time in a sample is called proteomics. Proteomics studies play an important role in medicine and biology because they link genetics to the active molecules in cells under normal and pathophysiological states. Mass spectrometry (MS) based quantitative methods attempt to quantify the constituent proteins in a sample [82]. Quantitative proteomics is important for disease biology as well as drug discovery, because the amount of expressed mRNA may not correspond to the quantity of the corresponding protein [83], and protein quantification therefore better indicates the biological state of the cell.
Proteins can be modified even after they have been translated from mRNA. Post-translational modifications (PTMs) of proteins include the addition and removal of small chemical groups such as acetyl and methyl. One very prevalent modification is the addition of a phosphate group, called phosphorylation. Protein phosphorylation on serine, threonine and tyrosine residues occurs on more than one-third of all cellular proteins [84]. Proteins called kinases are responsible for transferring phosphate groups from a donor to proteins [85], while phosphatases [86] are the family of proteins responsible for removing phosphate groups. Kinases and phosphatases are part of signalling processes and are themselves regulated by signalling processes. Therefore, along with the expression of the protein, post-translational modifications also influence the shape, function and cellular localisation of proteins.
Differences in the abundance of proteins and PTMs between disease and
non-disease samples point to the cellular processes and pathways perturbed by the
disease. Stable Isotope Labelling by Amino acids in Cell culture (SILAC)
is a method for MS based quantitative proteomics. In this method,
the total cellular proteome is labelled with a non-radioactive heavy isotope
by supplementing the medium with a labelled amino acid, which is incorporated into
the cell proteome by the normal biological process of protein synthesis [87].
These heavy amino acids can be distinguished in MS, and when a labelled
sample is compared to a control sample, the difference gives the relative
quantification of the proteome.
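The relative quantification described above can be illustrated with a minimal sketch. It assumes that heavy (labelled) and light (control) MS intensities have already been summarised per protein; the function name, the protein identifiers and the intensity values are hypothetical and only serve as an illustration, not as the pipeline used in this thesis.

```python
import math

def silac_log2_ratios(heavy, light):
    """Return log2(heavy / light) for every protein quantified in both channels."""
    ratios = {}
    for protein, h in heavy.items():
        l = light.get(protein)
        if l is None or h <= 0 or l <= 0:
            continue  # skip proteins without a usable pair of intensities
        ratios[protein] = math.log2(h / l)
    return ratios

# Hypothetical per-protein intensities (arbitrary units), for illustration only
heavy = {"P1": 400.0, "P2": 100.0, "P3": 50.0, "P4": 10.0}
light = {"P1": 100.0, "P2": 100.0, "P3": 200.0}

ratios = silac_log2_ratios(heavy, light)
for protein, value in sorted(ratios.items()):
    print(protein, value)
```

A positive value indicates higher abundance in the labelled sample, a negative value higher abundance in the control, and a value near zero no change.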
Studying the proteome and the phosphoproteome in the same sample assists in determining the effect of phosphorylation on the expressed
protein set. This can further illustrate how signalling processes are controlled by
the phosphorylation of proteins by kinases [88]. Both kinases and phosphatases
recognise their substrates by motif recognition. The methods described here
have been employed to find differentially expressed proteins as well
as differentially phosphorylated proteins in a chemotherapy resistant breast
cancer cell line (see appendix chapter 8). The study aims at finding the
effect of high TIMP1 expression on the global proteome and phosphoproteome
in the resistant cells.
1.7 Machine Learning
Analysing the large volumes of data now generated in many fields is a complex
process. Automated methods of data analysis are therefore in demand, and
machine learning helps in designing them. Machine learning can be defined
as a set of methods that automatically detect patterns in data and recognise them when they are seen again. The goal of machine learning is to learn the
rules for mapping a set of inputs to a set of outputs. A method that learns
from one set of data and applies the knowledge to classify another dataset is called predictive or supervised. In contrast,
descriptive or unsupervised learning tries to find patterns within the same
dataset. Pattern classification and knowledge discovery require a subset of
features that best represents the pattern in question. The selection of these
features determines the performance of the classifier. Genetic algorithms offer an
attractive approach to finding a near-optimal set of descriptors by
generating multiple combinations and testing their accuracy [89]. However,
the time required to converge to the best combination is long. Other
popular machine learning methods for classification include support vector machines (SVM), artificial neural networks (ANN) and classification and regression
trees. ANNs, combined with a combinatorial approach to feature selection, have
been used in a supervised model for predicting asthma, as presented in chapter
4 of this thesis, and are thus discussed in detail.
Artificial Neural Networks
ANN is a method of artificial intelligence inspired by the functioning of the human brain. The method was originally invented in 1943 [90] and has been applied
extensively to solve non-linear problems. A typical ANN comprises
three types of layers made up of nodes (denoting neurons): an input layer that
passes the input data to other layers, an output layer that captures
the classification outcome, and hidden layers, which take the data
from the previous layer and pass the processed data to the next layer (Figure
1.7). The nodes of different layers are connected by edges (equivalent to
synapses in the brain), which carry weights. An ANN design consists of three
cycles: learning, testing and decision making. A learning strategy is applied
to change the weights in order to minimise the error.
ANNs recognise patterns in a known dataset called the training data, and the main goal of the network is to make predictions on novel
inputs, called the test data. During the learning cycle, a function is optimised
to maximise the capture of positive data points and the rejection of negative ones.
In iterations over a number of cycles called “epochs”, every data point in
the training data is fed to the ANN, one after the other. The error in prediction
is calculated and the weights are updated.
Figure 1.7. Simplified artificial neural network presenting the
three layers and the edges.
The error function used for a binary classification problem is half the sum of the squared differences between the output predicted by the ANN and the target
(known) value:

$$E = \frac{1}{2} \sum_{i} (t_i - o_i)^2$$

where t is the target value and o is the output from the ANN.
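The learning cycle can be sketched in a few lines of code. The example below is a minimal illustration and not the network used in chapter 4: it assumes a single hidden layer, sigmoid activations, plain gradient descent on the squared error defined above, and an update after every data point. The function names, the learning rate, the number of hidden nodes and the toy data (a logical OR) are all assumptions made for the illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output for one input."""
    xb = x + [1.0]  # append a constant bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    out = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, out

def total_error(data, w_hidden, w_out):
    """E = 1/2 * sum_i (t_i - o_i)^2 over all data points."""
    return 0.5 * sum((t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data)

def train(data, n_hidden=3, epochs=2000, rate=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):          # one epoch = one pass over the training data
        for x, t in data:            # data points are presented one after the other
            hidden, o = forward(x, w_hidden, w_out)
            delta_o = (t - o) * o * (1.0 - o)
            # hidden deltas are computed with the output weights before the update
            delta_h = [delta_o * w_out[j] * hidden[j] * (1.0 - hidden[j])
                       for j in range(n_hidden)]
            hb = hidden + [1.0]
            for j in range(n_hidden + 1):
                w_out[j] += rate * delta_o * hb[j]
            xb = x + [1.0]
            for j in range(n_hidden):
                for k in range(n_in + 1):
                    w_hidden[j][k] += rate * delta_h[j] * xb[k]
    return w_hidden, w_out

# Toy training data (logical OR), for illustration only
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
initial = train(data, epochs=0)      # same seed, so these are the starting weights
trained = train(data)
print(total_error(data, *initial), total_error(data, *trained))
```

In practice the loop would not run for a fixed number of epochs but would stop according to a criterion evaluated on held-out data.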
The most widely used stopping criterion for the learning cycles is the attainment of the highest test correlation coefficient. Two correlation
coefficients are used: (i) the Matthews correlation coefficient (MCC) for binary
classification and (ii) the Pearson correlation coefficient (PCC) for a
continuous output variable. The weights from the cycle with the highest test
correlation coefficient are used for later classifications. In chapter 4, an
ANN was used to predict disease outcome, which is a binary variable,
and MCC was therefore used as the stopping criterion.
The formula to calculate MCC can be represented by the equation:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

TP: number of True Positive (Predicted = True, Actual = True)
TN: number of True Negative (Predicted = False, Actual = False)
FP: number of False Positive (Predicted = True, Actual = False)
FN: number of False Negative (Predicted = False, Actual = True)
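As a small worked illustration, MCC can be computed directly from the four counts defined above. The function name and the example counts are hypothetical, and returning 0 when the denominator vanishes is a common convention rather than something stated in this text.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts, for illustration only
print(mcc(tp=6, tn=3, fp=1, fn=2))
print(mcc(tp=5, tn=5, fp=0, fn=0))   # perfect prediction
print(mcc(tp=0, tn=0, fp=5, fn=5))   # complete disagreement
```

MCC ranges from -1 (complete disagreement) through 0 (no better than random) to +1 (perfect prediction).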
Sensitivity and specificity are two more prediction accuracy parameters
commonly used in machine learning. Sensitivity measures the ability of the
model to predict positives as positives, while specificity measures the accuracy
of the model in rejecting the negatives.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

Figure 1.8. Four-fold cross validation of the training data. The
training data is divided into four parts, each of which acts once as the test set for stopping the training. The evaluations are averaged, and a separate evaluation set is kept aside.
The output from the ANN is generally a probability. Probabilistic predictions [91] are suitable for classification as they assign a probability value to
each data point. This probability is an estimate of the data point
belonging to one class. It is converted to a positive or negative
classification by applying a threshold value. All the above-mentioned measures depend on this threshold, and the threshold used in this work is 0.5.
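The dependence of these measures on the threshold can be illustrated with a short sketch. It assumes that a data point is called positive when its predicted probability is greater than or equal to the threshold (the text does not say whether the boundary value itself counts as positive); the function names and the example probabilities are hypothetical.

```python
def confusion_counts(probabilities, labels, threshold=0.5):
    """Convert probabilities to classes and count TP, TN, FP and FN."""
    tp = tn = fp = fn = 0
    for p, actual in zip(probabilities, labels):
        predicted = p >= threshold  # assumption: the boundary counts as positive
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    return tn / (tn + fp) if (tn + fp) else 0.0

# Hypothetical ANN outputs and known classes, for illustration only
probabilities = [0.9, 0.6, 0.4, 0.2, 0.7, 0.1]
labels = [1, 1, 1, 0, 0, 0]

for threshold in (0.5, 0.8):
    tp, tn, fp, fn = confusion_counts(probabilities, labels, threshold)
    print(threshold, sensitivity(tp, fn), specificity(tn, fp))
```

Raising the threshold in this toy example trades sensitivity for specificity, which is why the threshold has to be reported together with the measures.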
Pushing the accuracy of the model on the training data too far may lead to an
over-fitted model. Such a model, when tested against the test set, cannot tolerate
minor variations in the data and produces inaccurate predictions. To avoid
the pitfall of over-training, not every minor variation in the training dataset
needs to be modelled, and a model that predicts a few false positives
is therefore acceptable. If the dataset is small and is partitioned into training
and test sets, there may not be enough data to both train and test the models. A
simple but popular solution to this problem is cross validation (Figure
1.8). In this method, the total data is randomly partitioned into X parts.
One part is used for testing while the remaining (X-1) parts are combined to form the training
set. This is repeated X times, each time using a different part as the
test set. When all parts have been used, the results are averaged to give a single
measure of performance. Since every test set is drawn from the same data that is used for training,
it is better to also have an external validation set, never seen by the model, for
selecting the model with the best complexity.
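The partitioning scheme described above can be sketched as follows. The sketch only handles the bookkeeping of the X parts; the `evaluate` callback, which would train a model on the training indices and return its performance on the test indices, is a hypothetical placeholder, as are the function names and the seed.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition sample indices into k parts and return (train, test) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

def cross_validate(n_samples, k, evaluate, seed=0):
    """Average the score returned by evaluate(train, test) over the k splits."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

splits = k_fold_indices(10, 4)
for train, test in splits:
    print(sorted(test), len(train))

# Placeholder evaluation that simply reports the test-set size
print(cross_validate(10, 4, lambda train, test: len(test)))
```

Each sample appears in exactly one test set, so every data point contributes once to the averaged performance.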
In the field of biology, ANNs have been successfully applied to prediction
problems. Well-known examples are secondary structure [92], post-translational
modifications [93], epitope prediction [94] and, more recently, disease outcome
[95]. ANNs have been used to detect associations between disease outcome and
multiple marker genotypes. This provides a simple and practical method
that allows multiple markers to be analysed simultaneously.
1.8 Translating High Throughput Variation Data to
All individuals differ from each other, and these differences are encoded in the variations of their genetic or epigenetic state. The hypothesis
that most studies in disease biology follow is to find what makes patients more
susceptible to disease than controls. These variations might be due to underlying genomic variation or to a variation in the mechanism controlling
expression, referred to as cellular signalling. In association studies, a test
is made to find which of these variations are more strongly related to the phenotypic
condition than the others. Once a set of suspected variations has been discovered,
the next step is to find the mechanism of action of these variations in light of
the observed phenotype. This is done by applying functional analyses to these variations to uncover their mode of action. There are
two principal ways of doing this. The first looks for the action of the variation
on the gene or the gene product, thus analysing each variation individually.
The second is a cumulative method, where all variations are mapped
to genes and the functional analysis is done on this gene set, also
taking into account the interacting partners of the genes and the proteins coded
by them (Figure 1.9).
Effects of Genomic Variations
Genomic variations are spread throughout the genome, and since 97%
of the human genome is non-protein coding [96], finding the effect of a
variation in these regions is difficult. Therefore, different methods need to
be applied for annotating coding and non-coding variations.
Variations located in protein-coding regions are the best annotated,
based on evolutionary and biochemical evidence. They are classified
depending on whether an amino acid is altered, a stop codon is gained or lost,
or the coding frame is changed. In a nutshell, the variations are
annotated according to their effect on the protein. As a single
gene can be transcribed to form multiple isoforms of a protein, the effects of
these variations need to be analysed at the transcript level [97]. There are
computational tools that use the location of the variation, its biochemical effect
and its evolutionary history in different organisms to predict the effect
of a polymorphism on the protein, as well as whether or not the polymorphism
is harmful to the organism. The most popular ones include SIFT [98]
and PolyPhen-2 [99]. There are other meta-analysis tools which take the
results from multiple predictors and produce a consensus score for each SNP,
e.g. ANNOVAR [100], Condel [101], SnpEff [102] and Variant Effect Predictor
(VEP) [97]. These tools vary in the number of information sources they
use and the statistics they apply. ANNOVAR uses six different scores [103],
while the others mainly rely on PolyPhen and SIFT. SNPs are annotated for
their effects, and there are databases storing the predicted effects of
common SNPs as well as their known associations with disease traits. These
include the Short Genetic Variations database (dbSNP) [104], Ensembl [105],
the Human Gene Mutation Database (HGMD) [106] and ClinVar [107]. This
knowledge base helps in filtering the pre-annotated SNPs before going to
the prediction phase. These databases are still far from complete, even
for the known variations, and thus rechecking the top hits manually and
validating them experimentally is always advised.
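The classification of coding variations by their effect on the protein can be illustrated with a minimal sketch. It assumes the standard genetic code and works on a single codon or a single indel, ignoring transcripts, strands and splice sites, which real annotation tools such as those cited above handle. The function names and the consequence labels are illustrative assumptions and do not reproduce the output of any specific tool.

```python
# Standard genetic code, built from the usual T, C, A, G ordering of the three positions
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_snv(ref_codon, alt_codon):
    """Classify a single-nucleotide change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    return "missense"

def classify_indel(ref_allele, alt_allele):
    """An indel changes the coding frame unless its length is a multiple of three."""
    if (len(alt_allele) - len(ref_allele)) % 3:
        return "frameshift"
    return "in-frame indel"

print(classify_snv("GAA", "GAG"))   # Glu -> Glu
print(classify_snv("GAA", "GAT"))   # Glu -> Asp
print(classify_snv("TGG", "TGA"))   # Trp -> stop
print(classify_snv("TAA", "CAA"))   # stop -> Gln
print(classify_indel("A", "AT"))
print(classify_indel("A", "ATTT"))
```

Because the same genomic position can fall in different codons of different isoforms, a full annotation would repeat this classification for every transcript overlapping the variation.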
Figure 1.9. Annotations of SNPs, their relation to pathways,
roles in diseases and comparative genomics [108]. The green lines
show the solved translations of variations, whereas the red connections
show the areas under development.
Since coding variations make up only 1% of the total genome variations, the
major portion of variations lies in the non-coding regions. Evolutionary studies have shown that many of these non-coding regions are conserved, and
many such conserved sequences are involved in regulating the expression of
neighbouring genes [109]. As a result, variations in these regions can have
a functional effect. Since most GWAS are carried out on genome-wide
SNP-chips, a large number of GWAS hits are non-coding. Such SNPs can impart
regulatory effects either because they lie in regions coding for microRNAs (miRNA) and long non-coding RNAs (lncRNA), or because they lie in transcription factor binding sites
or regulate expression by modulating chromatin architecture. Therefore,
it is very important to associate such SNPs with their function. The most
prominent effort in annotating non-coding variations is carried out by
the Encyclopedia of DNA Elements (ENCODE) Consortium [110].
Although population or cohort based studies are designed on the
“common disease common variant” principle, they have only been successful in explaining a modest fraction of the genetic component of
common human diseases. This is because there exist rare variants, with a frequency below
1% but still polymorphic in certain human populations, and a few of
these have been found to be associated with common diseases [111]. These
variations can be detected by whole genome sequencing of the affected individuals along with their family members, combined with detailed phenotypic knowledge.
The above-mentioned methods of annotating variations from genomic data
have been applied in chapter 2. Applications of these methods for analysing
the Danish pan-genome and data from an ancient genome project are explained
in chapter 7. In brief, these methods were used to classify the variations
into functional clusters. These clusters were further subjected to functional or
pathway enrichment analyses.
Enrichment Analysis
When a gene set is found to carry variations in a genetic study, or to be differentially expressed in a transcriptome study, there is a need to find the
biological functions enriched in the group. Enrichment analysis explores
the common features that cover a large portion of the set, rather than
studying all gene products individually.
Functional knowledge about a gene is obtained either from experiments or by sequence similarity approaches. Gene ontology (GO) is a
resource which classifies genes based on their known functions using a systematic vocabulary [112]. GO is a hierarchical classification of gene functions
where the lowest nodes represent the most specific known function of the gene.
The GO terms are classified into three major categories: cellular component,
molecular function and biological process. A number of methods have been
developed to test gene sets for enrichment of GO classes, e.g. AmiGO [113], GOrilla
[114], EasyGO [115], Gene set enrichment analysis (GSEA) [116] and DAVID
[117]. To ascertain the significance of enrichment, p-values are calculated
and corrected for multiple testing.
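A common way to obtain such p-values is a one-sided hypergeometric test, followed by a correction such as Bonferroni. The sketch below illustrates that general idea; it is not a description of how any of the tools listed above work internally, and the function names and the numbers are hypothetical.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from a
    background of `population` genes, `annotated` of which carry the GO term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)

def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# Hypothetical example: 20000 background genes, 200 annotated with a term,
# a gene set of 50 genes of which 5 carry the term
p = hypergeom_pvalue(20000, 200, 50, 5)
print(p)
print(bonferroni([p, 0.04, 0.5]))
```

Bonferroni is the most conservative of the usual corrections; less strict alternatives that control the false discovery rate are often preferred when many GO terms are tested.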
Proteins do not work independently in the cell, and the majority of them
interact physically with each other for proper functioning. The techniques
applied to detect protein-protein interactions (PPIs) in a cell include immunoprecipitation, selective protease digestion, western blotting, phage
display and two-hybrid analysis. There are numerous databases storing
known PPIs, for example the Database of Interacting Proteins (DIP) [118], the
Molecular INTeraction database (MINT) [119], IntAct [120], the Biomolecular
Interaction Network Database (BIND) [121], the General Repository for Interaction Datasets (GRID) [122] and the Human Protein Reference Database (HPRD)
[123]. These databases store the interaction information as interacting
pairs. Mutations in different members of a protein complex lead to comparable phenotypes. Based on this, collections of protein complexes can be
associated with known diseases, organs and GO classes [124].
Transcription factors (TFs) are the regulatory proteins required for the
activation or deactivation of transcription by binding to specific DNA sequences called TF binding sites. TF binding is sequence specific, and
each TF has a specific binding motif. The JASPAR CORE [125] and Transfac [126] databases contain curated, non-redundant sets of TF profiles derived from
experimentally determined TF binding sites. ChEA [127], CistromeMap [128],
CTCFBSDB [129] and CHIPBase [130] are databases with genome-scale
maps of TF binding. TFSEARCH [131], PROMO [132], the MEME suite [133],
P-match [134] and SiTAR [135] are computational tools which can be used
to predict TF binding sites. TF binding site prediction tools were used
to identify TF enrichment in the differentially
regulated gene sets reported in chapters 5 and 6.
Pathway Analysis
A biological pathway is a series of events occurring among the molecules
within a cell that leads to a change in the cell physiology or morphology.
Pathway analysis gives insight into the underlying biology of differentially
expressed or polymorphic genes (Figure 1.9). Grouping a gene set into biological pathways reduces the complexity and assists in identifying
the mechanisms [136]. The biological pathway databases mainly used in the
functional analysis performed in this thesis are Kyoto Encyclopedia of Genes
and Genomes (KEGG) [137], Reactome [138] and Database of Cell Signaling
Integrative Analysis
Complementing the analysis of a gene set with different types of data can
improve the functional relevance of the gene set [139]. This is because
not all genes are affected at the same time, and not all changes can
be captured by a single experiment. Augmenting the
results from one data type with another helps to fill these gaps. For example, differentially regulated gene sets can be further subjected to PPI
analysis using tools like Ingenuity Pathway Analysis (IPA) (Ingenuity® Systems), Explain [140], GeneMANIA [141], STRING [142] and
Enrichr [143]. Visualisation tools like Cytoscape also help in interpreting these interactions [144].
The tools mentioned in this section are complementary to each
other, with some having a few advantages over the others. The GO enrichment
tools are good for getting a general idea of the functional impact of the gene
sets. Tools like Explain, which provide manually curated GO annotations from published
functional studies, have higher confidence but less coverage. DAVID, GSEA,
IPA, GeneMANIA, Explain and Enrichr have a wide range of background
data sets against which the test set can be queried. GeneMANIA provides
an integrative network based on multiple sources. IPA and Explain are
commercial tools with high-confidence, manually curated data. They provide information about signalling and transcriptional networks and specific
cancer pathways. GSEA also has multiple sources and tools with varied
functionalities, available as modules in GenePattern [145].
Pathway Based Prediction Tool
Pathway-based methods group the variations or genes into pre-selected
subsets, allowing joint effects to be tested. Pathway-based GWAS have
higher power than several other approaches to find associations between pathways and
disease [146]. This approach of combining SNPs into pathway subsets
is utilised in the asthma risk prediction tool presented in chapter 4 of the
thesis. Following the hypothesis that a specific combination of genetic factors,
when integrated with certain clinical or environmental features, increases the
disease risk [147], the genetic and clinical features were tested in combinations for childhood asthma prediction. For this prediction tool, a variation
of a genetic algorithm coupled with an ANN was designed for feature selection
and prediction. The results from this pathway-based approach are further
discussed in the manuscript following the chapter.
These pathway-based approaches are used to complement single SNP
studies to uncover the underlying biological mechanisms. However, they
suffer from the drawback that the pathway knowledge base is not yet
fully developed. A few genes are very well studied, while
others are still to be included in any pathway (e.g. ARID5B). None of
the pathway resources is complete, as they are developed from different perspectives. They complement each other, but individually they
lack a comprehensive coverage of all biological processes. The success of
pathway-based methods therefore depends on the future development
of pathway resources, and there is a need to increase the resolution of the
databases and to complete and correct the information in them [108]. On the
methodology side, additional and precise benchmark datasets generated
from real biological sets would increase the sensitivity and specificity of pathway-based
enrichment analysis methods.
Challenges of Next Generation Sequencing
As most of the data used in the projects come from high throughput sequencing,
it is important to discuss the difficulties faced with this data. Base
calling is the most critical step in NGS data interpretation. Technology
differences between platforms and the use of different base calling algorithms lead
to platform-specific errors. Considering the raw sequencing error rates, accurate
mapping of the reads is a major bottleneck. During mapping, reads that map to
multiple locations are either discarded or reported, either as one randomly chosen alignment
or as up to a user-defined maximum number of alignments. The raw
error rates and the possibility of multiple alignments introduce mis-alignments.
These mis-alignments can be efficiently reduced by using longer or paired-end
reads. Methodological differences between aligners and variant callers
also affect the variant calls. These differences do not affect the robust calls, but a small portion of the variant calls may still turn
out to be false [148]. To reduce the false calls, it has been suggested to use
variants called by multiple variant calling pipelines [148]. Also, using multigeneration familial data increases the accuracy of de novo variant calls [148].
Finally, it is recommended to validate a SNP or indel by another method,
which will substantiate the call made by the program. In addition, sequencing generates a massive amount of data, which poses a big bioinformatics
challenge for the storage, quality control, alignment, assembly and annotation of
millions and billions of reads. Along with the technological worries,
there are certain biological concerns when sequencing samples that are treated in
non-standard ways. For example, formalin-fixed and paraffin-embedded
(FFPE) samples, a common method of storage in hospitals, are
prone to degradation during sample preparation. Degraded samples lead
to high error rates in sequencing and to low coverage. Thus, new NGS technologies as well as data analysis methods need to take these effects into account.
Sequencing tumour samples poses another problem, as tumours are very heterogeneous. Even if the sequencing is done on a single sample, it is generally
a population of non-identical cells that has been sequenced. More precise
results can be obtained for such samples with the development of single cell
sequencing. However, this is still in a developing phase and there are no
specialised data analysis tools for such sequencing. There is a vast variety of
NGS technologies and tools available, generally tied together to form an NGS
pipeline, and the choice among them depends on the biological problem
in question.
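The suggestion above, to rely on variants called by multiple variant calling pipelines, can be sketched as a simple consensus filter. The sketch assumes that each pipeline's calls have already been normalised to comparable (chromosome, position, reference allele, alternative allele) tuples; the function name, the pipeline outputs and the support threshold are hypothetical.

```python
from collections import Counter

def consensus_calls(callsets, min_support=2):
    """Keep the variants reported by at least `min_support` pipelines."""
    counts = Counter(variant for calls in callsets for variant in set(calls))
    return {variant for variant, n in counts.items() if n >= min_support}

# Hypothetical calls from three pipelines, for illustration only
pipeline_a = [("chr7", 1000, "G", "A"), ("chr1", 500, "C", "T")]
pipeline_b = [("chr7", 1000, "G", "A"), ("chr2", 42, "A", "AT")]
pipeline_c = [("chr7", 1000, "G", "A"), ("chr1", 500, "C", "T")]

callsets = [pipeline_a, pipeline_b, pipeline_c]
robust = consensus_calls(callsets, min_support=2)
strict = consensus_calls(callsets, min_support=3)
print(sorted(robust))
print(sorted(strict))
```

Requiring agreement between all pipelines gives the most conservative set, while a lower support threshold keeps more calls at the cost of more potential false positives.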
1.9 Complex Phenotypes
The internally coded heritable information called the “genotype”, found in all
living organisms, contains the instructions for the structures and
processes of the organism. These instructions are interpreted by the cellular
machinery to manifest the external appearance and other complex phenomena such as metabolism, tissues, organs, functions and behaviours, which
are collectively called the “phenotype”. At the cellular level, the phenotype can be
defined as the observable physical and/or biochemical characteristics resulting from the
genes expressed within a cell. It is known that the phenotype is the result of
the genotype interacting with the environment. Thus, phenotypes can be
predicted from genotypes and vice versa. The mechanisms of DNA, RNA
and protein interactions active inside the cell affect the observable traits of
the cell. Most phenotypes are complex, as they are the combined result
of multiple interactions between different cellular components. When the
balance between these complex interactions within a cell or an organism is
disturbed, it leads to a disorder or disease state.
Diseases that have multiple causative factors and present different symptoms in different individuals are termed “complex diseases”. These diseases
do not obey the standard Mendelian patterns of inheritance. The disease
causing factors can be genetic, environmental or a combination of both [149].
Some examples of well-known complex diseases are Alzheimer’s disease,
scleroderma, asthma, Parkinson’s disease, multiple sclerosis, diabetes, obesity and cancer. These diseases differ in their symptoms amongst individuals
and can thus be divided into sub-diseases with overlapping symptoms, called
endophenotypes [150], which are also found to differ in their causal factors.
Some individuals are predisposed to certain diseases. Genetic predisposition
means that the genetic makeup makes the person susceptible to the disease,
but it does not mean that the person will develop the disease. The gene products
interact with the environment at the molecular level. Similarly, environmental factors which could potentially lead to a condition might not
affect an individual, because the macromolecules within the cells do
not support the action of the environment on the body. Thus, gene-environment
coordination plays an important role in determining the course of disease,
and as we cannot change our genes, environmental and lifestyle changes may
help in the prevention of some diseases.
Studies involving two complex diseases, childhood asthma and obesity,
are part of this thesis and are thus discussed in further detail in Parts II and
III respectively.
Part II
Childhood Asthma
Asthma is one of the most common non-communicable diseases. According to
the WHO (2013), approximately 235 million people currently suffer from
asthma. Asthma is one of the most common chronic diseases of childhood
and the most frequent reason for paediatric hospitalisations [151]. It is
a disease characterised by recurrent attacks of breathlessness and wheezing.
The majority of asthmatics are also atopic, as they are allergic to aeroallergens
and food elements. IgE, the central player in the allergic response, is found to
be elevated in asthmatic individuals. Asthma has significant heterogeneity
in phenotypes, which has led to multiple classifications. The phenotypic classification of asthma into early, transient, late onset and persistent wheeze by the
Tucson group has been widely adopted [152]. In asthma, airway inflammation contributes to airway hyperresponsiveness and to airflow limitation due to
mucus hypersecretion or smooth muscle hypertrophy. Evidence also suggests
a key role for respiratory infections in these processes.
Childhood asthma
When asthma occurs before the age of 18 years, it is treated as childhood
asthma, and asthma occurring in infants is called early onset asthma. It has been
reported that sensitisation by microbial infections in early life reduces the
risk of asthma [154]. The rise in asthma cases over the last few decades has been attributed
to the absence of multiple infections at an early age [155]. Asthma in
children is hard to diagnose, although it manifests with symptoms similar to those in adults.
Multiple risk factors have been found to be associated with childhood
asthma [156]. Exposure of the fetus to maternal smoking [157], maternal atopy,
preeclampsia and hypertension are associated with asthma and similar phenotypes in newborns [158]. Early sensitisation to aeroallergens has been
shown to have predictive power for wheeze, bronchial responsiveness and loss
of lung function [159]. Low birth weight of newborns is also a risk factor
for asthma. Racial disparities prevail in asthma, with other socio-economic
factors like the size of the family, the size of the house, poverty and the mother’s age at
the time of birth being associated with an increased risk of childhood asthma
[160]. Environmental factors play protective as well as causative roles in
childhood asthma [161]. Exposure to microbes in the early stages of life
may be sufficient to stimulate the pattern-recognition receptors of the innate
immune response, thus sensitising the body in the form of memory T cells and
protecting it from more severe microbial infections, which could lead to asthma.
Environmental conditions like air pollution, tobacco smoke and dampness in
the house have adverse effects on childhood asthma.
Figure I.1 Genome wide spread of asthma related genes [153]
Multiple studies have associated more than 100 genetic factors with asthma,
based on GWAS carried out on samples from different ethnicities (Figure I.1).
Some of the strongest associations have been replicated with the same power in separate
studies. The 17q21 locus encoding orosomucoid like 3 (ORMDL3) and gasdermin B (GSDMB) has been associated with childhood asthma in ethnically
diverse subjects from Europe, North America and Asia [162]. Multiple IgE
controlling genes, e.g. FCER1A, are found to mediate asthma [163]. Other
genes whose association with asthma has been replicated in multiple studies are
DENND1B [164], the locus containing IL1RL1 [162] and IL18R1 [162], HLA-DQ
[162], IL33 [162] and SMAD3 [162]. Stress has also been found to play a
critical role in asthmatic attacks in children, connected to epigenetic
and genetic alterations in the ADCYAP1R1 gene [165]. Replications of asthma
GWAS have identified multiple loci that are also associated with pulmonary
function [166] and lung function [167], thus indicating a common cause behind
related phenotypes. Asthma heritability is estimated to be 70-90%, and
GWAS have so far identified a limited number of loci explaining a small percentage
of this, leaving the larger portion still to be revealed.
In the case of a complex disease like asthma, which has a vast variety of symptoms, phenotypic classification can help in an effective search for the causative
variants. Exacerbation, one of the asthma phenotypes, is defined as frequent
admissions to hospital with the asthma phenotype. Chapter 2 of the thesis
includes a manuscript titled “A genome-wide association study identifies
CDHR3 as a susceptibility locus for early childhood asthma with severe
exacerbations”. In this study, a GWAS was carried out on a cohort of children
with exacerbations and normal adult controls. Genotyping was performed
with SNP-arrays to identify loci associated with exacerbation.
The fact that different SNPs from the same gene are discovered as disease associated in multiple GWAS with similar odds ratios shows that we still
have not found all causal variants. Thus, there is a need to identify the
missing causal variation within these genes. This can be done by selectively
sequencing the candidate genes, which provides not only the single point variations in these loci but also the structural variations. The study in chapter 3
is based on the sequencing of 16 candidate regions which have been associated
with asthma and related phenotypes.
Gene-environment interactions are important in the development and
expression of asthma. Thus, clinical features need to be included along with
genetic features for predicting asthma. Chapter 4 of this thesis aims to
combine genotypes and clinical features using an ANN to predict asthma at the age
of 7 years.
Chapter 2
Paper I - Genome-wide
association analysis of childhood
Asthma has multiple dimensions, ranging from frequent wheeze to multiple hospitalisations, called exacerbations. Amongst
all the asthma phenotypes, exacerbations have the greatest impact on health
care and treatment costs [168]. Schatz et al. [169] reported that exacerbation
clusters separately from daily symptoms and lung function in discriminant analysis, suggesting that the factors responsible for exacerbation
risk may differ from those of the other asthma phenotypes.
To find genetic factors for exacerbation, a GWAS was designed in a cohort of children. The cohort was stratified by the number of exacerbations to distinguish
the variations responsible for differences in the severity of the phenotype. To
test the robustness of the significant hits in the GWAS, the study was replicated
in two birth cohorts of European ancestry. Replication in a cohort of mixed
ancestry was done to examine cross-ethnicity causal variants. Regional
imputation as described in section 1.3 was performed for the top hits to
examine the variants missed in genotyping. A novel SNP in CDHR3
(rs6967330) was further analysed. The study showed the importance of
specific phenotyping when using small cohorts in GWAS.
My contribution to the project
My contribution to this project was to investigate the impact of the novel
GWAS hit, rs6967330, on the CDHR3 gene product and its influence on
asthma exacerbation outcome. Different databases were searched to collect
the published literature on CDHR3 and related proteins. CDHR3 is a
transmembrane protein with six extracellular cadherin domains, belonging to a
family of membrane adhesion proteins. The members of the cadherin
family mediate Ca2+-dependent cell-cell adhesion in all solid tissues. These
proteins also modulate a wide variety of processes including cell polarisation
and migration [170] [171]. According to the UniProt Knowledgebase, these
proteins preferentially interact within the protein family in a homophilic
manner when connecting cells, and are thus suggested to contribute to the
sorting of heterogeneous cell populations. Other members of the cadherin family,
E-cadherin [172] and protocadherin-1 [173], have previously been associated with
asthma-related traits.
Since not all genes are expressed in every cell of the
body, the first aim was to find out whether CDHR3 is expressed in any
asthma-related tissue. Gene expression data sets from GEO
were curated, and those containing CDHR3 probes were queried to retrieve
expression values for CDHR3. These expression values were normalised with
respect to the other probes on the array. CDHR3 was found to be differentially
over-expressed in lungs [174]. In another study of human post-mortem tissue
samples, CDHR3 was found to be over-expressed in bronchi, trachea and
lungs [175]. Gene expression profiling of the human hematopoietic system
showed high expression of CDHR3 in B-lymphocytes compared with other
white blood cells from healthy individuals [176]. CDHR3
was also found to be tenfold up-regulated in differentiating epithelial cells,
a process involved in the development of airway epithelium [177].
These findings were used for hypothesis generation and for designing the
experiments to find the effect of the SNP.
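The normalisation of a probe against the other probes on the same array can be sketched as follows. This is only a minimal illustration that assumes a z-score across probes; the text does not state which normalisation was actually applied. The function name and the intensity values are hypothetical, and only the probe identifier 235650_at is taken from the published paper.

```python
from statistics import mean, pstdev


def zscore_probe(intensities, probe_id):
    """Return the z-score of one probe relative to all probes on the array."""
    values = list(intensities.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all probes have identical intensity")
    return (intensities[probe_id] - mu) / sigma


# Hypothetical log2 intensities for a toy array. Only the probe ID 235650_at
# (CDHR3) comes from the paper; every value below is made up for illustration.
toy_array = {"235650_at": 10.0, "probe_a": 6.0, "probe_b": 7.0, "probe_c": 5.0}
cdhr3_z = zscore_probe(toy_array, "235650_at")
```

A positive z-score in a given tissue would indicate that the probe signal lies above the array average in that sample, which is the kind of evidence summarised in the paragraph above.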
SNP rs6967330 (G>A) is a non-synonymous variation in which cysteine, a
medium-sized polar amino acid, is replaced by tyrosine, a large
aromatic amino acid. This SNP lies in cadherin domain 5
of the protein. The ancestral allele “A” codes for tyrosine
and is the frequent allele in mammals other than humans. The SNP is predicted
to be deleterious by the SNP effect prediction tool Condel [101]. Based on these
facts about the SNP and the gene, functional studies were designed to examine
the expression of the non-mutated and mutated proteins. The homology model of
mutated CDHR3 suggests that the SNP interferes with protein stabilisation
and folding, which is in accordance with the experimental results.
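The effect of a G>A substitution on a cysteine codon can be illustrated with a short sketch. It assumes the change falls on the second base of the codon on the coding strand and uses TGC as the reference codon; the text does not give the actual codon, so both are assumptions made for illustration. Only the four relevant entries of the standard genetic code are included.

```python
# Entries of the standard genetic code needed for this example only.
CODON_TO_AA = {"TGT": "Cys", "TGC": "Cys", "TAT": "Tyr", "TAC": "Tyr"}


def substitute(codon, position, new_base):
    """Return the codon with the base at the 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def classify(ref_codon, alt_codon):
    """Report both amino acids and whether the change is synonymous."""
    ref_aa, alt_aa = CODON_TO_AA[ref_codon], CODON_TO_AA[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


ref_codon = "TGC"  # assumed cysteine codon; TGT behaves the same way
alt_codon = substitute(ref_codon, 1, "A")  # G>A at the second codon position
change = classify(ref_codon, alt_codon)
```

Both cysteine codons differ from a tyrosine codon only at the middle base, which is why a single G>A change is enough to produce the cysteine to tyrosine replacement described above.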
© 2013 Nature America, Inc. All rights reserved.
A genome-wide association study identifies CDHR3 as
a susceptibility locus for early childhood asthma with
severe exacerbations
Klaus Bønnelykke1,24,25, Patrick Sleiman2,24, Kasper Nielsen3,24, Eskil Kreiner-Møller1, Josep M Mercader4,
Danielle Belgrave5,6, Herman T den Dekker7–9, Anders Husby1,10, Astrid Sevelsted1, Grissel Faura-Tellez11,12,
Li Juel Mortensen1, Lavinia Paternoster13, Richard Flaaten1, Anne Mølgaard1, David E Smart10, Philip F Thomsen14,
Morten A Rasmussen15, Silvia Bonàs-Guarch4, Claus Holst16, Ellen A Nohr17,18, Rachita Yadav3,
Michael E March2, Thomas Blicher19, Peter M Lackie11, Vincent W V Jaddoe7,9,20, Angela Simpson5,
John W Holloway11, Liesbeth Duijts8,9,21, Adnan Custovic5, Donna E Davies10, David Torrents4,22,
Ramneek Gupta3, Mads V Hollegaard23, David M Hougaard23, Hakon Hakonarson2,25 & Hans Bisgaard1,25
Asthma exacerbations are among the most frequent causes of
hospitalization during childhood, but the underlying mechanisms
are poorly understood. We performed a genome-wide association
study of a specific asthma phenotype characterized by recurrent,
severe exacerbations occurring between 2 and 6 years of age in
a total of 1,173 cases and 2,522 controls. Cases were identified
from national health registries of hospitalization, and DNA was
obtained from the Danish Neonatal Screening Biobank. We
identified five loci with genome-wide significant association.
Four of these, GSDMB, IL33, RAD50 and IL1RL1, were previously
reported as asthma susceptibility loci, but the effect sizes for
these loci in our cohort were considerably larger than in the
previous genome-wide association studies of asthma. We also
obtained strong evidence for a new susceptibility gene, CDHR3
(encoding cadherin-related family member 3), which is highly
expressed in airway epithelium. These results demonstrate the
strength of applying specific phenotyping in the search for asthma
susceptibility genes.
Acute asthma exacerbations are among the most frequent causes of
hospitalization during childhood and are responsible for large healthcare expenditures1–4. Available treatment options for prevention and
treatment of asthma exacerbations are inadequate5, suggesting that
asthma with severe exacerbations may represent a distinct subtype
of disease and demonstrating a need for improved understanding of
its pathogenesis.
Asthma heritability is estimated to be 70–90% (refs. 6,7), but only
a limited number of susceptibility loci have been verified in genome-wide association studies (GWAS)8–13. Larger GWAS may identify
new susceptibility loci with smaller effects, but, owing to the large
heterogeneity in asthma14, an alternative strategy is to increase phenotype specificity in genome-wide analyses. A specific phenotype is
likely to be more closely related to a specific pathogenetic mechanism,
and focusing on a particular phenotype may increase the power of
genetic studies.
We aimed to increase understanding of the genetic background of
early childhood asthma with severe exacerbations by conducting a
1Copenhagen Prospective Studies on Asthma in Childhood, Health Sciences, University of Copenhagen & Danish Pediatric Asthma Center, Copenhagen University
Hospital, Gentofte, Denmark. 2Center for Applied Genomics, Children’s Hospital of Philadelphia (CHOP), Philadelphia, Pennsylvania, USA. 3Center for Biological
Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark. 4Joint Institute for Research in Biomedicine and Barcelona
Supercomputing Center (IRB-BSC) Program on Computational Biology, Barcelona Supercomputing Center, Barcelona, Spain. 5Centre for Respiratory Medicine and Allergy,
Institute of Inflammation and Repair, University of Manchester and University Hospital of South Manchester, Manchester, UK. 6Centre for Health Informatics, Institute
of Population Health, University of Manchester, Manchester, UK. 7Generation R Study Group, Erasmus Medical Center, Rotterdam, The Netherlands. 8Department of
Pediatrics, Division of Respiratory Medicine, Erasmus Medical Center, Rotterdam, The Netherlands. 9Department of Epidemiology, Erasmus Medical Center, Rotterdam,
The Netherlands. 10Brooke Laboratory, Clinical and Experimental Sciences, Faculty of Medicine, University of Southampton, University Hospital Southampton,
Southampton, UK. 11Faculty of Medicine, University of Southampton, Southampton General Hospital, Southampton, UK. 12Pediatric Pulmonology and Pediatric
Allergology, University of Groningen, University Medical Center Groningen, Beatrix Children’s Hospital, Groningen Research Institute for Asthma and COPD, Groningen,
The Netherlands. 13Integrative Epidemiology Unit, School of Social & Community Medicine, University of Bristol, Bristol, UK. 14Center for GeoGenetics, Natural History
Museum of Denmark, University of Copenhagen, Copenhagen, Denmark. 15Department of Food Science, University of Copenhagen, Copenhagen, Denmark. 16Institute
of Preventive Medicine, Copenhagen University Hospital, Copenhagen, Denmark. 17Institute of Clinical Research, University of Southern Denmark, Aarhus, Denmark.
18Department of Public Health, Section for Epidemiology, Aarhus University, Aarhus, Denmark. 19Novo Nordisk Foundation Center for Protein Research, Faculty of
Health Sciences, University of Copenhagen, Copenhagen, Denmark. 20Department of Pediatrics, Erasmus Medical Center, Rotterdam, The Netherlands. 21Department
of Pediatrics, Division of Neonatology, Erasmus Medical Center, Rotterdam, The Netherlands. 22Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona,
Spain. 23Danish Centre for Neonatal Screening, Department of Clinical Biochemistry and Immunology, Statens Serum Institut (SSI), Copenhagen, Denmark.
24These authors contributed equally to this work. 25These authors jointly directed this work. Correspondence should be addressed to K.B. ([email protected]).
Received 27 May; accepted 28 October; published online 17 November 2013; doi:10.1038/ng.2830
Figure 1 Manhattan plot for the discovery genome-wide association
analysis (x axis, chromosome; y axis, −log10 (P value)). The horizontal
line indicates the genome-wide significance threshold (P < 5 × 10−8).
GWAS of this particular asthma phenotype. We identified children
with recurrent acute hospitalizations for asthma occurring between 2
and 6 years of age (cases) from the Danish National Patient Register.
We then extracted and amplified DNA from dried blood spot samples
isolated from the Danish Neonatal Screening Biobank, as previously
described15,16, before genome-wide array genotyping (Affymetrix
Axiom CEU array).
Case criteria were fulfilled for 2,029 of 1.7 million children born in
Denmark between 1982 and 1995 (1.1/1,000 children). The final case
cohort (Copenhagen Prospective Studies on Asthma in Childhood
exacerbation cohort, COPSACexacerbation) after genotyping and quality
control comprised 1,173 children (Supplementary Fig. 1). Compared
to the general population, cases were more often boys (67 versus 51%)
and more often had mothers who smoked during pregnancy (32 versus 15%) (Supplementary Tables 1 and 2). Controls consisted of 2,511
individuals of Danish descent without asthma who were previously
genotyped (Illumina Human610-Quad v1.0 BeadChip). We analyzed
association between disease and 124,514 SNPs genotyped in both
cases and controls, and we accounted for population stratification
by multidimensional scaling. The genomic inflation factor was 1.04.
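As a minimal sketch of how a genomic inflation factor is commonly computed, the following assumes the usual definition: the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null (about 0.455). The function names are hypothetical, and the paper does not describe its own computation in this section.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def chi2_from_p(p):
    """Convert a two-sided P value (0 < p <= 1) to a 1-df chi-square statistic."""
    # Using the lower tail (p / 2) keeps precision for very small P values.
    z = _STD_NORMAL.inv_cdf(p / 2)
    return z * z


def genomic_inflation(p_values):
    """Lambda: median observed chi-square over the median expected by chance."""
    return median(chi2_from_p(p) for p in p_values) / CHI2_1DF_MEDIAN


# P values spread uniformly between 0 and 1, as expected with no inflation.
null_p_values = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_inflation(null_p_values)
```

Values close to 1, such as the 1.04 reported here, indicate that the bulk of the test statistics behaves as expected by chance, so residual population stratification is limited.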
The genome-wide association analysis detected an excess of association signals beyond those expected by chance (Supplementary
Fig. 2), and SNPs from five regions reached genome-wide significance (P < 5 × 10−8; Fig. 1 and Supplementary Fig. 3). The top SNPs
from the five loci were rs2305480 in GSDMB (odds ratio (OR) = 2.28,
P = 1.3 × 10−48), rs928413 near IL33 (OR = 1.50, P = 4.2 × 10−13),
rs6871536 in RAD50 (OR = 1.44, P = 1.7 × 10−9), rs1558641 in IL1RL1
(OR = 1.56, P = 6.6 × 10−9) and rs6967330 in CDHR3 (OR = 1.45,
P = 1.4 × 10−8) (Table 1). Validation of results for the top SNPs by
regenotyping of cases and use of an alternative control population
gave similar results (Supplementary Tables 3 and 4).
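For readers less familiar with odds ratios, the sketch below shows how an allelic OR and an approximate 95% confidence interval can be obtained from a 2 × 2 table of allele counts using the standard log-OR standard error (Woolf method). It is an illustration only: the counts are hypothetical, and the estimates reported above come from the study's own association model, which may differ from this simple calculation.

```python
import math


def odds_ratio_ci(case_risk, case_other, ctrl_risk, ctrl_other, z=1.96):
    """Allelic odds ratio with a Woolf-type confidence interval.

    Arguments are allele counts: risk and non-risk alleles in cases,
    then risk and non-risk alleles in controls.
    """
    odds_ratio = (case_risk * ctrl_other) / (case_other * ctrl_risk)
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / ctrl_risk + 1 / ctrl_other
    )
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


# Hypothetical allele counts, chosen only to make the arithmetic easy to follow.
or_est, or_low, or_high = odds_ratio_ci(30, 70, 20, 80)
```

An interval that excludes 1 corresponds to nominal significance at the 5% level; genome-wide significance requires the much stricter threshold of P < 5 × 10−8 used in this study.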
Association analyses in the discovery cohort stratified on
number of asthma-related hospitalizations showed higher OR with
increasing number of hospitalizations for all five SNPs (Table 2).
There was no significant interaction between the top SNPs and no
effect modification by sex.
We first sought replication in the childhood-onset stratum
(with onset before 16 years of age) from a previous GWAS of
asthma including 14,503 individuals conducted by the GABRIEL
Consortium11 (Supplementary Table 5), which showed evidence of association for all 5 of the genome-wide significant loci
reported here (Table 1). The CDHR3 locus was the only locus
that had not previously been associated with asthma or any other
atopic trait. We therefore followed up the top SNP from this locus
(rs6967330) by further replication in a total of 3,975 children from
2 birth cohorts of European ancestry (COPSAC2000 and the
Manchester Asthma and Allergy Study (MAAS)) and in 1 cohort with
a population of mixed ancestry (Generation R). There was evidence
for association with asthma before the age of 6 years in combined
analyses of the three birth cohorts and in the combined replication
sets (Table 1, Supplementary Fig. 4 and Supplementary Table 6),
as well as in a subsample including the 980 individuals with non-European ancestry (Supplementary Table 6).
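Combined analyses of this kind are often carried out with an inverse-variance fixed-effects model. The following is a minimal sketch of that standard technique, not a reproduction of the analysis performed here; the function name and the input values are hypothetical, and standard errors are recovered from 95% confidence intervals on the log scale.

```python
import math
from statistics import NormalDist


def fixed_effect_meta(estimates, z_crit=1.96):
    """Pool (odds_ratio, ci_low, ci_high) tuples with inverse-variance weights.

    Returns the pooled odds ratio, its 95% CI bounds and a two-sided P value.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for odds_ratio, ci_low, ci_high in estimates:
        beta = math.log(odds_ratio)
        se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
        weight = 1 / se ** 2
        weighted_sum += weight * beta
        total_weight += weight
    pooled_beta = weighted_sum / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    p_value = 2 * NormalDist().cdf(-abs(pooled_beta / pooled_se))
    return (
        math.exp(pooled_beta),
        math.exp(pooled_beta - z_crit * pooled_se),
        math.exp(pooled_beta + z_crit * pooled_se),
        p_value,
    )


# Two hypothetical cohorts with identical estimates: the pooled OR is unchanged
# while the confidence interval narrows.
pooled = fixed_effect_meta([(1.5, 1.2, 1.875), (1.5, 1.2, 1.875)])
```

A random-effects model additionally allows the true effect to differ between cohorts, which matters when the heterogeneity test indicates between-study variation.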
Phenotype-specific replication was possible in the COPSAC2000
and MAAS birth cohorts with prospective registration of acute asthma
hospitalizations and exacerbations from birth to 6 years of age in a
Table 1 Discovery and replication results for the five genome-wide significant loci in the discovery analyses
SNP effect
Distance to
gene (bp)
Effect allele
Replication 1
Replication 1
Replication 1
Replication 1
Replication 1
Replication 2
Replications 1 + 2
Discovery + replications 1 + 2
OR (95% CI)
P value (fixed- P value (random
effects model)a effects model) P heterogeneity
1.3 × 10−48
6.4 × 10−23
4.2 × 10−13
8.8 × 10−13
1.8 × 10−9
7.6 × 10−7
6.6 × 10−9
1.4 × 10−8
3.0 × 10−6
3.2 × 10−4
1.6 × 10−8
2.7 × 10−14
6.4 × 10−23
2.5 × 10−6
7.6 × 10−7
1.3 × 10−4
3.2 × 10−4
2.6 × 10−6
2.7 × 10−7
Replication P values are shown in bold if significant after Bonferroni correction for the five loci tested (P < 0.01). Replication 1 results are from a previously published large-scale
GWAS of asthma (asthma onset before 16 years; subanalysis of ref. 11). Replication 2 results are from the COPSAC2000, MAAS and Generation R cohorts (asthma onset before
6 years). Chr., chromosome.
aFixed-effects model was not applied in the discovery analysis.
Table 2 Association results for the five genome-wide significant and replicated top SNPs stratified on number of hospitalizations for
asthma or acute bronchitis from 0–6 years of age in the discovery cohort
Number of asthma-related hospitalizations
SNP effect allele
n = 272
n = 228
n = 277
6 or more
n = 358
Association between
number of hospitalizations
and genotype
P valuea
Nearest gene
OR (95% CI) P value
OR (95% CI) P value
OR (95% CI) P value
OR (95% CI) P value
1.87 (1.54–2.26)
1.5 × 10−10
2.24 (1.81–2.78)
2.1 × 10−13
2.24 (1.83–2.73)
1.7 × 10−15
2.72 (2.26–3.28)
3.5 × 10−27
1.32 (1.09–1.61)
1.22 (0.98–1.50)
1.47 (1.21–1.79)
8.5 × 10−5
1.91 (1.61–2.26)
6.2 × 10−14
2.4 × 10−4
1.31 (1.06–1.61)
1.26 (1.00–1.59)
1.45 (1.18–1.78)
3.6 × 10−4
1.58 (1.31–1.89)
1.3 × 10−6
1.53 (1.16–2.02)
1.20 (0.91–1.57)
1.32 (1.02–1.71)
2.19 (1.66–2.90)
1.23 (0.98–1.56)
1.37 (1.07–1.75)
1.42 (1.13–1.78)
3.2 × 10−8
1.63 (1.33–1.97)
1.6 × 10−6
Only the 1,135 children with full follow-up were included. The number of controls was 2,511 for all analyses.
aTest for linear association.
total of 1,091 children. The rs6967330 risk allele (A) was associated
with greater risk of asthma hospitalizations (hazards ratio (HR) = 1.7
(95% confidence interval (CI) = 1.2–2.4), P = 0.002) and severe
exacerbations (HR = 1.4 (95% CI = 1.1–1.9), P = 0.007) in combined
analyses (Fig. 2, Supplementary Fig. 5 and Supplementary Table 6).
In COPSAC2000, we observed a trend in the direction of increased
neonatal bronchial responsiveness associated with the rs6967330 risk
allele (P = 0.10) (Supplementary Table 7). There was no association
with eczema in any of the three birth cohorts, and data on allergic
sensitization were inconsistent (Supplementary Table 6).
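The relationship between a hazard ratio, its confidence interval and the P value can be checked approximately with a normal (Wald) approximation on the log scale. The sketch below applies this to the hospitalization estimate quoted above; it is an assumption-laden consistency check rather than the authors' calculation, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def p_from_ratio_ci(ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate z statistic and two-sided P value from a ratio and its 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(ratio) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Asthma hospitalizations: HR = 1.7 (95% CI = 1.2-2.4), as reported in the text.
z_hosp, p_hosp = p_from_ratio_ci(1.7, 1.2, 2.4)
```

Because the published hazard ratio and interval are rounded to one decimal place, the recomputed P value should only be expected to be of the same order as the reported one.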
The top SNP at the CDHR3 locus (rs6967330) is a nonsynonymous coding SNP, where the risk allele (A), corresponding to the
minor allele, results in an amino acid change from cysteine to tyrosine
at position 529. This SNP is the only known nonsynonymous variant in this linkage disequilibrium (LD) region, but there are other
variants located within Encyclopedia of DNA Elements (ENCODE)-predicted regulatory regions that are in moderate to high LD (r2 > 0.5)
with the sentinel SNP (Supplementary Table 8). Two SNPs with partial LD (r2 = 0.71 and 0.58) were also associated with asthma in the
discovery analysis but with less statistical significance. A similar
association pattern with rs6967330 as the top SNP was observed in
the GABRIEL (replication) study (Supplementary Fig. 6) and in
the Generation R (replication) subsample of individuals with non-European ancestry (Supplementary Fig. 7), suggesting that rs6967330
might be the causal gene variant at this locus.
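The LD measure r2 used above can be written in terms of haplotype and allele frequencies as r2 = D^2 / (pA(1 − pA) pB(1 − pB)), with D = pAB − pA pB. The sketch below implements this textbook definition; the function name and the frequencies are hypothetical and are not taken from the study data.

```python
def ld_r2(p_ab, p_a, p_b):
    """Squared correlation between two biallelic loci.

    p_ab is the frequency of the haplotype carrying allele A at the first
    locus and allele B at the second; p_a and p_b are the allele frequencies.
    """
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r2 is undefined when a locus is monomorphic")
    return d * d / denominator


# Hypothetical frequencies: alleles that always travel together versus
# alleles that are statistically independent.
r2_complete = ld_r2(p_ab=0.2, p_a=0.2, p_b=0.2)
r2_independent = ld_r2(p_ab=0.04, p_a=0.2, p_b=0.2)
```

SNPs in partial LD with the sentinel SNP, such as those with r2 = 0.71 and 0.58, are expected to show weaker association signals if the sentinel SNP is the causal variant, which matches the pattern described above.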
We investigated the potential functional consequences of the top
variant in CDHR3 (rs6967330; p.Cys529Tyr) by generating an expression construct encoding tagged human CDHR3 and introducing the
mutation encoding p.Cys529Tyr (A allele at rs6967330 resulting in
mutation of cysteine 529 to tyrosine) by site-directed mutagenesis.
We transfected the constructs for wild-type and mutant CDHR3
into 293T cells. Consistent results from six independent experiments involving flow cytometry (n = 3) (Supplementary Fig. 8) and
immunofluorescence staining (n = 3) (Supplementary Fig. 9) showed
that the wild-type protein was expressed at very low levels at the cell
surface, whereas the Cys529Tyr mutant showed a marked increase in
cell surface expression (Supplementary Note). These results support
the possibility that rs6967330 represents the causal variant at this
locus. A recent study17 reported that a SNP (rs17152490) in high LD
(r2 = 0.69) with our top SNP was associated with lung expression of
CDHR3, further supporting a functional role for this locus.
CDHR3 is a transmembrane protein with six extracellular cadherin
domains. Protein structure modeling showed that the risk-associated
alteration (p.Cys529Tyr) was located at the interface between two
membrane-proximal cadherin domains, D5 and D6 (Fig. 3).
Interestingly, Cys592 and Cys566, which are expected to form a
disulfide bridge within D6, are close to Cys529 in D5, and the short
distance between them could allow disulfide rearrangement (for
the wild-type, non-risk cysteine variant). The location of the variant residue at the domain interface suggests that the variant residue
may interfere with interdomain stabilization, overall protein stability,
folding or conformation, in agreement with the observation in our
experimental studies of altered cell surface expression.
The biological function of CDHR3 is unknown, but it belongs to the
cadherin family of transmembrane proteins involved in homologous
cell adhesion and important for several cellular processes, including
epithelial polarity, cell-cell interaction and differentiation18. Other
members of the cadherin family have been associated with asthma
Figure 2 Cumulative risk of asthma hospitalization (y axis) against age in
years (x axis) during the first 6 years of life stratified on CDHR3
(rs6967330) genotype. Data are
from combined analysis of the COPSAC2000 and MAAS birth cohorts
(replication), including a total of 1,091 children, of whom 92 were
hospitalized for asthma. Genotype distribution was as follows: AA,
30 individuals; AG, 312 individuals; GG, 749 individuals. The P value for
the association between genotype and risk of hospitalization was 0.002
(Cox regression analysis using an additive genetic model).
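The additive genetic model codes each genotype by the number of risk alleles carried. As a small worked example, the sketch below applies this coding to the genotype counts given in the Figure 2 legend and derives the risk allele frequency in the combined replication sample. The variable and function names are hypothetical; the counts are those stated in the legend.

```python
# Genotype counts for rs6967330 as listed in the Figure 2 legend.
genotype_counts = {"AA": 30, "AG": 312, "GG": 749}


def risk_allele_dosage(genotype, risk_allele="A"):
    """Additive coding: number of risk alleles carried (0, 1 or 2)."""
    return genotype.count(risk_allele)


n_children = sum(genotype_counts.values())
n_risk_alleles = sum(
    risk_allele_dosage(genotype) * count
    for genotype, count in genotype_counts.items()
)
# Each child carries two alleles, hence the factor of 2 in the denominator.
risk_allele_freq = n_risk_alleles / (2 * n_children)
```

The resulting frequency of roughly 0.17 is consistent with the description of A as the minor allele, and the dosage values are what enters a regression under the additive model.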
Figure 3 Overview of the CDHR3 protein model.
The model covers cadherin domains 2–6 (D2–D6)
and is based on the structure of the entire
mouse N-cadherin ectodomain (Protein Data
Bank (PDB) 3Q2W; domains 1–5). The location
of the alteration at position 529 is indicated
with a blue star. The distance between residue
529 and the disulfide bridge in D6 (between
residues 566 and 592) is approximately 20 Å.
and related traits, including E-cadherin19 and
protocadherin-1 (ref. 20).
We demonstrated protein expression of
CDHR3 in bronchial epithelium from adults
and in fetal lung tissue (Supplementary
Fig. 10). CDHR3 was previously found to
be highly expressed in normal human lung
tissue21 and specifically in the bronchial epithelium22. CDHR3 (probe 235650_at) was
upregulated by tenfold in differentiating epithelial cells (with a rank
of 123 out of more than 47,000 transcripts ranked by magnitude of
upregulation)23 and seems to be highly expressed in the developing
human lung24.
There is an increasing focus on the role of the airway epithelium
in asthma pathogenesis. Structural or functional abnormalities in the
epithelium may increase susceptibility to environmental stimuli by
exaggerating immune responses and structural changes in underlying tissues and increasing airway reactivity25. Epithelial integrity is
dependent on the interaction of proteins in cell-cell junction complexes, including adhesion molecules. Studies have shown impaired
tight junction function26 and reduced E-cadherin expression27 in the
airway epithelium of individuals with asthma. CDHR3 is a plausible
candidate gene for asthma because of its high level of expression in the
airway epithelium and the known role of cadherins in cell adhesion
and interaction. Most asthma exacerbations in children are caused
by respiratory infections, predominantly common viral infections
such as rhinovirus28, but bacterial infection may also have a role29,
as well as exposure to air pollution30. It is therefore plausible that
CDHR3 variation increases susceptibility to respiratory infections or
other airway irritants through impaired epithelial integrity and/or
disordered repair processes.
Interestingly, the CDHR3 asthma risk allele is the ancestral allele.
Public data from protein databases suggest that humans are unique
among 36 other vertebrate species in having the derived (non-risk)
allele resulting in a cysteine at position 529 (Supplementary Table 9),
which is now the wild-type allele in most human populations (Human
Genome Diversity Project (HGDP) selection browser; see URLs).
This finding suggests that the risk (ancestral) allele, associated with
increased surface expression of CDHR3, may have been advantageous
during early human evolution. This phenomenon in which the ancestral allele is the risk allele is known for other common diseases and
may reflect a shift from a beneficial to a deleterious effect for a particular allele as a result of a changing environment31.
The CDHR3 variant seems to be associated with an asthma phenotype of early onset, as demonstrated by the strongest replication
of association in the GABRIEL stratum with asthma onset before
16 years of age (Supplementary Table 10) and in the second
replication including children with asthma onset before 6 years of age
(Table 1). Increased risk was already demonstrated in the first year
of life (Fig. 2), particularly in children who were homozygous for the
risk allele (A). This finding is in line with the tendency toward association of increased airway reactivity in neonates with the risk allele
and findings of CDHR3 expression in the fetal lung. CDHR3 variation
also seems to be more strongly associated with an asthma phenotype
with exacerbations (Supplementary Table 6), particularly with recurrent exacerbations (Table 2 and Supplementary Table 6).
The top locus in this study, on chromosome 17q12-21, has consistently been associated with childhood-onset asthma11,13. The effect
size in the present study is remarkably high, with an OR of 2.3 that
increases to 2.7 for the children with the highest number of exacerbations. This finding suggests a key role for this locus in severe
exacerbations in early childhood, in line with a previous report from
the COPSAC2000 birth cohort study32.
Genome-wide significant association with asthma has previously
been shown for variants in or near IL33, RAD50-IL13 and IL1RL1
(refs. 11,33). The fact that the top loci in our study were generally
shared with previous GWAS of asthma suggests that early-onset
asthma with severe exacerbations is at least partly driven by multiple
common variants in the same genes that contribute to asthma without
severe exacerbations.
The sample size of the present GWAS was less than one-fifth that
of the largest published GWAS of asthma (GABRIEL)11, and, yet, we
found a similar number of genome-wide significant loci, similar statistical significance and considerably larger effect estimates. Further
increasing phenotypic specificity by stratified analysis in the 358 children with the highest number of exacerbations resulted in an additional
increase in effect estimates, with ORs between 1.6 and 2.7 per risk allele,
and strong statistical significance. Effect estimates were also higher
than previously reported when replicating the exact top SNP from the
GABRIEL study (Supplementary Table 11). This finding demonstrates
that specific phenotyping is a helpful approach in the search for asthma
susceptibility genes. The narrow age criteria (2–6 years) for disease
may be an important phenotypic characteristic, as heritability has been
demonstrated to be higher for early-onset asthma34.
The method of case identification through national registries
allowed us to define a specific and rare phenotype of repeated
acute hospitalizations in young children from 2 to 6 years of age,
which, to our knowledge, has not previously been done in a GWAS.
One limitation of this study is that we had relatively poor genome-wide coverage (approximately 125,000 SNPs).
In conclusion, our results demonstrate the strength of specific
phenotyping in genetic studies of asthma. Future research focusing
on understanding the role of CDHR3 variants in the development of
asthma and severe exacerbations may increase understanding and
improve treatment of this clinically important disease entity.
URLs. HGDP selection browser data for rs6967330, http://hgdp.
Methods and any associated references are available in the online
version of the paper.
Note: Any Supplementary Information and Source Data files are available in the
online version of the paper.
A full list of acknowledgments for each study is given in the Supplementary Note.
K.B. was the main author responsible for designing the study, analyzing and
interpreting data, writing the manuscript and directing the work. He had full
access to the data and final responsibility for the decision to submit the work for
publication. H.B. contributed to design of the study, analysis of data and writing
of the manuscript. P.S. and H.H. contributed to design of the study and analysis of
data in relation to whole-genome genotyping. K.N. performed the GWAS analysis
and contributed to regional imputation. E.K.-M., A. Sevelsted, M.A.R., R.Y. and
R.G. contributed to data analysis. J.M.M., S.B.-G. and D.T. directed and contributed
to regional imputation and data analyses. M.V.H. and D.M.H. were responsible
for subject identification, collection of dried blood spots and DNA extraction and
amplification. K.B., E.K.-M., L.J.M., R.F. and A.M. contributed to data acquisition.
T.B. performed modeling of the CDHR3 protein structure. L.P., C.H. and E.A.N.
were responsible for data from the discovery control cohort. H.H. and M.E.M.
were responsible for the functional studies of the CDHR3 variant involving
flow cytometry. A.H., D.E.S. and D.E.D. were responsible for the experimental
studies involving immunofluorescence staining. A. Simpson, A.C. and D.B. were
responsible for data from the MAAS cohort. H.T.d.D., L.D. and V.W.V.J. were
responsible for data from the Generation R cohort. G.F.-T., P.M.L. and J.W.H. were
responsible for the studies of lung tissue. P.F.T. studied the evolutionary aspects of
the CDHR3 risk variant (rs6967330). All coauthors provided important intellectual
input to the study and approved the final version of the manuscript.
The authors declare competing financial interests: details are available in the online
version of the paper.
Reprints and permissions information is available online at
1. Kocevar, V.S. et al. Variations in pediatric asthma hospitalization rates and costs
between and within Nordic countries. Chest 125, 1680–1684 (2004).
2. Lozano, P., Sullivan, S.D., Smith, D.H. & Weiss, K.B. The economic burden of
asthma in US children: estimates from the National Medical Expenditure Survey.
J. Allergy Clin. Immunol. 104, 957–963 (1999).
3. Matterne, U., Schmitt, J., Diepgen, T.L. & Apfelbacher, C. Children and adolescents’
health-related quality of life in relation to eczema, asthma and hay fever: results
from a population-based cross-sectional study. Qual. Life Res. 20, 1295–1305
4. Smith, D.H. et al. A national estimate of the economic costs of asthma. Am. J.
Respir. Crit. Care Med. 156, 787–793 (1997).
5. Bush, A. Practice imperfect—treatment for wheezing in preschoolers. N. Engl. J.
Med. 360, 409–410 (2009).
6. Duffy, D.L., Martin, N.G., Battistutta, D., Hopper, J.L. & Mathews, J.D. Genetics
of asthma and hay fever in Australian twins. Am. Rev. Respir. Dis. 142, 1351–1358
7. van Beijsterveldt, C.E. & Boomsma, D.I. Genetics of parentally reported asthma,
eczema and rhinitis in 5-yr-old twins. Eur. Respir. J. 29, 516–521 (2007).
8. Ferreira, M.A. et al. Identification of IL6R and chromosome 11q13.5 as risk loci
for asthma. Lancet 378, 1006–1014 (2011).
9. Gudbjartsson, D.F. et al. Sequence variants affecting eosinophil numbers associate
with asthma and myocardial infarction. Nat. Genet. 41, 342–347 (2009).
10. Himes, B.E. et al. Genome-wide association analysis identifies PDE4D as an asthma-susceptibility gene. Am. J. Hum. Genet. 84, 581–593 (2009).
11. Moffatt, M.F. et al. A large-scale, consortium-based genomewide association study
of asthma. N. Engl. J. Med. 363, 1211–1221 (2010).
12. Sleiman, P.M. et al. Variants of DENND1B associated with asthma in children.
N. Engl. J. Med. 362, 36–44 (2010).
13. Torgerson, D.G. et al. Meta-analysis of genome-wide association studies of asthma
in ethnically diverse North American populations. Nat. Genet. 43, 887–892
14. Anderson, G.P. Endotyping asthma: new insights into key pathogenic mechanisms
in a complex, heterogeneous disease. Lancet 372, 1107–1119 (2008).
15. Hollegaard, M.V. et al. Genome-wide scans using archived neonatal dried blood
spot samples. BMC Genomics 10, 297 (2009).
16. Hollegaard, M.V. et al. Robustness of genome-wide scanning using archived dried
blood spot samples as a DNA source. BMC Genet. 12, 58 (2011).
17. Hao, K. et al. Lung eQTLs to help reveal the molecular underpinnings of asthma.
PLoS Genet. 8, e1003029 (2012).
18. Hulpiau, P. & van Roy, F. Molecular evolution of the cadherin superfamily. Int. J.
Biochem. Cell Biol. 41, 349–369 (2009).
19. Nawijn, M.C., Hackett, T.L., Postma, D.S., van Oosterhout, A.J. & Heijink, I.H.
E-cadherin: gatekeeper of airway mucosa and allergic sensitization. Trends Immunol.
32, 248–255 (2011).
20. Koppelman, G.H. et al. Identification of PCDH1 as a novel susceptibility gene for
bronchial hyperresponsiveness. Am. J. Respir. Crit. Care Med. 180, 929–935
21. Yanai, I. et al. Genome-wide midrange transcription profiles reveal expression level
relationships in human tissue specification. Bioinformatics 21, 650–659 (2005).
22. McCall, M.N., Uppal, K., Jaffee, H.A., Zilliox, M.J. & Irizarry, R.A. The Gene
Expression Barcode: leveraging public data repositories to begin cataloging the
human and murine transcriptomes. Nucleic Acids Res. 39, D1011–D1015
23. Ross, A.J., Dailey, L.A., Brighton, L.E. & Devlin, R.B. Transcriptional profiling of
mucociliary differentiation in human airway epithelial cells. Am. J. Respir. Cell Mol.
Biol. 37, 169–185 (2007).
24. Kho, A.T. et al. Transcriptomic analysis of human lung development. Am. J. Respir.
Crit. Care Med. 181, 54–63 (2010).
25. Holgate, S.T. The sentinel role of the airway epithelium in asthma pathogenesis.
Immunol. Rev. 242, 205–219 (2011).
26. Xiao, C. et al. Defective epithelial barrier function in asthma. J. Allergy Clin.
Immunol. 128, 549–556 (2011).
27. de Boer, W.I. et al. Altered expression of epithelial junctional proteins in atopic
asthma: possible role in inflammation. Can. J. Physiol. Pharmacol. 86, 105–112
28. Johnston, S.L. et al. Community study of role of viral infections in exacerbations
of asthma in 9–11 year old children. Br. Med. J. 310, 1225–1229 (1995).
29. Bisgaard, H. et al. Association of bacteria and viruses with wheezy episodes
in young children: prospective birth cohort study. Br. Med. J. 341, c4978
30. Iskandar, A. et al. Coarse and fine particles but not ultrafine particles in urban
air trigger hospital admission for asthma in children. Thorax 67, 252–257
31. Di Rienzo, A. & Hudson, R.R. An evolutionary framework for common diseases: the
ancestral-susceptibility model. Trends Genet. 21, 596–601 (2005).
32. Bisgaard, H. et al. Chromosome 17q21 gene variants are associated with asthma
and exacerbations but not atopy in early childhood. Am. J. Respir. Crit. Care Med.
179, 179–185 (2009).
33. Li, X. et al. Genome-wide association study of asthma identifies RAD50-IL13 and
HLA-DR/DQ regions. J. Allergy Clin. Immunol. 125, 328–335 (2010).
34. Thomsen, S.F., Duffy, D.L., Kyvik, K.O. & Backer, V. Genetic influence on the age
at onset of asthma: a twin study. J. Allergy Clin. Immunol. 126, 626–630
© 2013 Nature America, Inc. All rights reserved.
The individual studies are described in further detail in the Supplementary
COPSAC exacerbation cohort (GWAS). This is a register-based cohort of children with asthma who were identified and characterized from national health
registries. The study was approved by the Ethics Committee for Copenhagen
(H-B-2998-103) and the Danish Data Protection Agency (2008-41-2622).
According to Danish law, research ethics committees can grant exemption
from obtaining informed consent for research projects based on biobank
material under certain circumstances. For this study, such an exemption was
granted (H-B-2998-103).
Case selection. Children with repeated acute hospitalizations (cases)
were identified in the Danish National Patient Register covering all diagnoses of discharges from Danish hospitals35. Information on birth-related
events was obtained from the national birth register. Inclusion criteria were
at least two acute hospitalizations for asthma (ICD-8 code 493, ICD-10
codes J45-46) from 2 to 6 years of age (both years included). Duration of
hospitalization had to be more than 1 d, and two hospitalizations had to
be separated by at least 6 months. Exclusion criteria were side diagnosis
during hospitalization, registered chronic diagnosis considered to affect
risk of hospitalization for asthma, low birth weight (<2.5 kg) or gestational age of under 36 weeks at birth. Cases were further characterized with
respect to the number of hospitalizations from asthma and acute bronchitis
and for concurrent atopy.
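As an illustration only, the inclusion criteria above can be written as a small check on a child's admission records. The record layout (`Admission`), the helper names and the approximation of 6 months as 182 days are assumptions made for this sketch; they are not taken from the register or from the study's own code, and the exclusion criteria are not modelled.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MIN_GAP = timedelta(days=182)  # assumption: "6 months" approximated as 182 days


@dataclass
class Admission:
    """One acute asthma admission (hypothetical record layout)."""
    admitted: date
    discharged: date


def age_in_years(birth: date, on: date) -> int:
    """Completed years of age on a given date."""
    before_birthday = (on.month, on.day) < (birth.month, birth.day)
    return on.year - birth.year - before_birthday


def meets_inclusion(birth: date, admissions: list[Admission]) -> bool:
    """True if at least two qualifying admissions lie MIN_GAP or more apart."""
    qualifying = sorted(
        a.admitted
        for a in admissions
        if 2 <= age_in_years(birth, a.admitted) <= 6   # ages 2 to 6, both included
        and (a.discharged - a.admitted).days > 1       # stay longer than 1 d
    )
    # The earliest and latest qualifying admissions give the widest gap.
    return len(qualifying) >= 2 and qualifying[-1] - qualifying[0] >= MIN_GAP
```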
DNA sampling and genotyping of cases. DNA was obtained from blood spots
sampled as part of the Danish neonatal screening program and stored in the
Danish Neonatal Screening Biobank36. Two disks, each 3.2 mm in diameter,
were punched from each blood spot. DNA was extracted, and the whole
genome for each individual sample was amplified in triplicate as previously
described15,16. Cases were genotyped on the Affymetrix Axiom CEU array
(567,090 SNPs). Top SNPs from the five genome-wide significant loci were
regenotyped with the PCR KASPar genotyping system (KBiosciences) to
validate the results (Supplementary Table 3). Two additional SNPs in the
proximity of the newly discovered CDHR3 variant were genotyped for further
exploration of the region encompassing it.
Controls. The control population was randomly drawn from two large
Danish cohorts: the Danish National Birth Cohort (females) and the
Copenhagen draft board examinations (males). Individuals who indicated
in a questionnaire that they had physician-diagnosed asthma were excluded.
Genome-wide genotyping had previously been performed as part of the
Genomics of Overweight in Young Adults (GOYA) study37 on the Illumina
Human610-Quad v1.0 BeadChip (545,350 SNPs). Potential bias introduced
by differences in chemistry between the different platforms used for cases and
controls (Affymetrix and Illumina, respectively) was investigated by also using
control data from the Wellcome Trust Case Control Consortium 2 (WTCCC2)
project that performed genotyping on an Affymetrix platform (Affymetrix 6.0)
(Supplementary Table 4).
Replication in a previously published GWAS. Replication of the five genome-wide significant loci from the discovery analysis was sought in publicly
available data from a GWAS performed by the GABRIEL Consortium11.
This replication included 19 studies of childhood-onset asthma (onset before
16 years of age) with a total of 6,783 cases and 7,720 controls.
Replication in birth cohorts for the CDHR3 top SNP. The COPSAC2000
replication cohort. Replication and phenotypic characterization of the CDHR3
risk locus were sought in the COPSAC2000 cohort, a prospective clinical study
of a birth cohort of 411 children. This cohort is not overlapping with the
COPSAC exacerbation discovery study. The COPSAC2000 cohort study was
approved by the Ethics Committee for Copenhagen (KF 01-289/96) and the
Danish Data Protection Agency (2008-41-1754), and informed consent was
obtained from both parents of each child. All mothers had a history of a doctor’s diagnosis of asthma after 7 years of age. Newborns were enrolled in the
first month of life, as previously described in detail38–40. This cohort is characterized by deep phenotyping during close clinical follow-up. Doctors employed
in the clinical research unit were acting primary physicians for the children
from the cohort and diagnosed and treated respiratory and skin symptoms,
and asthmatic symptoms were recorded in daily diaries41.
Acute, severe exacerbations from birth to 6 years of age were defined as
requiring the use of oral prednisolone or high-dose inhaled corticosteroid for wheezy symptoms, prescribed at the discretion of the doctor in the
clinical research unit, or by acute hospitalization at a local hospital for such
symptoms32. Asthma from birth to 7 years of age was diagnosed on the basis
of predefined algorithms of symptoms and response to treatment, as previously described40.
Neonatal spirometry and analysis of neonatal bronchial responsiveness to
methacholine were carried out by 4 weeks of age, applying the raised volume, rapid thoracic compression technique. Lung function was measured by
spirometry in the child’s seventh year of life. Specific airway resistance (sRaw)
was measured at 4 and 6 years by whole-body plethysmography. Bronchial
responsiveness at ages 4 and 6 years was determined as the relative change in
sRaw after hyperventilation of cold, dry air.
Allergic sensitization against common inhalant allergens was determined
at 6 years of age by measurement of serum-specific IgE levels. Atopic dermatitis was diagnosed using the Hanifin-Rajka criteria42 from birth to 7 years
of age.
High-throughput genome-wide SNP genotyping was performed using the
Illumina Infinium II HumanHap550 v1, v3 or Quad BeadChip platform at
the Children’s Hospital of Philadelphia’s Center for Applied Genomics. We
excluded SNPs with call rate of <95%, minor allele frequency (MAF) of <1%
or Hardy-Weinberg equilibrium P value of <1 × 10−5. rs6967330 was a genotyped SNP on this array.
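The SNP-level exclusion thresholds quoted above (call rate, MAF and Hardy-Weinberg P value) can be sketched as a filter on genotype counts. The text does not say which Hardy-Weinberg test was used; the 1-d.f. chi-square test below is an assumption chosen for brevity, and the function names are hypothetical.

```python
import math


def hwe_chi2_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """P value of a 1-d.f. chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 d.f.: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_snp_qc(n_aa, n_ab, n_bb, n_missing,
                  min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-5) -> bool:
    """Apply the three SNP-level thresholds quoted in the text."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    if call_rate < min_call_rate or maf < min_maf:
        return False
    return hwe_chi2_p(n_aa, n_ab, n_bb) >= min_hwe_p
```

An exact Hardy-Weinberg test would be preferable for rare variants; the chi-square version is used here only to keep the sketch short.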
MAAS replication cohort. The Manchester Asthma and Allergy Study is
a population-based birth cohort described in detail elsewhere43. Subjects
were recruited prenatally and were followed prospectively. The study was
approved by the local research ethics committee (South Manchester, reference
03/SM/400). Parents gave written informed consent. Participants attended
follow-up at ages 1, 3 and 5 years of age.
For asthma, validated questionnaires were administered by interviewers to
collect information on parentally reported symptoms, physician-diagnosed
asthma and treatments received. ‘Current wheeze and asthma treatment’ was
defined as parentally reported wheeze in the past 12 months. ‘Asthma ever’
was defined as positive if, at any given time point, two of three responses were
positive to the following questions: “Has your child wheezed within the past
12 months?”, “Does your child currently take asthma medication?” or “Has
a doctor ever told you that your child has asthma?” Controls were defined as
children with none of these symptoms.
For exacerbations, a pediatrician extracted data from primary-care medical records, including information on diagnosis with wheeze and/or asthma,
all prescriptions (including inhaled corticosteroids (ICS) and β2 agonists),
unscheduled visits and hospital admissions for asthma and/or wheeze during
the first 8 years of life. Following American Thoracic Society guidelines, we
defined asthma exacerbations by either admission to a hospital or an emergency
department visit and/or by receipt of oral corticosteroids for at least 3 d44.
DNA samples were genotyped on the Illumina Human610-Quad BeadChip.
Genotypes were called using the Illumina GenCall application, following the
manufacturer’s instructions. Quality control criteria for samples included call
rate of greater than 97%, exclusion of samples with outlier autosomal heterozygosity and sex validation. We excluded SNPs with call rate of <95%, Hardy-Weinberg equilibrium P value of >5.9 × 10−7 and MAF of <0.005. We then
performed a look-up for SNP rs6967330, which showed a genotyping success
rate of 100% and a Hardy-Weinberg equilibrium P value of 0.4164.
Generation R replication cohort. The Generation R Study is a population-based prospective cohort study of pregnant women and their children from
fetal life onward in Rotterdam, The Netherlands45. The study protocol
was approved by the Medical Ethical Committee of the Erasmus Medical
Center, Rotterdam (MEC 217.595/2002/20). Written informed consent
was obtained from all mothers and biological fathers or legal guardians. Information on wheezing, asthma and eczema was collected for the
children by questionnaires at the ages of 1 to 4 and 6 years46. Questions
about wheezing included: “Has your child had problems with a wheezing
chest during the last year? (never, 1–3 times, >4 times) (age 1 to 4 years)”
and “Did your child ever suffer from chest wheezing? (never, 1–3 times,
>4 times) (age 6 years).” Questions about asthma included: “Has a doctor
diagnosed your child as having asthma during the past year? (yes, no) (age 2
and 4 years)” and “Was your child ever diagnosed with asthma by a doctor?
(yes, no) (age 3 and 6 years).” On the basis of the last obtained questionnaire,
we grouped children as having ‘asthma ever before 6 years of age’. Reported
asthma at 2, 3 or 4 years of age was used to reclassify children included in
this group where appropriate. We then recategorized children as those with
an asthma diagnosis before 3 years of age and at 3 years of age or older.
Reported numbers of wheezing episodes at 1 and 2 years of age and at 3 to
6 years of age, respectively, were used to reclassify asthma diagnosis before
and at 3 years of age into ‘asthma diagnosis or ≥3 episodes of wheezing before
3 years of age’. Questions about eczema included: “Has a doctor diagnosed
your child as having eczema during the past year? (yes, no) (age 1 to 4 years)”
and “Was your child ever diagnosed with eczema by a doctor? (yes, no)
(6 years).” As with asthma, we grouped children into those with ‘eczema ever
before 6 years of age’ on the basis of the last obtained questionnaire and used
reported eczema at 1 or 4 years of age to reclassify children included in this
group where appropriate.
Samples were genotyped using Illumina Infinium II HumanHap610 Quad
arrays, following standard manufacturer’s protocols. Intensity files were analyzed using BeadStudio Genotyping Module software v.3.2.32, and genotypes
were called using default cluster files. Any sample with a call rate of less than
97.5%, excess autosomal heterozygosity (F < mean – 4 s.d.) or mismatch
between called and phenotypic sex was excluded. rs6967330 was a genotyped
SNP in this set. Individuals identified as genetic outliers by identity-by-state
(IBS) clustering analysis (>3 s.d. away from the mean for the HapMap CEU
population (Utah residents of Northern and Western European ancestry)) were
considered to have non-European ancestry. Ancestry determination analysis
included genomic data from all Generation R individuals merged with data for
three reference panels from Phase 2 of the HapMap Project (YRI (Yoruba from
Ibadan, Nigeria), CHB + JPT (Han Chinese in Beijing, China, and Japanese
in Tokyo, Japan) and CEU). Analysis of association between an asthma or
eczema phenotype and GWAS SNPs was carried out using a regression framework, adjusting for population stratification in the Generation R cohort using
MACH2QTL, as implemented in GRIMP. Ten genomic principal components
obtained after the application of SNP quality exclusion criteria and LD pruning
were used to adjust for population substructure in the combined population,
four principal components were used for the European subpopulation and
eight principal components were used for the non-European subpopulation.
Individuals were grouped as having European (n = 1,962; 64.5%) or non-European (n = 1,078; 35.5%) ancestry on the basis of genetic ancestry. On
the basis of information on the country of birth of parents and grandparents obtained by questionnaires, the largest non-European ancestry groups
included individuals of Turkish (5.4%), Surinamese (4.6%), Dutch Antillean
(4.0%), Moroccan (2.9%) and Cape Verdean (2.3%) origin.
Statistical analyses. Genome-wide association analysis. Quality control was
carried out separately on cases and controls. This included filtering on SNP
call rate (>99%) and sample call rate (>98%) and tests for excess heterozygosity,
deviation from Hardy-Weinberg equilibrium, sex mismatch and familial relatedness. Non-European individuals were excluded on the basis of deviation
from the HapMap CEU reference panel (release 22). Indication of population
stratification or genotyping bias was tested by multidimensional scaling (MDS)
after quality control. This analysis showed evidence of association with disease
status for the first seven MDS components, and these were therefore included
as covariates in the association analysis. Additional analyses including the
first 100 MDS components did not materially alter the results. Merged data
for SNPs present on both arrays after quality control were used for association
testing with PLINK (v. 1.07) using a logistic additive model, adjusting for the
first seven MDS components. Additional quality control was performed for
genome-wide significant SNPs after association analysis, including a test for
genotyping batch effects, resulting in the removal of one genome-wide significant SNP with strong evidence of batch-related genotyping error.
Functional annotation for the SNPs in LD (r2 > 0.5) with the CDHR3
top SNP (rs6967330) was obtained from the RefSeq track downloaded
from the UCSC Genome Browser. SNPs were associated with regulatory
elements by HaploReg47 in terms of predicted ENCODE chromatin state
(promoter and enhancer histone modification signals) and DNase I hypersensitivity (Supplementary Table 8).
Regional imputation was performed to describe the identified loci from the
discovery analysis (Supplementary Fig. 3) as well as reported loci from the
previous largest published GWAS (GABRIEL)11 (Supplementary Table 11).
We used two-step genotype imputation as described48. We used the SHAPEIT
algorithm to prephase the haplotypes49 and then used IMPUTEv2 software
for the imputation of unknown genotypes50 separately in cases and controls.
We used the 1000 Genomes Project reference panel51 (April 2012 version).
We used a strict cutoff (info of 0.88), which, according to our analyses, provides an allelic dosage R2 correlation between real and imputed genotypes of
greater than 0.8 and shows an optimal balance between sufficient accuracy and
power52. We then compared the resulting allelic frequencies using SNPTEST
2.4.1 (ref. 53).
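The relationship between the info cutoff and the allelic dosage R2 mentioned above can be illustrated with a short sketch. It computes the squared Pearson correlation between genotyped allele counts and imputed dosages and applies an info threshold. The function names, the input layout and the choice of "greater than or equal to" at the cutoff are assumptions for illustration, not the study's implementation.

```python
def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between allele counts and imputed dosages."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equally long sequences with at least two samples")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        raise ValueError("no variation in one of the inputs")
    return cov * cov / (var_t * var_d)


def keep_imputed(snps, min_info=0.88):
    """Return the IDs of (snp_id, info) pairs whose info score reaches the cutoff."""
    return [snp_id for snp_id, info in snps if info >= min_info]
```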
CDHR3 protein expression in experimental models. The top SNP at the
CDHR3 locus is a nonsynonymous SNP (encoding p.Cys529Tyr). To determine
the functional consequences of the p.Cys529Tyr variant, we generated expression constructs encoding tagged human CDHR3 protein, and the mutation
encoding the p.Cys529Tyr alteration was introduced by site-directed mutagenesis. Plasmids encoding wild-type or mutant CDHR3 or empty vector were
transfected into 293T cells, and cells were monitored for surface and intracellular expression of CDHR3 by flow cytometry. 293T cells were from the
American Type Culture Collection (ATCC), catalog number CRL-3216. They
were recently tested for mycoplasma contamination but were not authenticated.
For protein blotting, cells expressing CDHR3 proteins were lysed, and whole-cell lysates were separated by SDS-PAGE under reducing or non-reducing
conditions, transferred to PVDF membranes and blotted for Flag (anti-Flag
antibody, clone M2 (Agilent Technologies, 200470-21) at a dilution of 1:2,000).
For immunofluorescence and confocal microscopy, 293T cells were grown on
glass coverslips in DMEM with 3 mM glutamine and 10% heat-inactivated
FBS at 37 °C and 5% CO2 before and for 2 d after transfection with expression constructs for Flag-tagged wild-type CDHR3 and CDHR3 Cys529Tyr
using TransIT 2020 reagent according to a standard protocol (Mirus Bio).
Cells were obtained and used at a low passage from ATCC and had recently
been tested for mycoplasma. Cells were incubated in 10% serum-containing
culture medium plus primary anti-Flag mouse antibodies (F3165, Sigma; 1:300
dilution) for 1 h at 37 °C before being washed briefly with culture medium.
Cells were then stained with secondary rabbit anti-mouse antibodies (F0261,
Dako; 1:600 dilution) conjugated with fluorescein isothiocyanate (FITC)
with incubation at 37 °C for 30 min and washed with culture medium before
PBS. Afterward, cells were fixed in 2% paraformaldehyde for 15 min, washed
with PBS and permeabilized in 0.2% Triton X-100 in PBS for 5 min, washed
and incubated with Cy3-conjugated mouse anti-Flag antibody (Cy3-labeled
F3165, Sigma; 1:300 dilution). Finally, cells were mounted with ProLong Gold
antifade reagent with DAPI (Invitrogen). Images were acquired using a Leica
DMI 6000-B confocal microscope (Leica Microsystems) with 40× magnification and were processed in Photoshop (Adobe Systems). Experiments were
performed in triplicate (independent transfections) for both flow cytometry
and immunofluorescence staining. Data presented (Supplementary Figs. 8
and 9) were chosen as being representative of the repeated experiments.
CDHR3 protein structure modeling. A homology model of CDHR3 domains
2–6 (residues 141–681) was generated using the HHpred server54. The model
was based on the structure of mouse N-cadherin (PDB 3Q2W) domains 1–5.
A disulfide bridge was manually introduced in the final model between the
structurally adjacent residues Cys566 and Cys592, as this corresponds to a
disulfide bridge commonly observed in cadherin domains.
35. Lynge, E., Sandegaard, J.L. & Rebolj, M. The Danish National Patient Register.
Scand. J. Public Health 39, 30–33 (2011).
36. Nørgaard-Pedersen, B. & Hougaard, D.M. Storage policies and use of the Danish
Newborn Screening Biobank. J. Inherit. Metab. Dis. 30, 530–536 (2007).
37. Paternoster, L. et al. Genome-wide population-based association study of extremely
overweight young adults—the GOYA study. PLoS One 6, e24303 (2011).
38. Bisgaard, H. The Copenhagen Prospective Study on Asthma in Childhood (COPSAC):
design, rationale, and baseline data from a longitudinal birth cohort study. Ann.
Allergy Asthma Immunol. 93, 381–389 (2004).
39. Bisgaard, H., Hermansen, M.N., Loland, L., Halkjaer, L.B. & Buchvald, F.
Intermittent inhaled corticosteroids in infants with episodic wheezing. N. Engl. J.
Med. 354, 1998–2005 (2006).
40. Bisgaard, H. et al. Childhood asthma after bacterial colonization of the airway in
neonates. N. Engl. J. Med. 357, 1487–1495 (2007).
41. Bisgaard, H., Pipper, C.B. & Bonnelykke, K. Endotyping early childhood asthma by
quantitative symptom assessment. J. Allergy Clin. Immunol. 127, 1155–1164 (2011).
42. Hanifin, J.M. & Rajka, G. Diagnostic features of atopic dermatitis. Acta Derm.
Venereol. 92, 44–47 (1980).
43. Lowe, L. et al. Specific airway resistance in 3-year-old children: a prospective cohort
study. Lancet 359, 1904–1908 (2002).
44. Reddel, H.K. et al. An official American Thoracic Society/European Respiratory
Society statement: asthma control and exacerbations: standardizing endpoints for
clinical asthma trials and clinical practice. Am. J. Respir. Crit. Care Med. 180,
59–99 (2009).
45. Jaddoe, V.W. et al. The Generation R Study Biobank: a resource for epidemiological
studies in children and their parents. Eur. J. Epidemiol. 22, 917–923
46. Jaddoe, V.W. et al. The Generation R Study: design and cohort update 2012.
Eur. J. Epidemiol. 27, 739–756 (2012).
47. Ward, L.D. & Kellis, M. HaploReg: a resource for exploring chromatin states,
conservation, and regulatory motif alterations within sets of genetically linked
variants. Nucleic Acids Res. 40, D930–D934 (2012).
48. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G.R. Fast and
accurate genotype imputation in genome-wide association studies through prephasing. Nat. Genet. 44, 955–959 (2012).
49. Delaneau, O., Marchini, J. & Zagury, J.F. A linear complexity phasing method for
thousands of genomes. Nat. Methods 9, 179–181 (2012).
50. Howie, B., Marchini, J. & Stephens, M. Genotype imputation with thousands of
genomes. G3 1, 457–470 (2011).
51. Abecasis, G.R. et al. An integrated map of genetic variation from 1,092 human
genomes. Nature 491, 56–65 (2012).
52. Auer, P.L. et al. Imputation of exome sequence variants into population-based
samples and blood-cell-trait-associated loci in African Americans: NHLBI GO Exome
Sequencing Project. Am. J. Hum. Genet. 91, 794–808 (2012).
53. Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies.
Nat. Rev. Genet. 11, 499–511 (2010).
54. Söding, J., Biegert, A. & Lupas, A.N. The HHpred interactive server for protein
homology detection and structure prediction. Nucleic Acids Res. 33, W244–W248
Chapter 3
Candidate gene study of
childhood asthma
This chapter describes the design strategy for a candidate gene based resequencing study of asthma exacerbation cases. The study was designed
to sequence genes associated with asthma and related phenotypes. Sequencing
of the samples was still in progress at the time of submission of this
thesis. The chapter therefore describes the candidate gene design, the
multiplexing strategy, the sample preparation and the capture method.
According to a WHO report, childhood asthma has become a worldwide
epidemic [155]. Childhood asthma ranges from mild to severe, depending on
the number of asthma events and acute asthma attacks. Age at onset of
asthma has a significant effect on prognosis, as early onset
increases the risk of severity and of persistence into later life [178].
Childhood asthma shows high phenotypic heterogeneity: different individuals exhibit different phenotypes, which are also thought to differ in
their causal mechanisms. Exacerbation is one of the severe phenotypes of
asthma. Asthma exacerbation is marked by changes in lung volume and
pleural pressure, which significantly affect cardiopulmonary interactions.
Genetic as well as environmental factors are known to be associated with
childhood asthma. Multiple genes have been identified as
associated with childhood asthma and related phenotypes. The 17q21 locus on
chromosome 17 is strongly associated with asthma and has been found in multiple studies [180, 181]. Similarly, the loci 9p24, 2q12 and 6p21 have appeared
to be robustly associated with asthma across ethnicities. However, different
GWAS detect distinct variants that often fail to replicate in independent
studies, and it is hard to replicate GWAS SNPs with consistent effect size
and direction. Most GWAS are performed on commercially available
genome-wide arrays. These SNP arrays try to maximise coverage of
the genome by distributing the SNP probes evenly across it and
minimising the number of SNPs probed within regions of high LD. Probe
designs differ between array suppliers, so different
variants may be assayed within the same genomic region. The LD patterns in a region
may then result in different polymorphisms being associated with the disease,
although only one of them is the causal variant. Therefore, comparing
results from different studies, even of comparable sample size, is difficult.
GWAS findings pave the way for more detailed candidate gene studies.
Candidate gene studies focus on the plausibility of a gene being involved
in disease pathogenesis. Focusing on genomic regions containing known
disease genes, called candidate genes, assists in detecting the causal
variants. It also helps in finding the functional alterations that lead
to the phenotype. These studies are relatively fast and inexpensive, and require less
DNA and smaller sample sizes. Candidate gene studies augment
array-based GWAS by maximising the variant coverage in these genes.
They are suitable for detecting variation underlying common and
more complex diseases where the effect size is small. The additional information on the variants and their function in the gene helps in discovering
the biological mechanisms leading to the asthma phenotype. To capture the
genomic regions of interest from DNA samples prior to sequencing, target
enrichment is carried out in these studies [182]. Reducing the region to be
sequenced also makes it possible to multiplex samples.
Candidate Gene Selection
Candidate genes were selected by a literature survey of loci strongly
associated with asthma and with asthma risk factors. Sixteen SNPs with corresponding regions were selected (Table 3.1). The selected regions include the
TCR α/δ region on chromosome 14q, which contains V, J and D coding segments. Rearrangements in this region give rise to an α or a δ chain of
the T cell receptor. The TCR α/δ region has been associated with IgE responses
[183] and is known for its variability and tight linkage disequilibrium
[184]. Baits were therefore designed for this region. Two independent
loss-of-function variants in the gene encoding filaggrin (FLG) have been found
to be very strong predisposing factors for atopic dermatitis [185]. Locus 1q31 has
been implicated in asthma susceptibility in North American children of European ancestry and in African-American children. The region has also been
suggested to influence the age of onset of asthma. The implicated region contains
two genes, CRB1 and DENND1B. CRB1 has restricted expression in retina
and brain, whereas DENND1B encodes a protein expressed by
dendritic cells of the immune system and has been associated with susceptibility to asthma [164].
Accordingly, the DENND1B region as well as the SNP rs2786098 were
included in the design. A set of genes, namely IL1RL1/IL18R1, HLA-DQ,
IL33, SMAD3, IL2RB, RORA, SLC22A5 and the ORMDL3/GSDMB locus,
has been associated with asthma in a large-scale cohort study [162]. Additionally, since a region on chromosome 17 has been associated with asthma
and exacerbation in early childhood [180], the full region including GSDMB,
ORMDL3, ERBB2 and GSDMA is part of the sequencing panel. Novel
Moffatt MF, Hum Mol Gen 2000
Rodríguez E, J Allergy Clin Immunol 2009
Sleiman PM, N Engl J Med 2009
Moffatt MF, N Engl J Med 2010
Moffatt MF, N Engl J Med 2010
Moffatt MF, N Engl J Med 2010
Moffatt MF, N Engl J Med 2010
Moffatt MF, N Engl J Med 2010
Moffatt MF, N Engl J Med 2010
Moffatt MF, N Engl J Med 2010
Moffatt MF, N Engl J Med 2010
Ferreira MA, Lancet 2011
Hirota T, Nat Genet 2011; Torgerson DG, Nat Genet 2011
Xingnan Li, J Allergy Clin Immunol 2012
Bønnelykke K, Nat Genet 2013
Bønnelykke K, Nat Genet 2013
Bønnelykke K, Nat Genet 2013
Bønnelykke K, Nat Genet 2013
Table 3.1. List of candidate genes selected for targeted sequencing,
with known asthma-associated SNPs and corresponding publications.
variants in IL6R were identified as associated with asthma, along with the
11q13.5 locus [186]. The variants in IL6R are of great interest as they support the hypothesis that genetic alteration of cytokine signalling increases
asthma risk and can be used as a target for genotype-specific therapeutics [186].
Variants in the thymic stromal lymphopoietin (TSLP) gene are associated with
asthma in ethnically diverse North American cohorts as well as in the
Japanese population. Testing its association with asthma in a Danish
cohort would therefore add to the evidence for this gene and help establish
it as a cross-ethnic asthma gene [187, 188]. Two independent signals in chromosome
11 open reading frame 30 (C11orf30) and leucine rich repeat containing 32
(LRRC32) have been associated with total serum IgE levels, a risk factor
for asthma [189], and the SNP rs7927894, which lies between C11orf30 and
LRRC32, has been reported to be associated with atopic asthma [190]. Thus,
this region is of interest when looking for variants associated with asthma
phenotypes. The SNP rs17616434 lies in a region of the human genome that also
encodes multiple Toll-like receptors (TLRs), which were recently associated with
allergic sensitization [191]; the full region was therefore included in the target sequence design. SNPs in the RAD50-IL13 region as well as in the 3′
untranslated region of HLA-DQB1 have been found to be associated with asthma
[192]. CDHR3 is a novel finding in the Danish cohort; it has been associated with asthma exacerbations in children and has been replicated in
multiple Danish as well as cross-ethnic cohorts [193].
Capture Region
The targets were designed to capture the gene boundaries and the promoter
region (2 kb upstream of the transcription start site) of all genes associated with
the SNPs in the respective studies. To capture the specific SNPs from the
reference studies, ±50 bases around each SNP position (Table 3.1) were also
included in the target. RNA baits to densely capture these regions were designed using
the Agilent custom design service.
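The target definition above can be sketched as interval arithmetic: each gene contributes its body plus a 2 kb upstream promoter window, each reference SNP contributes a window of 50 bases on either side, and overlapping windows are merged before bait design. The coordinate convention (1-based, inclusive), the strand handling and all names are assumptions made for this sketch; the actual bait design was done with the Agilent service.

```python
from typing import NamedTuple

PROMOTER_BP = 2000   # 2 kb upstream of the transcription start site
SNP_FLANK_BP = 50    # bases captured on each side of a reference SNP


class Interval(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


def gene_target(chrom: str, start: int, end: int, strand: str) -> Interval:
    """Gene body plus promoter; upstream is defined relative to the strand."""
    if strand == "+":
        return Interval(chrom, max(1, start - PROMOTER_BP), end)
    return Interval(chrom, start, end + PROMOTER_BP)


def snp_target(chrom: str, pos: int) -> Interval:
    """Window of SNP_FLANK_BP bases on each side of a SNP position."""
    return Interval(chrom, max(1, pos - SNP_FLANK_BP), pos + SNP_FLANK_BP)


def merge(intervals: list[Interval]) -> list[Interval]:
    """Merge overlapping or directly adjacent intervals on each chromosome."""
    merged: list[Interval] = []
    for iv in sorted(intervals):
        if merged and merged[-1].chrom == iv.chrom and iv.start <= merged[-1].end + 1:
            last = merged[-1]
            merged[-1] = Interval(last.chrom, last.start, max(last.end, iv.end))
        else:
            merged.append(iv)
    return merged
```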
Samples and DNA Extraction
A total of 24 samples with the most severe symptoms of exacerbation were
selected from the Danish national birth registry among children with acute hospitalizations
for asthma (ICD-8 code 493, ICD-10 codes J45-46) from 2 to 6 years of age
(both years included). Inclusion required hospitalization for more than one day
and two hospitalizations separated by at least 6 months.
DNA for these samples was obtained from blood spots collected as part
of the Danish neonatal screening program and stored at the Danish Neonatal
Screening Biobank [194]. Genomic DNA was then extracted using the
Extract-N-Amp kit (Sigma-Aldrich). Whole-genome amplification was carried
out in triplicate using the REPLI-g mini kit (Qiagen), and quantification
was performed as described previously [195, 196].
Library Preparation
DNA shearing and library preparation were performed according to the
SureSelect XT Target Enrichment System protocol version 1.6 2013 (Agilent Technologies, Santa Clara, CA, USA) with minor modifications. 200 ng
of whole-genome amplified DNA was sheared on a Covaris E210 System using
a 10% duty cycle, intensity of 5 and 200 cycles per burst for 360 s to create 150 bp fragments. End-repair was then performed (using T4 DNA
polymerase, T4 polynucleotide kinase and Klenow fragment), and
3′ A-overhangs were produced (using Klenow 3′ to 5′ exo minus).
For this study, 25 customised barcodes, each five bases long (Table 3.2),
were designed on the basis of primers from Agilent, NimbleGen and Illumina. The barcodes
were joined to primers ending in a thymidine (T), which is necessary for
ligation to DNA fragments carrying a 3′ adenosine (A) overhang.
High diversity among these barcodes was required for the sequencing machine
to distinguish the samples. The logo diagram shows the percentage of each
base in the forward strand of the barcodes (Figure 3.1). Custom-made
adapters containing the unique barcodes were prepared.
Figure 3.1. Logo block diagram for the frequency of four bases
in different positions in the barcodes used in the study.
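The per-position base percentages that a logo diagram summarises can be computed with a few lines of code. The sketch below is illustrative only: the four barcodes are invented placeholders, not the barcodes of Table 3.2.

```python
from collections import Counter
from typing import Dict, List


def position_frequencies(barcodes: List[str]) -> List[Dict[str, float]]:
    """Percentage of A, C, G and T at each position of equal-length barcodes."""
    length = len(barcodes[0])
    if any(len(b) != length for b in barcodes):
        raise ValueError("all barcodes must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(b[i] for b in barcodes)
        freqs.append({base: 100.0 * counts.get(base, 0) / len(barcodes)
                      for base in "ACGT"})
    return freqs


# Hypothetical five-base barcodes, used only to demonstrate the calculation.
example_barcodes = ["ACGTA", "CGTAC", "GTACG", "TACGT"]
example_freqs = position_frequencies(example_barcodes)
```

A balanced design shows values near 25% for every base at every position; a position dominated by one base indicates low diversity at that sequencing cycle.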
Pooling, Target Enrichment and Sequencing
The complementary oligos (DNA Technology A/S, Risskov, Denmark) were
dissolved in nuclease-free water to a final concentration of 300 µM. Complementary oligonucleotide pairs were mixed in a 1:1 ratio in 1X annealing buffer
(10X buffer contained 100 mM Tris-HCl pH 8.1; 0.5 M NaCl). The barcoded
adapter mix was heated to 90 °C for 2 minutes, then cooled to 30 °C at
a rate of 2 °C per minute, and diluted to a working concentration of 1.5 µM.
The DNA libraries were amplified with an initial denaturation of 30 seconds at
98 °C, followed by 10 cycles of denaturation at 98 °C for 30 seconds, annealing at 65 °C for 30 seconds and extension at 72 °C for 1 minute, according
to the protocol. The final extension was performed at 72 °C for 5 minutes.
DNA quantity and quality were checked on a NanoDrop ND-1000 UV-VIS
Spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and an Agilent 2100 Bioanalyzer using the Bioanalyzer DNA High Sensitivity assay (Agilent
Technologies), respectively (Figure 3.2 [A]). The DNA libraries were mixed
in groups of 25 in equimolar ratios to yield a final concentration of 221 ng/µL
of each pooled library.
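As a quick check of the numbers in the annealing and amplification steps, the arithmetic can be written out explicitly. The helper names are hypothetical and the calculation covers hold times only; instrument ramping between temperatures is ignored.

```python
def ramp_minutes(start_c: float, end_c: float, rate_c_per_min: float) -> float:
    """Time needed to move between two temperatures at a constant rate."""
    return abs(start_c - end_c) / rate_c_per_min


def dilution_factor(stock_um: float, working_um: float) -> float:
    """Fold dilution from stock to working concentration (same units)."""
    return stock_um / working_um


def pcr_hold_seconds(initial_s: int, cycles: int, step_s: tuple, final_s: int) -> int:
    """Total programmed hold time of a PCR run, excluding ramping."""
    return initial_s + cycles * sum(step_s) + final_s


anneal_ramp = ramp_minutes(90, 30, 2)                  # cooling 90 °C to 30 °C at 2 °C/min
adapter_dilution = dilution_factor(300, 1.5)           # 300 µM stock to 1.5 µM working
library_pcr = pcr_hold_seconds(30, 10, (30, 30, 60), 300)
```

With the values from the text, cooling takes 30 minutes, the adapters are diluted 200-fold, and the library amplification has 1530 seconds (25.5 minutes) of programmed hold time.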
The pooled libraries were hybridized with our custom-designed SureSelect
oligo capture library for 16 hours according to the manufacturer’s instructions. After incubation, the selected hybrids were amplified using custom-made primers for paired-end sequencing and Herculase II Fusion DNA Polymerase (Stratagene, Agilent Technologies). The PCR reaction was performed with an initial denaturation
of 2 minutes at 98 °C, followed by 12 cycles of denaturation at 98 °C for
30 seconds, annealing at 57 °C for 30 seconds and extension at 72 °C for 30
seconds. The final extension was performed at 72 °C for 10 minutes. After purification, DNA quantity and quality were checked on a NanoDrop ND-1000
UV-VIS Spectrophotometer and an Agilent 2100 Bioanalyzer using the Bioanalyzer DNA High Sensitivity assay, respectively (Figure 3.2 [B]).
DNA quantity and quality of the libraries were checked again before sequencing,
using different markers to confirm the capture (Figure 3).
Sequencing was performed as a 100 bp paired-end run on a HiSeq (Illumina Inc.,
San Diego, CA, USA) at the BGI facility in Copenhagen following the manufacturer’s recommendations.
Figure 3.2. DNA qualities of the samples as measured by length
post multiplexing and before capturing [A]. DNA qualities of the
samples as measured by length post capturing [B].
Exploration of method
The methods presented in this chapter describe the pilot study of 24 samples, in which the 16 loci with known childhood asthma associations were
resequenced for the second time. During the first resequencing attempt,
a different set of samples was used and a larger panel of genes,
covering 7 Mb of the genome, was designed. This earlier design included
a higher number of genes associated with asthma and related phenotypes,
as well as the interaction partners of these genes. The design was well
suited to a hypothesis-driven candidate gene study, because
pathway-based gene and SNP selection helps in discovering
the underlying biological mechanisms [197]. Studies based on pathways [198] and PPIs [199, 200] have found that genes other than the
central genes can have an effect on the phenotype and can thus also
be therapeutic targets. As we were interested in testing the association of
variations in these interacting genes, they were also included in the design.
Unfortunately, the sequencing of this highly explanatory design failed due
to bad DNA quality: the samples had a very low yield during target capture.
Table 3.2. List of forward-strand barcodes designed for this study.
The design was aimed at maximising the variability as each base on these
The samples used in the second attempt are whole genome amplified
(WGA) and, to increase the yield in the capture, we reduced the capture
size. The capture kit was also upgraded to a newer target capture
technology that can work with lower amounts of DNA (200 ng). We
now aim to identify potentially causal mutations in the proximity of
known GWAS hits. The total region of the design was 2.482 Mbp, which
was captured by 39499 probes with an average coverage of 82.4%. The SNP
regions were 100% covered, while the promoter regions were amongst the least
covered regions. Selective sequencing of WGA DNA has been successful
in discovering the majority of variants and achieves high concordance with the
corresponding arrays [201].
Selective sequencing and cases-only sequencing are useful tools for discovering disease-related variants. They put the focus on rare variants and
help elucidate their effect on the phenotype by avoiding the dilution in effect
size caused by collective testing of cases and controls [202]. Deep resequencing
of the GWAS loci associated with inflammatory bowel disease (IBD) has
previously resulted in functional confirmation of a known susceptibility gene,
nucleotide-binding oligomerization domain-containing protein 2 (NOD2),
as well as the finding of a protective effect of an isoform of caspase recruitment
domain-containing protein 9 (CARD9) [203]. Also, the newly discovered
risk alleles in that study explain more risk variance in the overall population
than the original common variants known from GWAS analysis. Thorough
sequencing of significantly associated regions in GWAS not only expands
the variance explained, but also identifies specific alleles that may be substantially important for understanding the functional role of each gene.
Candidate-gene studies are criticised for their low replication in independent studies and for being “hypothesis-driven” [204]. Lack of replication
of a variation across studies does not necessarily imply non-causality; it
might instead indicate population differences and differences in LD structure [205]. The
strength of candidate gene studies depends on the selection of the target
regions. These fine-mapping studies are based on the prior findings of multiple
studies, and it is an advantage to select candidate genes from loci found in
the same population or in cross-ethnic studies [206] to minimise the risk of
false positives. It is also beneficial to have functional effects for the variations
[207], and thus sometimes only the coding regions of the candidate genes
(exomes) are sequenced. With advancements in the annotation of non-coding
variations, these regions are also gaining importance in disease association
studies. To include regulatory variations in the study, the promoter
regions of the candidate genes as well as the introns were included in the design
[208]. The success of a “hypothesis-driven” candidate gene study depends
on the choice of hypothesis. Integrative systems biology based methods
that use information extracted from public databases, as well as automated data
mining, would support better candidate gene selection for disease and
drug studies [209, 210].
Successful sequencing in this pilot study would eventually lead
to the re-sequencing of samples from the total cohort. We aim to analyse
the sequencing data with state-of-the-art methods and to find the causal
variations as well as variations that could be used to stratify the cases
in the study.
Chapter 4
Paper II - Machine learning
based prediction of childhood
This chapter describes a neural network based discovery tool for selecting the
genetic and clinical features up to the age of two years that can predict asthma
outcome at the age of seven years. Genotyping of the two cohorts used in this
study was done using SNP arrays. Deep phenotyping of the study cohort
COPSAC2000 includes several longitudinal phenotypes such as recurrent
wheeze, eczema, asthma and exacerbation. To supplement the GWAS, where
the association of a single SNP to the phenotype is tested, we tested groups of SNPs.
Since SNPs do not always have additive effects and might have a variety of
interactions amongst themselves, we used the non-linear method of artificial
neural networks to test these associations. To use the information about
the risk factors of asthma available in early stages of life, we included the
clinical information about allergy, eczema, white blood cell (WBC) count,
lung function and presence of bacteria in the hypopharyngeal region as
features to predict asthma later in life. Pregnancy and birth conditions also
play an important role in asthma risk and thus were also used as risk features.
Not all SNPs in the human genome are known to interact with each
other, and SNPs mapping to genes within a pathway have a higher chance
of affecting each other's activity. The SNPs were therefore grouped based on
signalling pathways from the database of cell signaling. We used genotyping
data from the discovery cohort to select pathways with high association to
childhood asthma. The second and more informative cohort, COPSAC2000,
with eighteen clinical features, was used to reduce the SNP features within
these selected pathways.
The study tests the predictive power of SNPs and clinical features individually and then, to find out whether their combination adds any predictive value,
tests combinations of the two types of features. The number of SNPs
genotyped in the two datasets was high and, even after grouping them into
pathways, trying all possible combinations was not feasible in reasonable time,
because their number increases exponentially with every extra feature. A brute force
version of a genetic algorithm was employed so that all possible combinations
of three SNPs were trained and tested for association to the phenotype. The
method and results are described in the attached manuscript.
Ranking genetic and clinical features for prediction of asthma at age 7
Rachita Yadav1, Thomas Nordahl Petersen1, Eskil Kreiner-Møller2, Hans Bisgaard2, Klaus Bønnelykke2 and Ramneek
Center for Biological Sequence Analysis, The Technical University of Denmark, Copenhagen, Denmark
Copenhagen Prospective Studies on Asthma in Childhood, Health Sciences, University of Copenhagen and
Copenhagen University Hospital, Gentofte, Copenhagen, Denmark.
Background Asthma is one of the most common chronic diseases of childhood and the most frequent reason
for paediatric hospitalisation. Several genetic and environmental risk factors are known for childhood asthma.
This study aims to prioritise clinical and genetic features that are predictive of childhood asthma. The goal
is to prioritise a set of genetic and clinical markers that should be replicated in other studies.
Results We present an artificial neural network based approach that uses genetic data in the form of single
nucleotide polymorphisms from genotyping, along with clinical features recorded before the age of 2, to predict asthma
at the age of 7 years. The methodology designed for this prediction performs feature selection on SNP groups
based on biological pathways. The estrogen receptor pathway was shown to be associated with asthma at age 7,
with a Matthews correlation coefficient of 0.71. Other highly ranked pathways are the insulin signaling pathway,
the mitochondrial pathway of apoptosis and the phosphoinositide 3-kinase pathway. Several of the pathways have known
asthma associations; this method allows the prioritisation of the genes within these pathways.
Conclusions The method prioritises 11 pathways carrying predictive value for asthma. Inclusion of 10 selected clinical features out of 18 added further predictive value. This method prioritises
pathways associated with childhood asthma rather than single SNPs, and additionally combines the clinical
and genetic features in the same models. It helps identify pathways and variations
that can be studied in more detail in upcoming asthma studies, with replication and functional studies.
Prognosis of asthma at an early stage of life would help in earlier treatment and management of asthma
in children.
Childhood asthma, artificial neural network, prediction, GWAS, SNPs, pathway based
∗ Corresponding author: Ramneek Gupta, Center for Biological Sequence Analysis, The Technical University of Denmark, Copenhagen,
Denmark. Telephone +45 25252425
Asthma is one of the most common chronic diseases in
childhood. The definition of childhood asthma is a topic of
constant debate. Asthma can be characterised as an inflammatory disease with difficulties in airflow. This is
due to the narrowing of lung airways and hypersensitivity
of the mucous membrane caused by inflammation. The
symptoms of asthma include coughing, wheezing, shortness of breath and tightness of chest. Asthma heritability is estimated to be 70-90% [12, 46]. There are multiple childhood asthma susceptibility loci, which have been
verified in genome-wide association studies (GWAS) [14,
28, 29, 41]. These loci have variable effect sizes and, since
asthma is a heterogeneous disease, it is still hard to explain
asthma based on hereditary features alone. Along with genetics, there are racial, social and environmental risks involved in childhood asthma susceptibility [49].
It has been found that 30% of children with preschool wheezing
develop asthma at later stages [33]. Multiple phenotypes within asthma have been observed, depending on
the symptoms, the duration of events and the age of the
child. Asthma and related phenotypes share overlapping symptoms, so the difficult part is to distinguish between these endo-phenotypes. Multiple clinical and environmental features,
independently as well as in combination, have been found to be
involved in the cause of childhood asthma. Neonates colonized in the hypopharyngeal region with S. pneumoniae,
H. influenzae, or M. catarrhalis, or with a combination
of these organisms, have an increased risk of recurrent
wheeze and asthma early in life [4]. Atopic sensitization plays a major role in the development of asthma. A
study testing asthma and allergy at age 8 found that children born by caesarean section are more prone to asthma
and allergy [24]. Also, a modest association has been found
between very low birth weight and asthma [8]. When the influence of maternal and paternal asthma and
atopy on children’s asthma at the age of 7 years was analysed separately,
persistently sensitized children with asthmatic mothers were
at 10 times higher risk of having current asthma at the
age of 7 years [20]. Children born to asthmatic mothers are at high risk of developing asthma, but not all
children born to asthmatic women become
asthmatic. Smoking by mothers during pregnancy puts
the children in a high asthma risk group, and only a small
fraction of the effect seems to be mediated through fetal growth [22]. Use of antibiotics during pregnancy, particularly in the third trimester, also increases the risk of
asthma in the child. Studies have been designed
to uncover gene-gene and gene-environment interactions
that may occur between different pathophysiological pathways in asthma, which led to the discovery of genes related to home dampness, an environmental risk factor for
asthma [42].
Figure 1: The overall method for integrating heterogeneous features for prediction of asthma, using artificial neural network based
feature selection and prediction.
Various asthma susceptibility studies have identified several genetic and environmental factors implicated in
asthma pathogenesis. Several prognosis tools have been
designed for childhood asthma, such as the Asthma Predictive Index (API) [9], the modified API [17], the cumulative risk
score of the Isle of Wight birth cohort [26], the severity score for obstructive airway disease [10], an extension
of the severity score [27] and the PRIMA score [18]. These
tools use clinical features like allergies, bronchial obstruction
and lung function, but none of the tools use the genetic
constitution of the individuals or a combination of
genetic and clinical features.
This study aims to select a set of clinical features together with
genetic markers to predict asthma outcome at age 7 years
(Figure 1). Pathway based methods complement conventional association analysis and offer additional insight. Several methods have been developed to study pathway-mediated
effects in GWAS. The most widely used of these
have been described in the review by Wang et al. 2010
[48]. To take biological mechanisms into account, the
available features were grouped by biological
pathway. Pathway selection followed by feature reduction was then performed using machine learning methods to discover the most predictive features and to measure their
predictive performance.
Material and methods
Discovery cohort
The discovery cohort includes 2,029 individuals selected
from the Danish national birth registry with acute hospitalizations for asthma (ICD-8 code 493, ICD-10 codes
J45-46) from 2 to 6 years of age (both years included).
Details of the selection of individuals and approvals for
the study have been described previously [7]. DNA samples for
these cases were obtained from the blood spots stored in
the Danish Newborn Screening Biobank as a part of the
neonatal screening program [32]. The cases were genotyped on the Affymetrix Axiom CEU array (567,090 SNPs).
The controls are a combined set of two population-based
cohorts, the Danish National Birth Cohort (females) and
the Copenhagen draft board examinations (males).
Individuals answering negatively to the questionnaire item on
physician-diagnosed asthma were
included as controls in the study. These individuals were
previously genotyped on the Illumina Human610-Quad v1.0
BeadChip array (545,350 SNPs) as part of the Genetics of
Overweight Young Adults (GOYA) study [31].
Quality control measures were applied separately to the cases
and controls included in the discovery cohort. A sample
call rate of >97.5% was used as the inclusion criterion, along
with removal of individuals with excess heterozygosity,
gender mismatch or familial relatedness. An ethnicity check
was performed using the HapMapII CEU reference panel and
non-Danish samples were removed. A SNP call rate of 100%
was required, as the applied machine learning method cannot
deal with missing data. After quality filtering, the discovery cohort consisted of 1,173 asthma cases and 2,522
controls. The overlapping SNPs between cases and controls were selected based on SNP position and
mapped to Ensembl [15] genes using the Ensembl API version 62 with the nearest-gene function. SNPs not mapping
to any gene were excluded from further steps. There were
124,514 SNPs present on both the case and control genotyping arrays. Of these, 92,012 SNPs could be mapped to a
total of 13,737 genes.
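The requirement of a 100% SNP call rate follows directly from how genotypes are presented to the network: the three-bit encoding this paper describes for its ANN inputs (100, 010 and 001) has no pattern for a missing call. A minimal sketch, assuming genotypes are stored as counts of the non-reference allele (0, 1 or 2) with None for a missing call; the data layout, SNP identifiers and function names are hypothetical.

```python
from typing import Dict, List, Optional

# Three binary inputs per SNP: homozygous reference, heterozygous,
# homozygous non-reference. There is deliberately no entry for None.
ENCODING = {0: (1, 0, 0), 1: (0, 1, 0), 2: (0, 0, 1)}

Genotypes = Dict[str, List[Optional[int]]]  # SNP id -> one call per individual


def complete_snps(genotypes: Genotypes) -> Genotypes:
    """Keep only SNPs called in every individual (100% call rate)."""
    return {snp: calls for snp, calls in genotypes.items()
            if all(c is not None for c in calls)}


def encode_individual(genotypes: Genotypes, snp_order: List[str],
                      index: int) -> List[int]:
    """Concatenate the three-bit codes of one individual over the chosen SNPs."""
    vector: List[int] = []
    for snp in snp_order:
        vector.extend(ENCODING[genotypes[snp][index]])
    return vector


# Hypothetical genotype calls for three individuals.
raw = {"snp_a": [0, 1, 2], "snp_b": [0, None, 1], "snp_c": [2, 2, 0]}
kept = complete_snps(raw)
second_individual = encode_individual(kept, ["snp_a", "snp_c"], 1)
```

Dropping incomplete SNPs is the simplest way to satisfy the encoding; imputation would be an alternative, but it is not what the text describes.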
COPSAC2000 cohort
The COPSAC2000 cohort consists of children born between 1998 and 2001 to mothers with a history of asthma
diagnosed after 7 years of age. Newborns were enrolled
in the study during the first month after birth, and the cohort is characterized by deep phenotyping during close
clinical follow-up [3–5]. Doctors employed in the clinical research unit acted as primary physicians for the
children in the cohort. The diagnosed and treated respiratory and skin symptoms, and asthmatic symptoms, were
recorded in daily diaries [6]. Predefined algorithms for
symptoms and responses were used to diagnose asthma
from birth to 7 years of age, as previously described [4].
The COPSAC2000 cohort study was approved by the Ethics
Committee for Copenhagen (KF 01-289/96) and the Danish Data Protection Agency (2008-41-1754), and informed
consent was obtained from both parents of each child.
Figure 2: Flowchart for the datasets used and methods applied.
GF= genetic features i.e. SNPs and CF= Clinical features. The
flowchart shows the flow of pathways selection using PKU cohort and feature selection using the COPSAC2000 cohort with
evaluation of the two sets of selected features (only SNPs and
SNPs + CFs) for 11 top pathways on the COPSAC2000 cohort.
Clinical data
Pregnancy conditions such as smoking history and antibiotics usage, as well as type of birth, birth weight and weight 2
weeks after birth, were recorded for all participants enrolled in the COPSAC2000 cohort. The presence of any
microbial growth in the airway was checked on growth media [3]. Lung function for all individuals was measured
at the age of 1 month using the raised volume rapid thoracoabdominal compression technique [45]. Allergic sensitization against common inhalant and food allergens was
determined at the age of six and eighteen months by the skin
prick test, measuring ring diameter [3]. Atopic dermatitis
was diagnosed using the Hanifin-Rajka criteria [19] from
birth to 7 years of age. Two mutations detected outside the
genotyping array, the ORMDL3 and filaggrin mutations,
were included in the clinical data. Accordingly, the 18
clinical features used for the COPSAC2000 cohort were
neonatal lung functions (FEF50 and PD15), airway bacteria presence in the 1st month, allergy at 6 months, allergy at
18 months, birth type (natural or C-section), eczema
in the 1st year of age, WBC counts at 6 months, WBC counts
at 18 months, weight at birth, weight at two weeks of age,
exacerbation, wheeze and asthma before 2 years, antibiotics intake by the mother in the third trimester, smoking history of the mother, ORMDL3 mutation and filaggrin mutations. WBC counts and lung functions were converted to
z-scores and the remaining features were binary encoded.
Genotyping data
For the COPSAC2000 cohort, genome-wide SNP genotyping was performed using the Illumina Infinium
II HumanHap550 v1, v3 or Quad BeadChip SNP array platform at
the Children’s Hospital of Philadelphia’s Center for Applied Genomics. SNPs with a minor allele frequency (MAF)
of <1% or a Hardy-Weinberg equilibrium p-value of < 10e-5
were excluded from the analysis. SNPs passing the filtering criteria were mapped to Ensembl genes using the
Ensembl API version 62 with the nearest-gene function.
SNPs that did not map to any gene or had missing genotype values were excluded from further steps. Two hundred thirty-six participants from the COPSAC2000 cohort had the complete set of 18 clinical features and 411,534
SNPs. Out of these 411,534 complete SNPs, 271,706
SNPs mapped to 20,026 genes in the COPSAC2000 cohort.
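The two SNP filters named above can be illustrated with a short calculation from genotype counts. This sketch makes several assumptions that are not stated in the text: it uses the one-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium (tools such as PLINK typically use an exact test instead), it reads the p-value cut-off as 1e-5, and all function names are hypothetical.

```python
import math


def allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Frequency of allele A from counts of the AA, AB and BB genotypes."""
    n = n_aa + n_ab + n_bb
    return (2 * n_aa + n_ab) / (2 * n)


def minor_allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    p = allele_frequency(n_aa, n_ab, n_bb)
    return min(p, 1 - p)


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) p-value for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = allele_frequency(n_aa, n_ab, n_bb)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: the test is not informative
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_filters(counts, min_maf: float = 0.01, min_hwe_p: float = 1e-5) -> bool:
    """True if a SNP survives both the MAF and the HWE filter."""
    return (minor_allele_frequency(*counts) >= min_maf
            and hwe_chi2_pvalue(*counts) >= min_hwe_p)
```

For example, counts of (25, 50, 25) match Hardy-Weinberg proportions exactly and pass, whereas (50, 0, 50) has no heterozygotes at an allele frequency of 0.5 and is rejected.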
Because a high number of SNPs were genotyped in the two datasets,
it was not feasible to try all SNP combinations in the
feature selection, even after grouping them into pathways. Thus, a combinatorial approach was designed to
train and test each pathway using 3-fold cross-validated
ANNs in combinations of up to three features. 59 different
pathway sets were created from the 92,012 SNPs in the discovery
cohort data. Each pathway was trained and tested independently to rank SNPs and SNP combinations based on
the average 3-fold cross-validated test Matthews correlation coefficient (MCC). The combinations with the best
predictive values were selected if their MCCs differed
by less than 0.1 from the best MCC (Figure
2). The top combination was used as a seed, and for each
SNP from the list in descending order a new ANN was
trained and tested. The average MCC was calculated for this new combination
using 3-fold cross-validation. If the SNP increased the
average MCC, it was added to the combination; otherwise
the next SNP on the list was tested (Figure 2). The pathways were ranked by the MCCs of their respective best
feature combinations.
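One reading of this combinatorial procedure can be sketched as follows. The sketch is not the authors' implementation: the `score` callable stands in for training a 3-fold cross-validated ANN and returning its average test MCC, the candidate list is taken to be the SNPs appearing in the near-best combinations in ranked order, and the toy scores are invented so that the example runs without any machine learning library.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Sequence, Tuple


def select_features(features: Sequence[str],
                    score: Callable[[FrozenSet[str]], float],
                    max_size: int = 3,
                    margin: float = 0.1) -> Tuple[List[str], float]:
    """Exhaustive search up to max_size, then greedy extension of the best seed."""
    # Score every singleton, pair and triple, and rank by score.
    ranked = sorted(
        ((score(frozenset(combo)), combo)
         for k in range(1, max_size + 1)
         for combo in combinations(features, k)),
        key=lambda item: item[0], reverse=True)
    best_score, seed = ranked[0]
    # Combinations scoring within `margin` of the best supply the candidates.
    near_best = [combo for s, combo in ranked if s > best_score - margin]

    selected = list(seed)
    current = best_score
    for combo in near_best:
        for feature in combo:
            if feature in selected:
                continue
            trial = score(frozenset(selected + [feature]))
            if trial > current:  # keep the feature only if the score improves
                selected.append(feature)
                current = trial
    return selected, current


# Invented net contributions (in hundredths) for a toy, deterministic scorer.
TOY_NET = {"s1": 26, "s2": 21, "s3": 14, "s4": 9, "s5": -5}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(TOY_NET[f] for f in subset) / 100


chosen, chosen_score = select_features(list(TOY_NET), toy_score)
```

With the toy scorer the best triple is extended by a fourth feature from a near-best triple, while the feature that lowers the score never enters. The exhaustive stage grows combinatorially with the number of SNPs, which is why it is capped at three features per combination.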
The top pathways with MCC >=0.3 were selected from the
discovery cohort. These were used for feature reduction,
and for training and testing using ANNs on the COPSAC2000
cohort data, both with and without the clinical features. In the sets comprising only genetic features, all
combinations of up to three features (singletons, pairs and
triples) were exhausted to rank the feature combinations. The combinations were tested and selected using the combinatorial approach described in the previous section. All possible combinations of the available
18 clinical features were tested to select the set of best discriminating features, based on 3-fold cross-validated
ANNs and the average MCC. To find the genetic features that add power to the selected clinical features within the top
pathways, the genetic features were tested in combination
with the selected clinical features. The set of selected clinical
features was used as a constant set to train all possible
combinations of the genetic features, singly
and in pairs, with 3-fold cross-validated ANNs and the highest
average MCC as the selection criterion.
The total COPSAC2000 dataset of 236 individuals
was divided into a training-test set of 200 and a small evaluation
set of 36, with a balanced case-control division. A
4-fold cross-validation was performed on the selected clinical
features to test their power independently. The
selected combinations of genetic features in the top pathways,
with and without the clinical features, were used to
train 4-fold cross-validated ANNs on the training
data set of 200 individuals. These trained ANNs were
evaluated for their predictive power on the 36 individuals using MCC and AUC as the measures. The arithmetic
mean of the four trained networks was used to evaluate
the evaluation set of 36 individuals for each pathway. The
GWAS to test the association of single SNPs of interest in
the dataset was carried out using PLINK [34].
Artificial neural network
A dataset of 3000 individuals, both cases and controls,
from the discovery cohort, with genotyping data for 124,514
SNPs, was used for feature selection and ranking of pathways
using a machine learning algorithm. Feed-forward,
fully connected artificial neural networks (ANNs) with
a single hidden layer and a standard back-propagation
procedure [37] were used for pathway selection and to assess
the predictive performance of the included genetic
features. Pathway sets of the SNPs were created based on
the pathway definitions from the database of cell signaling. Each SNP was encoded
by three binary input neurons: 100 for homozygous
reference, 010 for heterozygous and 001 for homozygous
non-reference. The pathways solely discovered in non-mammalian
species were excluded from the analysis. The
gene or gene set defined as a token by the database of
cell signaling was manually mapped to human homologues
using HGNC nomenclature for ease of SNP mapping
with the Ensembl API. This resulted in 59 mammalian pathways
used for the selection of features.
Feature selection
SNPs from the genotyping of the discovery cohort were grouped
into 59 pathways from the database of cell signaling using the
genes they were mapped to by Ensembl. When trained
and tested with 3-fold cross-validation, the 59 tested pathways yielded eleven top pathways with MCC >=0.3
(Table 1). These top pathways were reduced in size by 90% on
average, as measured by the number of features selected for the best-performing combination.
Feature reduction
The SNPs from the COPSAC2000 cohort genotyping data were
mapped to the genes from the top 11 pathways with the best
performance in the PKU cohort. When used for
feature selection, these sets led to an average 97% reduction in the size
of the pathway, resulting in the best performing SNPs. The feature selection performance of these 11 pathways showed
an increase in MCC in the COPSAC2000 cohort as compared to the PKU cohort (Table 2).
Figure 3: Schematic diagram of the combinatorial approach. Flowchart steps: features grouped in pathways; train and test
(in 1’s, 2’s and 3’s); ranked feature combinations; top combinations if MCC > (MCCBEST – 0.1); add a new feature
to the top combination if MCC increases, otherwise reject the new feature; best feature combination for pathway.
Clinical feature selection
Out of the 18 clinical features for COPSAC2000, 10
features were selected, from all possible combinations, as having the best discriminative value
for cases and controls, with an MCC of 0.6418. They were allergy at 6 months of
age, allergy at 18 months, birth type (natural or C-section), eczema in the 1st year, exacerbation before 2 years,
filaggrin mutation, WBC counts at 18 months of age, WBC
counts at 6 months, weight at birth, weight at 2 weeks of
age and wheeze before 2 years. When these selected clinical features were combined with the genetic factors,
grouped by the boundaries of the top 11 pathways selected from the PKU cohort, performance increased with fewer
SNPs per pathway (Table 3).
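An exhaustive search of the kind used for the clinical features is practical only because there are few of them: 18 features give 2^18 - 1 = 262,143 non-empty subsets, each of which can be scored once. The sketch below enumerates the subsets with the standard library; the `score` argument is a hypothetical stand-in for training and cross-validating a network on that subset.

```python
from itertools import combinations
from typing import Callable, Iterator, Sequence, Tuple


def all_subsets(features: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty subset of the features, smallest first."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)


def best_subset(features: Sequence[str],
                score: Callable[[Tuple[str, ...]], float]) -> Tuple[str, ...]:
    """Return the subset with the highest score (first one wins on ties)."""
    return max(all_subsets(features), key=score)


# Hypothetical feature labels; only their number matters here.
clinical = [f"cf{i}" for i in range(1, 19)]
n_subsets = sum(1 for _ in all_subsets(clinical))
```

The same enumeration over hundreds of SNPs per pathway would be out of reach, which is why the genetic features are limited to small combinations.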
Training and evaluation
The dataset of COPSAC2000 cohort was divided into 200
train-test set and 36 evaluation set. The networks were
trained on the 200 data points with 4 fold-cross validation using the maximum MCC as the selection criteria for
the best performing network. The selected 10 clinical features with 4 fold cross validation gave MCC= 0.62 using
the evaluation set of 36 individuals. The MCC for the
seven pathway sets, estrogen receptor pathway, differentiation pathway in PC12 cells, insulin signaling pathway, B
cell antigen receptor, mitochondrial pathway of apoptosis
(caspases), PI3K pathway, FAS signaling pathway were
higher as compared to clinical features used alone (Table 4). The test correlation coefficient for three pathways,
FAS signaling pathway, IL-1 pathway, B cell antigen receptor is less than the MCC obtained from clinical features alone and thus cannot be compared with the other
pathways. When prediction accuracy is checked for the
individual pathways, PI3K pathway is more accurate in
Pathway Name
Pathway Name
FAS signaling pathway
Interleukin 1 (IL-1) pathway
Insulin signaling pathway
Differentiation pathway in pc12 cells
Mitochondrial pathway of apoptosis BH3-only Bcl-2 family
G alpha 12 pathway
B cell antigen receptor pathway
Estrogen receptor pathway
PI3K pathway
PI3K class IB pathway in neutrophils
Mitochondrial pathway of apoptosis (Caspases)
Total SNPs
Total SNPs
Test MCC
Test MCC
No of selected SNPs
No of selected SNPs
Table 1: Selected pathways from PKU cohort using genetic data. The Table documents the total number of SNPs used
for feature selection, MCC of feature selection and number of features selected.
Pathway Name
Mitochondrial Pathway of Apoptosis (BH3-only Bcl-2 Family)
Mitochondrial pathway of apoptosis (BH3-only Bcl-2 Family)
PI3K class IB pathway in neutrophils
B cell antigen receptor pathway
Differentiation pathway in PC12 cells
Insulin signaling pathway
IL-1 pathway
Estrogen receptor pathway
PI3K pathway
FAS signaling pathway
Mitochondrial pathway of apoptosis (Caspases)
G alpha 12 pathway
Total SNPs
Test MCC
No of selected SNPs
Table 2: Feature reduction for the top 11 pathways in COPSAC2000 cohort using genetic data. The Table documents
the total number of SNPs used for feature selection, MCC for feature selection and number of features selected.
predicting true positive along G alpha 12 pathway, mitochondrial pathway of apoptosis (caspases) and B cell antigen receptor pathway. The overall performance of B cell
antigen receptor pathway is the worst amongst the tested
pathways, as the features of that pathway do not describe
the negatives precisely. Similarly, the FAS signaling pathway
and the PI3K class IB pathway in neutrophils are better at assigning the non-asthmatic class to controls. The estrogen receptor
pathway emerges as the pathway with the best predictive value,
as it is good at both assigning the asthmatic class to cases
and the non-asthmatic class to controls.
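The comparison above rests on counts of true and false predictions per pathway, summarised by the MCC. As a minimal sketch of how these quantities relate, the standard definitions can be written as follows. The function names and the example counts are illustrative assumptions and are not taken from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def sensitivity(tp, fn):
    """Fraction of cases (asthmatic) that are predicted as cases."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of controls (non-asthmatic) that are predicted as controls."""
    return tn / (tn + fp) if (tn + fp) else 0.0


if __name__ == "__main__":
    # Hypothetical counts for one pathway; not values from Table 4 or Figure 4.
    tp, tn, fp, fn = 12, 30, 5, 8
    print("MCC:", round(mcc(tp, tn, fp, fn), 3))
    print("sensitivity:", sensitivity(tp, fn))
    print("specificity:", round(specificity(tn, fp), 3))
```

A pathway that is good at assigning cases but poor at assigning controls would show high sensitivity and low specificity, while the MCC penalises either kind of imbalance.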
Genome-wide association analyses relating genetic
variation to various asthma phenotypes in the discovery cohort and the COPSAC2000 cohort have been carried
out as part of different studies. The GWAS results from
the discovery cohort showed associations of single SNPs with
asthma exacerbation, replicating the previously known loci
IL-33, RAD50/IL13, HLA-DQ and IL1RL1 and also discovering a new asthma-associated gene, CDHR3 [7]. The
GWAS analysis of the 411 children in the COPSAC2000 cohort
was recently published, where variation in PCDH1 was
shown to increase the risk of early asthma as well as atopic
dermatitis in early childhood [30].
In this study, we search for combinations of genetic and clinical
features that are associated with the development of
asthma before the age of 7 years. As all individuals in COPSAC2000 are born to asthmatic mothers, it might be speculated that all of them will develop asthma or related phenotypes, but that is not the case in our cohort. Other
studies have also observed that a family history
of asthma is not a strong predictor of asthma outcome in
children, whereas its absence better predicts that the child
will not develop asthma [25].
Pathway Name | Total SNPs | Test MCC | No of selected SNPs
Mitochondrial pathway of apoptosis (BH3-only Bcl-2 Family) | 10 CF + 511 SNPs | | Selected 10 CFs,
Mitochondrial pathway of apoptosis (Caspases) | 10 CF + 579 SNPs | | Selected 10 CFs,
Insulin signaling pathway | 10 CF + 288 SNPs | | Selected 10 CFs,
PI3K class IB pathway in neutrophils | 10 CF + 525 SNPs | | Selected 10 CFs,
PI3K pathway | 10 CF + 553 SNPs | | Selected 10 CFs,
FAS signaling pathway | 10 CF + 451 SNPs | | Selected 10 CFs,
G alpha 12 pathway | 10 CF + 236 SNPs | | Selected 10 CFs,
Estrogen receptor pathway | 10 CF + 713 SNPs | | Selected 10 CFs,
IL-1 pathway | 10 CF + 664 SNPs | | Selected 10 CFs,
Differentiation pathway in PC12 | 10 CF + 312 SNPs | | Selected 10 CFs,
B cell antigen receptor pathway | 10 CF + 546 SNPs | | Selected 10 CFs,
Table 3: Feature selection in COPSAC2000 cohort using 10 selected clinical features (CF) and genetic data from the top
11 pathways. The table documents the total number of SNPs used for feature selection, the MCC of feature selection and
the selected features.
A longitudinal study investigating a multilocus profile of genetic risk for asthma in a cohort with a family history of asthma found that multiple
GWAS discoveries are associated with childhood onset of
asthma [2]. Thus, we designed a methodology that uses combinations of multiple loci and tests them for association
with the trait. Environmental factors cannot be ignored when designing studies of complex diseases.
A multi-ethnic study found that caregiver reports of
physician diagnosis of asthma (CRPDA), when augmented
with an assessment of bronchial hyperresponsiveness (BHR),
result in precise identification of children with asthma
at age 7 [43]. Similarly, we found that no single clinical feature has better predictive power than
the combination of 10 clinical features. The article by Bisgaard
et al. [6] describes the details of clinical data collection, the phenotyping of children and the long follow-up, which give us
the high-quality clinical data used as input in this study.
The combination of genetic and clinical data has more
predictive value than either data type alone, as shown by
the increase in MCC (Table 3), which supports the idea of
gene-environment interaction.
Multifactor dimensionality reduction analysis suggests that gene-gene interactions may occur between different physiological pathways, as well as between genes and environmental
factors such as dampness, in childhood asthma [1]. When testing
at multiple loci, defining and evaluating the genomic
risk profile is valuable. A biological pathway-based
combination of SNPs gives clues to the biological mechanisms involved in the pathophysiology of the disease.
This study does not include environmental factors such as surroundings, climate, and living and economic conditions, which might add more power to the
tool. The risk of fracture is determined by genetic as well
as non-genetic factors, and a combined model of genes and
clinical features is 45% more accurate than the genetic model,
with 41% specificity [44]. In ambiguous situations, such as the primary
evaluation of head trauma patients based on clinical data,
an ANN outperformed logistic regression [13]. Similarly, in this study, ANN-based prediction using clinical
risk features and genetic data outperforms
the single-data-type predictors.
The ANN models in our study use only one continuous
variable (WBC count) at the end, and the rest of the features
are binary variables. Binary data are known to be less
powerful than continuous data for detecting associations between a feature and an outcome, but binary variables have
previously proved able to represent genetic data used
to ascertain an a priori score for predisposition to coronary infarct [51]. Not all SNPs in the dataset were used,
as the current architecture was unable to handle missing data.
Pathway Name | Features for selection | Test MCC | Selected features
Estrogen receptor pathway
Differentiation pathway in PC12 cells
Insulin signaling pathway
Mitochondrial pathway of apoptosis (Caspases)
PI3K pathway
Mitochondrial pathway of apoptosis (BH3-only Bcl-2 Family)
PI3K class IB pathway in neutrophils
G alpha 12 pathway
FAS signaling pathway
IL-1 pathway
B cell antigen receptor pathway
Table 4: Evaluation results for pathways. The 11 selected pathways with the features used for training the 4-fold cross-validated networks, and the evaluation MCC and AUC on the evaluation set.
Figure 4: The accuracy of different pathways in correctly predicting asthmatic and non-asthmatic individuals (bars for the 11 pathways; axis ticks 0 to 35; legend: false predictions, true negatives, true positives). True positives are the
individuals with asthma at age 7 who are predicted asthmatic, while true negatives are controls predicted as non-asthmatic. False predictions are
the count of mis-predictions, i.e. asthmatics predicted as non-asthmatic and non-asthmatics predicted as asthmatic.
Since genotyping is never 100% complete for all samples, using genotype probabilities can overcome
this problem and would increase the coverage of the SNPs
and pathways tested by the method. The top
11 pathways found to be associated with asthma include
pathways that, either as a whole or through their gene components, have previously been associated with asthma. Variation in estrogen receptors, the key molecules of the best-performing
pathway, has been associated with different asthma-like
phenotypes [11], and reduced ER-alpha has been
reported in the mitochondria of fatal asthma cases. This
indicates a function of ER-alpha during inflammation
of the airways and a crucial role in the pathophysiology
of asthma [39]. The increasing links between asthma, obesity
and diabetes are not only due to mechanical pulmonary
disadvantage; there are also molecular connections between these phenotypes. Multiple studies suggest that insulin
affects the lungs and airway smooth muscle, and that insulin
signalling acts through the PI3K/Akt pathway [40]. PI3K
is an intracellular signalling pathway that is important
in apoptosis. A SNP common to the two pathways, rs7566856, maps to the inositol polyphosphate-5-phosphatase, 145 kDa (INPP5D/SHIP-1) gene, which
acts as a positive regulator in Th2 cells in the adaptive immune
response to aeroallergen [36]. It has been found that
inhibitors targeting PI3K isoforms can serve as therapeutic agents for the treatment of asthma and chronic obstructive
pulmonary disease [21]. This study finds two phosphoinositide 3-kinase (PI3K) pathways to be associated with
the asthma outcome, only one of which
performs well in the evaluation phase. Thus, the selection of the insulin signalling and PI3K pathways over the
other pathways in the database points towards a link
between the genetic features from these pathways, their interaction with each other, and asthma. Mitochondria-mediated apoptosis has been found to affect atopic asthma
through delayed cell death of neutrophils, which contributes to
neutrophilic inflammation in asthma [38]. Thus, the inflammatory reactions occurring in the phenotype, and their
role in the prediction of asthma, may be mediated through these
mitochondria-mediated apoptosis pathways. The pathway
ranking second in the evaluation list is the differentiation
pathway in PC12 cells, which has been detected in tumor cells of the adrenal glands. These cells are known to be
under the control of different growth factors [47].
The pathway definition contains the genes PI3K, protein kinase B (AKT) and cAMP response element-binding protein
(CREB), which are common to different pathways.
Thus, the selection of pathways to be tested and the definition of those pathways play a crucial role in the success of this
method. Taking into account the expression of genes in asthma-related tissues would help avoid false-positive hits.
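The suggestion above, that genotype probabilities could stand in for missing genotype calls, can be illustrated with a small sketch. It assumes an additive 0/1/2 allele-count coding and replaces a missing call by its expected allele dosage. This coding and the function names are assumptions made for illustration; the networks in this study used binary SNP features.

```python
def expected_dosage(p_hom_ref, p_het, p_hom_alt):
    """Expected count of the alternate allele (0..2) from genotype probabilities."""
    total = p_hom_ref + p_het + p_hom_alt
    if total <= 0:
        raise ValueError("genotype probabilities must sum to a positive value")
    # Normalise in case the three probabilities do not sum exactly to 1.
    return (p_het + 2.0 * p_hom_alt) / total


def encode_genotype(call, probs=None):
    """Return a numeric feature for one SNP in one individual.

    call  -- 0, 1 or 2 copies of the alternate allele, or None if the call is missing
    probs -- (p_hom_ref, p_het, p_hom_alt), used only when the call is missing
    """
    if call is not None:
        return float(call)
    if probs is None:
        raise ValueError("a missing call needs genotype probabilities")
    return expected_dosage(*probs)


if __name__ == "__main__":
    # Hypothetical individual with one hard call and one missing call.
    print(encode_genotype(2))
    print(encode_genotype(None, (0.1, 0.6, 0.3)))
```

With such an encoding, a SNP would not have to be dropped because a few samples lack a confident call, which is the gain in coverage referred to above.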
Though the other 6 pathways do not increase the performance of the 10 selected clinical features, they still have
predictive power and a role in the asthma phenotype. G alpha 12 pathway activity leads to transformation, regulation of mitogenesis, regulation of survival and induction of
stress fibers, and is under the control of thrombin [35].
A higher concentration of thrombin is found in the sputum
of asthmatic patients and is relevant to airway tissue remodelling during the disease [16]. FAS signalling, the
top-ranking pathway in the discovery cohort, belongs to the death receptor subgroup of the TNF receptor superfamily. The
three pathways whose test and evaluation
MCC are lower than that of the clinical features alone show that the
selected genetic features do not correlate with the clinical
features. Asthmatic cell lines have been found to be resistant to First apoptosis signal (FAS)-mediated apoptosis.
FAS signal transduction has also been suggested to contribute
to T-cell-dependent immunoinflammation in asthma [23].
So, the genetic features in the FAS signalling pathway are
relevant to asthma but might not present a coordinated effect with the selected clinical risk features. The other 2 pathways, which lower the MCC of the clinical features in
the test set only, are the B cell antigen receptor and IL-1 pathways. These are known candidates in the immune response
in asthmatic conditions. Thus, all of the top 11 pathways
selected by the discovery method are associated with asthma.
This method helps to reduce the complexity from thousands of SNPs and hundreds of genes to a few of the most predictive SNPs, which, when combined with the clinical features, can
be used as a predictive tool for asthma.
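The distinction drawn above, between pathways that raise the MCC of the clinical features and pathways that lower it, amounts to comparing each combined model against a clinical-only baseline. The following is a minimal sketch of that comparison. The function name, the pathway labels and all MCC values are hypothetical and are not results from this study.

```python
def split_by_baseline(baseline_mcc, pathway_mcc):
    """Rank pathways by MCC and separate those that beat the clinical-only baseline.

    baseline_mcc -- MCC of the model trained on clinical features alone
    pathway_mcc  -- dict mapping pathway name to MCC of the combined model
    Returns (improving, not_improving), each ordered from highest to lowest MCC.
    """
    ranked = sorted(pathway_mcc.items(), key=lambda item: item[1], reverse=True)
    improving = [name for name, value in ranked if value > baseline_mcc]
    not_improving = [name for name, value in ranked if value <= baseline_mcc]
    return improving, not_improving


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    baseline = 0.30
    combined = {"pathway A": 0.41, "pathway B": 0.22, "pathway C": 0.35}
    better, worse = split_by_baseline(baseline, combined)
    print("improve on clinical features:", better)
    print("do not improve:", worse)
```

A pathway in the second group is not necessarily irrelevant to the disease; as argued above, its genetic features may simply not act in coordination with the selected clinical features.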
Earlier prediction studies have been simplistic in the sense
that they were based on a small number of variants explaining only a fraction of the genetic variability. High predictive
value offers the chance of discovering interventions, whereas
low predictive results still provide a discovery tool and risk predictions. The aim of these combination studies is to define
a genetic risk score that can be used to predict the
disease risk of healthy people who currently show no symptoms of
the disease [50]. This would be of great advantage for the management and treatment of disease like
This method tries to maximise the exploration of the SNP space,
and thus the gene space, but it is still not complete. The same
SNPs are not present on all arrays and not all SNPs are detected in sequencing data, so it is difficult to find a one-to-one replication set. Due to the small size of the cohort
available with complete clinical data, we were not able to
use an external evaluation set, which might have led to
overfitting. A replication of the method
in an independent cohort would add more confidence to
the pathway selection and predictions. Functional studies
based on the replication results could lead to the causative
variations within pathways. The method could be improved
by trying more combinations, which would require
more parallelization. The definition of the
pathways also plays an important role in this method, and all
the resources of pathway information are under development, meaning that we might be missing some genes and
connections in our background data.
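Because no external evaluation set was available, the estimates depend on how the cohort is partitioned internally. As a minimal sketch of one common way to partition a small case-control cohort, the following splits sample indices into 4 folds while keeping the proportion of cases roughly equal in each fold. The stratification, the seed and the function names are assumptions for illustration and are not a description of the procedure used to train the networks.

```python
import random


def stratified_folds(labels, k=4, seed=0):
    """Assign sample indices to k folds, spreading each class evenly over the folds."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            # Continue the round-robin where the previous class stopped.
            folds[(position + offset) % k].append(index)
        offset += len(indices)
    return folds


def train_test_splits(labels, k=4, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


if __name__ == "__main__":
    # Hypothetical cohort: 8 cases (1) and 12 controls (0).
    labels = [1] * 8 + [0] * 12
    for train, test in train_test_splits(labels):
        cases_in_test = sum(labels[i] for i in test)
        print(len(train), len(test), cases_in_test)
```

Internal cross-validation of this kind reduces, but does not remove, the risk of overfitting noted above, which is why replication in an independent cohort remains important.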
This method allows the selection of pathways along with the
selection of features within these pathways. The identified
pathways and variations can serve as the basis of upcoming asthma studies and be studied in more detail. This
study prioritises clinical risk features as well as genetic
features with predictive power for asthma, which can lead
to further functional studies. It shows the advantage and success of combining multiple data sources
for better predictive power.
[1] K. C. Barnes, “Gene-environment and gene-gene interaction studies in the molecular genetic analysis of asthma
and atopy”, Clin Exp Allergy, Vol. 29 Suppl 4, pp. 47–51,
[2] D. D. W. Belsky, P. M. R. Sears, R. J. Hancox, H. Harrington, R. Houts, P. T. E. Moffitt, K. Sugden, B. Williams,
P. R. Poulton, and P. A. Caspi, “Polygenic risk and the development and course of asthma: an analysis of data from
a four-decade longitudinal study”, The Lancet Respiratory
Medicine, Vol. 1, No. 6, pp. 453 – 461, August 2013.
[3] H. Bisgaard, “The Copenhagen Prospective Study on
Asthma in Childhood (COPSAC): design, rationale, and
baseline data from a longitudinal birth cohort study”, Ann
Allergy Asthma Immunol, Vol. 93, No. 4, pp. 381–9, 2004.
[4] H. Bisgaard, M. N. Hermansen, F. Buchvald, L. Loland,
L. B. Halkjaer, K. Bonnelykke, M. Brasholt, A. Heltberg,
N. H. Vissing, S. V. Thorsen, M. Stage, and C. B. Pipper,
“Childhood asthma after bacterial colonization of the airway in neonates”, N Engl J Med, Vol. 357, No. 15, pp.
1487–95, 2007.
[5] H. Bisgaard, M. N. Hermansen, L. Loland, L. B. Halkjaer,
and F. Buchvald, “Intermittent inhaled corticosteroids in
infants with episodic wheezing”, N Engl J Med, Vol. 354,
No. 19, pp. 1998–2005, 2006.
[6] H. Bisgaard, C. B. Pipper, and K. Bonnelykke, “Endotyping early childhood asthma by quantitative symptom
assessment”, J Allergy Clin Immunol, Vol. 127, No. 5, pp.
1155–64 e2, 2011.
[7] K. Bonnelykke, P. Sleiman, K. Nielsen, E. Kreiner-Moller,
J. M. Mercader, D. Belgrave, H. T. den Dekker, A. Husby,
A. Sevelsted, G. Faura-Tellez, L. J. Mortensen, L. Paternoster, R. Flaaten, A. Molgaard, D. E. Smart, P. F. Thomsen, M. A. Rasmussen, S. Bonas-Guarch, C. Holst, E. A.
Nohr, R. Yadav, M. E. March, T. Blicher, P. M. Lackie,
V. W. Jaddoe, A. Simpson, J. W. Holloway, L. Duijts, A. Custovic, D. E. Davies, D. Torrents, R. Gupta,
M. V. Hollegaard, D. M. Hougaard, H. Hakonarson, and
H. Bisgaard, “A genome-wide association study identifies CDHR3 as a susceptibility locus for early childhood
asthma with severe exacerbations”, Nat Genet, Vol. 46,
No. 1, pp. 51–5, 2014.
[8] A. M. Brooks, R. S. Byrd, M. Weitzman, P. Auinger, and
J. T. McBride, “Impact of low birth weight on early childhood asthma in the United States”, Arch Pediatr Adolesc
Med, Vol. 155, No. 3, pp. 401–6, 2001.
[9] J. A. Castro-Rodriguez, C. J. Holberg, A. L. Wright, and
F. D. Martinez, “A clinical index to define risk of asthma
in young children with recurrent wheezing”, Am J Respir
Crit Care Med, Vol. 162, No. 4 Pt 1, pp. 1403–6, 2000.
[10] H. G. M.-K. M. P. M. M. P. e. a. Devulapalli CS,
Carlsen KC, “Severity of obstructive airways disease by
age 2 years predicts asthma at 10 years of age”, Thorax,
Vol. 63, pp. 8–13, 2008.
[11] A. Dijkstra, T. D. Howard, J. M. Vonk, E. J. Ampleford,
L. A. Lange, E. R. Bleecker, D. A. Meyers, and D. S.
Postma, “Estrogen receptor 1 polymorphisms are associated with airway hyperresponsiveness and lung function
decline, particularly in female subjects with asthma”, J
Allergy Clin Immunol, Vol. 117, No. 3, pp. 604–11, 2006.
[12] D. L. Duffy, N. G. Martin, D. Battistutta, J. L. Hopper,
and J. D. Mathews, “Genetics of asthma and hay fever in
Australian twins”, Am Rev Respir Dis, Vol. 142, No. 6 Pt
1, pp. 1351–8, 1990.
[13] B. Eftekhar, K. Mohammad, H. E. Ardebili, M. Ghodsi,
and E. Ketabchi, “Comparison of artificial neural network
and logistic regression models for prediction of mortality
in head trauma based on initial clinical data”, BMC Med
Inform Decis Mak, Vol. 5, p. 3, 2005.
[14] M. e. a. Ferreira, “Identification of IL6R and chromosome
11q13.5 as risk loci for asthma.”, Lancet, Vol. 378, pp.
1006–1014, 2011.
[15] P. Flicek, M. R. Amode, D. Barrell, K. Beal, S. Brent,
D. Carvalho-Silva, P. Clapham, G. Coates, S. Fairley,
S. Fitzgerald, L. Gil, L. Gordon, M. Hendrix, T. Hourlier,
N. Johnson, A. K. Kahari, D. Keefe, S. Keenan, R. Kinsella, M. Komorowska, G. Koscielny, E. Kulesha, P. Larsson, I. Longden, W. McLaren, M. Muffato, B. Overduin,
M. Pignatelli, B. Pritchard, H. S. Riat, G. R. Ritchie,
M. Ruffier, M. Schuster, D. Sobral, Y. A. Tang, K. Taylor,
S. Trevanion, J. Vandrovcova, S. White, M. Wilson, S. P.
Wilder, B. L. Aken, E. Birney, F. Cunningham, I. Dunham,
R. Durbin, X. M. Fernandez-Suarez, J. Harrow, J. Herrero,
T. J. Hubbard, A. Parker, G. Proctor, G. Spudich, J. Vogel,
A. Yates, A. Zadissa, and S. M. Searle, “Ensembl 2012”,
Nucleic Acids Res, Vol. 40, No. Database issue, pp. D84–
90, 2012.
[16] E. C. Gabazza, O. Taguchi, S. Tamaki, H. Takeya,
H. Kobayashi, H. Yasui, T. Kobayashi, O. Hataji,
H. Urano, H. Zhou, K. Suzuki, and Y. Adachi, “Thrombin in the airways of asthmatic patients”, Lung, Vol. 177,
No. 4, pp. 253–62, 1999.
[17] T. W. Guilbert, W. J. Morgan, M. Krawiec, J. Lemanske,
R. F., C. Sorkness, S. J. Szefler, G. Larsen, J. D. Spahn,
R. S. Zeiger, G. Heldt, R. C. Strunk, L. B. Bacharier,
G. R. Bloomberg, V. M. Chinchilli, S. J. Boehmer, E. A.
Mauger, D. T. Mauger, L. M. Taussig, and F. D. Martinez,
“The Prevention of Early Asthma in Kids study: design,
rationale and methods for the Childhood Asthma Research
and Education network”, Control Clin Trials, Vol. 25,
No. 3, pp. 286–310, 2004.
[18] E. Hafkamp-de Groen, H. F. Lingsma, D. Caudri,
D. Levie, A. Wijga, G. H. Koppelman, L. Duijts, V. W.
Jaddoe, H. A. Smit, M. Kerkhof, H. A. Moll, A. Hofman,
E. W. Steyerberg, J. C. de Jongste, and H. Raat, “Predicting asthma in preschool children with asthma-like symptoms: Validating and updating the PIAMA risk score”, J
Allergy Clin Immunol, Vol. 132, No. 6, pp. 1303–1310 e6,
[19] G. Hanifin, J.M. & Rajka, “Diagnostic features of atopic
dermatitis”, Acta Derm. Venereol., Vol. 92, pp. 44–47,
[20] S. Illi, E. von Mutius, S. Lau, R. Nickel, B. Niggemann,
C. Sommerfeld, and U. Wahn, “The pattern of atopic sensitization is associated with the development of asthma in
childhood”, J Allergy Clin Immunol, Vol. 108, No. 5, pp.
709–14, 2001.
[21] K. Ito, G. Caramori, and I. M. Adcock, “Therapeutic
potential of phosphatidylinositol 3-kinase inhibitors in inflammatory respiratory disease”, J Pharmacol Exp Ther,
Vol. 321, No. 1, pp. 1–8, 2007.
[22] J. J. Jaakkola and M. Gissler, “Maternal smoking in pregnancy, fetal development, and childhood asthma”, Am J
Public Health, Vol. 94, No. 1, pp. 136–40, 2004.
[23] S. Jayaraman, M. Castro, M. O’Sullivan, M. J. Bragdon,
and M. J. Holtzman, “Resistance to Fas-mediated T cell
apoptosis in asthma”, J Immunol, Vol. 162, No. 3, pp.
1717–22, 1999.
[24] O. Kolokotroni, N. Middleton, M. Gavatha, D. Lamnisos,
K. N. Priftis, and P. K. Yiallouros, “Asthma and atopy
in children born by caesarean section: effect modification
by family history of allergies - a population based cross-sectional study”, BMC Pediatr, Vol. 12, p. 179, 2012.
[25] G. H. Koppelman, G. J. te Meerman, and D. S. Postma,
“Genetic testing for asthma”, Eur Respir J, Vol. 32, No. 3,
pp. 775–82, 2008.
[26] H. S. A. S. Kurukulaaratchy RJ, Matthews S, “Predicting persistent disease among children who wheeze during
early life”, Eur Respir Journal, Vol. 22, pp. 767–71, 2003.
[27] M. P. H. G. P. M. M. K. M. e. a. Lodrup Carlsen KC,
Soderstrom L, “Severity of obstructive airways disease by
age 2 years predicts asthma at 10 years of age”, Allergy,
Vol. 65, pp. 1134–40, 2010.
[28] A. L. Marat and P. S. McPherson, “Variants of DENND1B
associated with asthma in children”, N Engl J Med, Vol.
363, No. 10, pp. 988–9; author reply 989, 2010.
[29] M. F. Moffatt, I. G. Gut, F. Demenais, D. P. Strachan, E. Bouzigon, S. Heath, E. von Mutius, M. Farrall, M. Lathrop, and W. O. Cookson,
“A large-scale, consortium-based genomewide association study of
asthma”, N Engl J Med, Vol. 363, No. 13, pp. 1211–21,
[30] L. J. Mortensen, E. Kreiner-Moller, H. Hakonarson,
K. Bonnelykke, and H. Bisgaard, “The PCDH1 gene and
asthma in early childhood”, Eur Respir J, Vol. 43, No. 3,
pp. 792–800, 2014.
[31] E. A. Nohr, N. J. Timpson, C. S. Andersen,
G. Davey Smith, J. Olsen, and T. I. A. Sorensen,
“Severe obesity in young women and reproductive health:
the Danish National Birth Cohort”, PloS one, Vol. 4,
No. 12, p. e8444, 2009.
[32] B. Norgaard-Pedersen and D. M. Hougaard, “Storage policies and use of the Danish Newborn Screening Biobank”,
Journal of inherited metabolic disease, Vol. 30, No. 4, pp.
530–6, 2007.
[33] G. K. D. P. O.E. Savenije, M. Kerkhof, “Predicting who
will have asthma at school age among preschool children”,
J Allergy Clin Immunol, Vol. 130, No. 2, pp. 325–331, 2012.
[34] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A.
Ferreira, D. Bender, J. Maller, P. Sklar, P. I. de Bakker,
M. J. Daly, and P. C. Sham, “PLINK: a tool set for whole-genome association and population-based linkage analyses”, Am J Hum Genet, Vol. 81, No. 3, pp. 559–75, 2007.
[35] S. R. N. Ravi Iyengar, Prahlad Ram, “G alpha 12 Pathway”, G alpha 12 Pathway.
[36] S. Roongapinun, S. Y. Oh, F. Wu, A. Panthong, T. Zheng,
and Z. Zhu, “Role of SHIP-1 in the adaptive immune responses to aeroallergen in the airway”, PLoS One, Vol. 5,
No. 11, p. e14174, 2010.
[37] D. Rumelhart, G. Hinton, and R. Williams, Learning internal representations by error propagation, Parallel Distributed Processing, vol. 1, 318-362, MIT Press, Cambridge, 1986.
[38] A. S. Saffar, M. P. Alphonse, L. Shan, K. T. Hayglass, F. E.
Simons, and A. S. Gounni, “IgE modulates neutrophil
survival in asthma: role of mitochondrial pathway”, J Immunol, Vol. 178, No. 4, pp. 2535–41, 2007.
[39] D. C. Simoes, A. M. Psarra, T. Mauad, I. Pantou, C. Roussos, C. E. Sekeris, and C. Gratziou, “Glucocorticoid and
estrogen receptors are reduced in mitochondria of lung epithelial cells in asthma”, PLoS One, Vol. 7, No. 6, p.
e39183, 2012.
[40] S. Singh, Y. S. Prakash, A. Linneberg, and A. Agrawal,
“Insulin and the Lung: Connecting Asthma and Metabolic
Syndrome”, J Allergy (Cairo), Vol. 2013, p. 627384, 2013.
[41] P. M. Sleiman, J. Flory, M. Imielinski, J. P. Bradfield,
K. Annaiah, S. A. Willis-Owen, K. Wang, N. M. Rafaels,
S. Michel, K. Bonnelykke, H. Zhang, C. E. Kim, E. C.
Frackelton, J. T. Glessner, C. Hou, F. G. Otieno, E. Santa,
K. Thomas, R. M. Smith, W. R. Glaberson, M. Garris,
R. M. Chiavacci, T. H. Beaty, I. Ruczinski, J. S. Orange,
J. Allen, J. M. Spergel, R. Grundmeier, R. A. Mathias,
J. D. Christie, E. von Mutius, W. O. Cookson, M. Kabesch,
M. F. Moffatt, M. M. Grunstein, K. C. Barnes, M. Devoto, M. Magnusson, H. Li, S. F. Grant, H. Bisgaard, and
H. Hakonarson, “Variants of DENND1B associated with
asthma in children”, N Engl J Med, Vol. 362, No. 1, pp.
36–44, 2010.
[42] M. W. Su, K. Y. Tung, P. H. Liang, C. H. Tsai, N. W.
Kuo, and Y. L. Lee, “Gene-gene and gene-environmental
interactions of childhood asthma: a multifactor dimension
reduction approach”, PLoS One, Vol. 7, No. 2, p. e30694,
[43] G. P. Tamesis, R. A. Covar, M. Strand, A. H. Liu, S. J.
Szefler, and M. D. Klinnert, “Predictors for asthma at age
7 years for low-income children enrolled in the Childhood
Asthma Prevention Study”, J Pediatr, Vol. 162, No. 3, pp.
536–542 e2, 2013.
[44] B. N. Tran, N. D. Nguyen, V. X. Nguyen, J. R. Center,
J. A. Eisman, and T. V. Nguyen, “Genetic profiling and
individualized prognosis of fracture”, J Bone Miner Res,
Vol. 26, No. 2, pp. 414–9, 2011.
[45] D. J. Turner, S. M. Stick, K. L. Lesouef, P. D. Sly, and
P. N. Lesouef, “A new technique to generate and assess
forced expiration from raised lung volume in infants”, Am
J Respir Crit Care Med, Vol. 151, No. 5, pp. 1441–50,
[46] C. E. van Beijsterveldt and D. I. Boomsma, “Genetics of
parentally reported asthma, eczema and rhinitis in 5-yr-old
twins”, Eur Respir J, Vol. 29, No. 3, pp. 516–21, 2007.
[47] A. M. Vignola, G. Chiappara, P. Chanez, A. M.
Merendino, E. Pace, M. Spatafora, J. Bousquet, and
G. Bonsignore, “Growth factors in asthma”, Monaldi Arch
Chest Dis, Vol. 52, No. 2, pp. 159–69, 1997.
[48] K. Wang, M. Li, and H. Hakonarson, “Analysing biological pathways in genome-wide association studies”, Nat
Rev Genet, Vol. 11, No. 12, pp. 843–54, 2010.
[49] M. Weitzman, S. Gortmaker, and A. Sobol, “Racial, social, and environmental risks for childhood asthma”, Am
J Dis Child, Vol. 144, No. 11, pp. 1189–94, 1990.
[50] N. R. Wray, M. E. Goddard, and P. M. Visscher, “Prediction of individual genetic risk to disease from genome-wide association studies”, Genome Res, Vol. 17, No. 10,
pp. 1520–8, 2007.
[51] N. Yiannakouris, A. Trichopoulou, V. Benetou,
T. Psaltopoulou, J. M. Ordovas, and D. Trichopoulos, “A direct assessment of genetic contribution to the
incidence of coronary infarct in the general population
Greek EPIC cohort”, Eur J Epidemiol, Vol. 21, No. 12,
pp. 859–67, 2006.
Part III
Obesity aetiology
Obesity is another endemic complex disease, growing at high rates in developed as well as developing parts of the world [211]. Obesity is caused
when energy intake exceeds energy expenditure [212] and is influenced by
genetics, diet, age and lifestyle [213]. Physiologically, obesity presents
when abnormal amounts of triglycerides are stored in adipose tissue and
later released from it as free fatty acids (FFA), with detrimental
effects on other organs. Obesity can lead to other chronic conditions, including cardiovascular disease, type II diabetes mellitus, osteoarthritis of the
lower extremities and mobility disorders, and it increases overall mortality. The
goal of ongoing obesity research is to elucidate the pathways and mechanisms
that control obesity and to improve prevention, management and therapy
[214]. Adipose tissue plays a major role in nutrient homeostasis by serving
as the energy storage organ and as the source of energy during fasting, which
makes it important in the pathophysiology of obesity. Adipose tissue is a mesh
of different cells, such as adipocytes (commonly called fat cells) and stromal cells,
together with vessels and nerves, held together by elements of the extracellular matrix. Adipose
tissue is also regarded as an endocrine organ, as it secretes factors such as adipsin,
TNF-α and leptin, which are known to affect the activity of other organs.
Adipose tissue depots also differ in size and function, and their potential contribution to
disease depends on their type and location within the body. In humans,
adipose tissue can be broadly classified into subcutaneous (below the skin)
and visceral (around the organs). In mice, the adipose tissue is made up
of two subcutaneous depots, called the inguinal, and several visceral depots
near multiple organs. For example, the fat near the kidneys is called perirenal
and the fat near the epididymis is called epididymal.
Adipose tissue
Traditionally, adipocytes have been classified on the basis of their morphology into two types: brown adipocytes and white adipocytes (Figure
II.1). Brown adipocytes contain numerous fat droplets and are specialised to
dissipate stored chemical energy in the form of heat. They make up the brown
adipose tissue (BAT). Brown adipocytes are characterised by high expression
of uncoupling protein-1 (UCP-1), which catalyses the leak of protons across the
inner mitochondrial membrane, uncoupling respiration from adenosine triphosphate (ATP) synthesis.
White adipocytes, on the other hand, are present in the white adipose tissue
(WAT) and are unilocular cells known for storing energy and increasing weight.
Large mammals such as humans are born with brown fat that disappears in the first
few months after birth and is later replaced by WAT. Another,
intermediate form of adipocyte is the “beige” or “brite” adipocyte:
a UCP-1-positive cell with a brown fat-like morphology found within white
fat depots.
Figure II.1 White adipose cell and brown adipose cell (panels: white fat cell, brown fat cell). M = Mitochondria, LV =
Lipid vesicles.
Adipocytes develop from mesenchyme, but there are disagreements in the field
about the origin of brown and white fat cells and their replacement by each
other. A recent review by Rosen et al. [215] discusses the state-of-the-art knowledge
about the different adipocytes and their mechanisms of survival. BAT is more
efficient in energy utilisation and is seen as a prospective key to
preventing obesity. Adult humans have a small amount of functional BAT,
and the detection of brown adipose tissue correlates inversely with age [216].
Mechanisms to increase the BAT content of the body, or to make it more
efficient, have been sought as therapeutic approaches to overcome obesity.
Knowledge of the mechanism by which BAT is converted to
WAT in early life in mammals is one piece of the puzzle, and it is important for
understanding BAT-WAT interconversion.
Chapter 5 presents work in the field of adipose biology in which the
conversion of brown adipose tissue to white adipose tissue has been modelled in
another precocial mammal, the sheep, to replicate what happens in humans.
Precocial mammals have a long gestation period and are born with UCP-1-expressing
brown fat, which is later replaced by white fat [217]. This replacement happens with an immediate start of non-shivering thermogenesis at
birth [217]. The project aims to identify the factors responsible for the
replacement of BAT by WAT through transcriptome profiling of
adipose tissue using RNA-seq over a period of 14 days, starting 2 days
before birth.
Figure II.2 The transformation of white adipose tissue during obesity. In obesity, adipose tissue undergoes hypertrophy followed by inflammation by invading [...] [215].
[...] is basically growth involving [...] of adipocytes
called hypertrophy (Figure II.2) as well as increase in number of adipocytes
by the recruitment of new adipocytes called hyperplasia. Hypertrophy
usually precedes hyperplasia. This pre-adipocytes to adipocytes conversion
during hyperplasia is influenced by neural inputs and hormones secreted either
by other endocrine organs or by adipose tissue itself. With the increase
in size of the adipose tissue, the neurovasculature development also occurs
to supply blood and nervous signal to the enlarged tissue as well as to drain
the lymph.
... the growth of adipose ... the vasculature ... status ... in obesity with elevated ... of adipokines, e.g. leptin, by adipocytes ... and cytokines ... (Figure II.2). Accumulating evidence in epidemiological ... in obesity has implied a role for ... in the prenatal stage or childhood and risks of adult-onset ... obesity ...
including insulin resistance, referred to as “fetal programming” [218]. On
the other hand, obesity is also known as a lifestyle disease, and junk food and
fats are blamed for the increased rate of obesity not only in adults but
also in children. Thus, current research in obesity explores both the
genetics and the effect of diet.
Obesity and Epigenetics
Obesity has a genetic component: the human obesity gene map of 1996
collected 127 genes from various studies linked to obesity phenotypes [219].
The obesity gene atlas identified 1,515 protein-coding genes and 221 miRNAs
compiled from studies in four different mammals: human, cattle, rat, and
mouse [220]. Along with genetics, alterations in epigenetic marks such as
DNA and histone methylation are also connected to body weight and weight
loss [221]. Thus, epigenetic mark profiling can be used to predict the susceptibility to obesity, and with the implementation of weight loss programs
and other therapeutic approaches the negative outcome can be prevented.
Leptin and TNF-α methylation levels can be used as epigenetic biomarkers
for weight loss as well as for comorbidities such as diabetes and hypertension
[222]. Environmental exposures are likely to have an epigenetic effect on
complex diseases; for example, tobacco smoke is known to modify gene expression through DNA hypermethylation [223]. Body weight homeostasis is regulated
through a complex mechanism involving genetics and epigenetics, which are
influenced by dietary intake and physical activity [224]. In a human study,
dietary folic acid intake was associated with DNA methylation, with
a transgenerational effect [225]. Thus, the food we eat directly or indirectly
affects the cell and its epigenome. Leptin-deficient (ob/ob) mice are a widely
used mouse model of genetic obesity and diabetes, as they develop hyperglycemia
and obesity. High-fat foods are highly prevalent in modern
society and are a main contributor to human obesity. The high-fat
diet fed mouse model closely mimics high-fat diet induced obesity in humans
and thus serves as an important model to study obesity caused by high-fat
foods. These mice have elevated blood glucose, impaired glucose tolerance,
and subsequently acquire insulin resistance.
With the knowledge of epigenetic biomarkers and the exposome for
obesity, two models of obesity, genetic and diet induced, were designed. For
the genetic model, DNA methylation levels were compared between ob/ob and
wild-type mice. In the diet-induced obesity model, mice were fed a
high-fat diet for 15 weeks and methylation was compared to that of mice fed a regular diet.
Gene expression data were also used to support the methylation data
in the diet model, as it is of more clinical relevance for diet-induced obesity
in humans. To find out how different adipose tissues react to genetic and diet-induced
obesity, both inguinal and epididymal fat depots were examined
and compared. The results from this study are presented in chapter 6.
Chapter 5
Paper III - Brown to white
adipose tissue transition
In all mammals (including humans) there are two types of adipose tissue,
BAT and WAT. BAT is specialised in energy dissipation and the generation of
heat by oxidation of glucose and fatty acids, whereas WAT is wired for energy
storage. This project addresses the postnatal transformation of the innate brown adipose tissue to
white adipose tissue in sheep (Ovis aries). From earlier studies it is known
that this transformation takes place within about two weeks after birth. As
the sheep is a large mammal, the transformation in sheep is likely to mimic the postnatal brown-to-white adipose conversion occurring in newborn human babies.
To find out the underlying mechanism of this transformation, adipose
tissue was collected from sheep at multiple time points, including time
points before birth (day -2) and shortly after birth (until day 14). Tag-based RNA
sequencing, as described in chapter 1.4, was performed on the samples to
evaluate differential gene expression between time points. The seven
time points were clustered based on gene expression data into three classes
representing brown adipose tissue, a transition state and white adipose tissue.
The differentially expressed genes between these three states represent the
changes occurring when one type of adipose tissue is transformed into another. Gene ontology and pathway enrichment analyses were performed on the significantly
changing gene clusters to uncover the underlying biological mechanisms during the transformation. Functional analysis was carried out to reveal novel
TFs linked to the adipose transformation process in large mammals.
Submitted manuscript
Global gene expression profiling of brown to white adipose tissue
transformation in sheep reveals novel transcriptional components
linked to adipose remodeling
Astrid L. Basse1,2,**, Karen Dixen1,2,**, Rachita Yadav3,**, Malin P. Tygesen4, Klaus Qvortrup2, Karsten Kristiansen1,
Bjørn Quistorff2, Ramneek Gupta∗3, Jun Wang1,5,6,7 and Jacob B. Hansen†1
1 Department of Biology, University of Copenhagen, DK-2100 Copenhagen, Denmark
2 Department of Biomedical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
3 Center for Biological Sequence Analysis, The Technical University of Denmark, Copenhagen, Denmark
4 Department of Veterinary Clinical and Animal Sciences, University of Copenhagen, DK-1870 Frederiksberg, Denmark
5 BGI-Shenzhen, Shenzhen 518083, China
6 Princess Al Jawhara Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Jeddah 21589, Saudi Arabia
7 Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau 999078, China
** Joint first authorship
Large mammals are capable of thermoregulation shortly after birth due to the presence of brown adipose
tissue (BAT). The majority of BAT disappears after birth and is replaced by white adipose tissue (WAT).
We analyzed the postnatal transformation of adipose in sheep with a time course study of the perirenal adipose
depot. We observed changes in tissue morphology, gene expression and metabolism within the first two weeks
of postnatal life consistent with the expected transition from BAT to WAT. The transformation was characterized by massively decreased mitochondrial abundance and down-regulation of gene expression related to
mitochondrial function and oxidative phosphorylation. Global gene expression profiling demonstrated that the
time points grouped into three phases; a brown adipose phase, a transition phase and a white adipose phase.
Between the brown adipose and the transition phase 170 genes were differentially expressed, and 717 genes
were differentially expressed between the transition and the white adipose phase. Thirty-eight genes were
shared among the two sets of differentially expressed genes. We identified a number of regulated transcription
factors, including NR1H3, MYC, KLF4, ESR1, RELA and BCL6, which were linked to the overall changes
in gene expression during the adipose tissue remodeling. Finally, the perirenal adipose tissue expressed both
brown and brite/beige adipocyte marker genes at birth, the expression of which changed substantially over time.
Using global gene expression profiling of the postnatal BAT to WAT transformation in sheep, we provide
novel insight into adipose tissue plasticity in a large mammal, including identification of novel transcriptional
components linked to adipose tissue remodeling. Moreover, our data set provides a useful resource for further
studies in adipose tissue plasticity.
Keywords: BAT, brite/beige adipose tissue, global gene expression profiling, mitochondrial
number, sheep, tag-based sequencing, transcription factors, UCP1, WAT
∗ Corresponding author
† Corresponding author
Two types of adipose tissue exist based on morphological
appearance and biological function. White adipose tissue
(WAT) stores energy in the form of triacylglycerol (TAG)
for later release and use by other tissues, whereas brown
adipose tissue (BAT) metabolizes fatty acids and glucose
for heat production. Thermogenesis through uncoupled
respiration in BAT depends on a high mitochondrial density and expression of uncoupling protein 1 (UCP1) [11].
Larger mammals such as primates and ruminants are born
fully developed and able to thermoregulate minutes after
birth due to the presence of relatively large amounts of
functional BAT, which becomes activated at birth. The
majority of this BAT disappears after birth and is replaced
by WAT [11, 12, 21]. In contrast to larger mammals, rodents
are born with immature BAT that matures only postnatally and is largely retained throughout life [11, 36]. It
is relevant to understand the brown to white adipose tissue remodeling in large mammals, as it is likely to mimic
the transition occurring in human infants. The most frequently studied adipose tissue transition in a large mammal is the postnatal transformation of the perirenal adipose tissue in sheep. Around the time of birth, all visceral
adipose depots in lambs are brown in nature [20, 21].
Lambs are normally born with approx. 30 g of perirenal
adipose tissue, constituting 80% of all their adipose tissue. The brown characteristics of the perirenal adipose
depot change dramatically to a white adipose phenotype
within a few weeks after birth [20, 21]. Although some
gene expression details have been reported [33, 40], relatively little is known about this transition at the molecular
level. Here we report a comprehensive time course analysis of the postnatal BAT to WAT transformation process of
the perirenal adipose depot in lambs, including histological, biochemical and molecular examination as well as
analyses of global gene expression profiles. We provide
evidence for dramatic changes in mitochondrial function
and fatty acid metabolism during the adipose remodeling,
and we identify a number of transcriptional components
linked to this adipose tissue transformation process.
Characterization of the postnatal brown to white adipose transformation
At birth (designated day 0) the perirenal adipose tissue
macroscopically appeared dark brown. The brown color
faded steadily during the time course, and the tissue ended
up being white in appearance at postnatal days 30 and
60 (data not shown). Accompanying the “whitening”, the
volume of the tissue gradually increased (data not shown).
HE-stained sections were prepared from all lambs, and
representative sections from days 0, 2, 4, 14 and 30 are
presented in Figure 1A. In the first week of life, the tissue was an apparent mixture of brown adipocytes with
multilocular lipid droplets and white adipocytes with large
unilocular lipid droplets. By appearance, the perirenal adipose tissue contained mostly brown adipocytes at early
ages (days 0 to 4), whereas white adipocytes were predominant from day 14. To approach the BAT to WAT
transformation in molecular terms, we measured mRNA
and protein levels of selected marker genes by RT-qPCR
and immunoblotting, respectively (Figure 1B and 1C). Expression of UCP1, the brown adipocyte-specific key thermogenic factor, was high and relatively stable until day
4, after which it became nearly undetectable. The BAT-enriched factors type II iodothyronine deiodinase (DIO2)
and peroxisome proliferator-activated receptor γ (PPARG)
co-activator 1α (PPARGC1A) were also highly expressed
at day -2 and 0, but displayed a faster and stepwise decrease in expression, being considerably reduced already
at days 0.5 and 1 and poorly expressed after day 4 (Figure 1B).
In summary, at the level of macroscopic, microscopic and
molecular analyses, we observed the expected postnatal
transformation of BAT to WAT.
Mitochondrial density declined during brown to white
adipose transformation
The ultra-structure of the perirenal adipose tissue was investigated at selected days by TEM (Additional file 1).
TEM confirmed the mixed presence of multilocular and
unilocular adipocytes at days 0 to 4 and the predominant
presence of the latter at day 14. Adipocyte mitochondrial
density was very high at days 0 to 4 and appeared lower at
day 14. To estimate mitochondrial density quantitatively,
we determined mtDNA content by qPCR as the ratio of
mtDNA and nDNA (Figure 2A). This ratio decreased approx. 7-fold between days 0 and 60, indicating that the
number of mitochondria per cell diminished during the
BAT to WAT transformation.
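The mtDNA/nDNA ratio described above can be illustrated with a minimal sketch. It assumes that the qPCR read-out is a threshold cycle (Ct) per target and that both primer pairs amplify with the same efficiency; neither detail is stated in the text. The function name `relative_mtdna` and all Ct values are hypothetical and serve only as an illustration.

```python
from statistics import mean


def relative_mtdna(ct_mt: float, ct_nuclear: float, efficiency: float = 2.0) -> float:
    """Relative mtDNA copy number as the MT-CO1 / ST7 signal ratio.

    Assumption: both amplicons amplify with the same efficiency, so the ratio
    of starting template amounts is efficiency ** (Ct_nuclear - Ct_mt).
    """
    return efficiency ** (ct_nuclear - ct_mt)


# Hypothetical Ct values (not from the study): (Ct MT-CO1, Ct ST7) per animal.
ct_values = {
    0: [(15.0, 25.0), (15.2, 25.1)],
    60: [(17.9, 25.0), (18.0, 25.2)],
}

# Mean ratio per day, then the fold decrease between day 0 and day 60.
ratios = {
    day: mean(relative_mtdna(mt, nuc) for mt, nuc in pairs)
    for day, pairs in ct_values.items()
}
fold_decrease = ratios[0] / ratios[60]
print(ratios, fold_decrease)
```

A later Ct for the mitochondrial target at constant nuclear Ct lowers the ratio, which is the direction of change reported for the BAT to WAT transformation.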
The ultra-structural observations and the mtDNA/nDNA
ratios prompted us to investigate more carefully gene expression of relevance for mitochondrial abundance and
function. The mRNA levels of the tricarboxylic acid (TCA)
cycle enzyme citrate synthase (CS) decreased gradually during the time course
(Figure 2B). CS activity, on the other hand, was high and
stable until day 4, after which it dropped (Figure 2C).
Two other mitochondrial genes were analyzed: the electron transporter cytochrome c1 (CYC1) and ATP synthase subunit β (ATP5B). Levels of CYC1 mRNA (Figure 2D) and ATP5B protein (Figure 1C) displayed a time profile similar to that of CS activity.
A number of nuclear transcription factors regulate expression of genes encoding mitochondrial proteins. These nuclear transcription factors include PGC-1 family members, nuclear respiratory factor 1 (NRF1) and a number of
nuclear receptors. In addition to PPARGC1A (see Figure
1B), we measured the expression of PPARGC1B, PPARA
(also known as NR1C1), estrogen-related receptor α (ERRA,
also known as NR3B1) and NRF1 by RT-qPCR (Additional file 1). The expression pattern of PPARGC1B and
ERRA was similar to CYC1, with stable expression until
day 4, followed by lower expression at subsequent time
points. Expression of PPARA and NRF1 transiently decreased after birth (Additional file 1). Accordingly, based
on ultra-structure, relative mtDNA measurements, expression and activity of key mitochondrial enzymes as well as
expression of nuclear transcription factors controlling levels of mitochondrial factors, we concluded that mitochondrial density and function declined remarkably during the
transition from BAT to WAT.
Figure 1: Characterization of the postnatal brown to white adipose transformation. (A) Hematoxylin-eosin (HE) staining of
perirenal adipose tissue at postnatal days 0, 2, 4, 14 and 30.
Representative HE-stained sections are shown for the indicated
time points (n = 5). (B) Total RNA was isolated from perirenal adipose tissue and used for RT-qPCR analysis. Relative expression was measured for uncoupling protein 1 (UCP1), type II
iodothyronine deiodinase (DIO2) and peroxisome proliferator-activated receptor γ (PPARG) co-activator 1α (PPARGC1A).
The mRNA expression levels were normalized to expression of
β-actin (ACTB). Data are mean +SEM (n = 4-5); *, p<0.05 vs.
day 0. (C) The level of uncoupling protein 1 (UCP1) and ATP
synthase subunit β (ATP5B) was determined by immunoblotting
on protein pools, one for each day during the time course. Transcription factor IIB (TFIIB) was used as a loading control.
Figure 2 panels (x-axis: Age (days)): mtDNA / nDNA; Relative CS mRNA; CS U / mg protein; Relative CYC1 mRNA. ANOVA p < 0.0001 for each panel.
Figure 2: Mitochondrial density declines during brown to white
adipose transformation. (A) Total DNA was isolated from
perirenal adipose tissues and analyzed by qPCR with primers
specific for mtDNA (cytochrome c oxidase I (MT-CO1)) and
nDNA (suppression of tumorigenicity 7 (ST7)). The relative
mtDNA copy number was obtained as the ratio of MT-CO1 to
ST7 levels. (B) Total RNA was isolated from perirenal adipose
tissue and used for RT-qPCR analysis. Relative expression was
measured for citrate synthase (CS). The mRNA expression levels were normalized to expression of β-actin (ACTB). (C) Enzyme activity (U) of CS was determined and normalized to protein content. (D) Relative expression of cytochrome c1 (CYC1)
was measured by RT-qPCR as described in panel B. Data are
mean +SEM (n = 4-5); *, p <0.05 vs. day 0.
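The figure legends report group means with SEM and an ANOVA p-value per panel. The sketch below shows how these summary statistics could be computed, assuming a one-way ANOVA across age groups (the ANOVA design is not specified in the text). The function names and all expression values are hypothetical. Only the F statistic is computed; converting it to a p-value requires the F distribution, which is not part of the Python standard library.

```python
from math import sqrt
from statistics import mean, stdev


def mean_sem(values):
    """Return (mean, standard error of the mean) for one group."""
    return mean(values), stdev(values) / sqrt(len(values))


def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical relative mRNA levels per lamb, keyed by age in days.
by_day = {
    0: [1.0, 1.2, 0.9, 1.1],
    4: [0.8, 0.9, 1.0, 0.7],
    14: [0.2, 0.3, 0.1, 0.2],
}

summary = {day: mean_sem(values) for day, values in by_day.items()}
f_stat = anova_f(list(by_day.values()))
print(summary, f_stat)
```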
Global gene expression analysis of postnatal brown to
white adipose transformation
To obtain a global view of gene expression changes during
the BAT to WAT transformation in the perirenal adipose
depot, tag-based sequencing was performed on pools of
mRNA from days -2, 0, 0.5, 1, 2, 4 and 14. The resulting reads were mapped to the sheep genome (v3.1). We
were able to map approx. 20% of the total reads from
all 7 time points to 13,963 annotated genes in the sheep
genome. The relatively low number of mapped reads reflects the low annotation coverage available for the
sheep genome. The number of mapped reads per annotated gene was counted, and normalized read counts for
the 7 time points were calculated (Additional file 2). To
facilitate downstream analyses, the human genes homologous to the sheep genes were mined, and the results from
the gene expression analysis are discussed using the human protein symbols and names.
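The text does not state which normalization was applied to the read counts. As an illustration only, the sketch below scales the counts of each time point to tags per million mapped reads, a common choice for tag-based sequencing; the function name `tags_per_million` and all counts are hypothetical.

```python
def tags_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale raw per-gene read counts of one sample to tags per million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


# Hypothetical raw counts for two of the seven time points.
raw_counts = {
    "day 0": {"FABP4": 9000, "UCP1": 900, "CS": 100},
    "day 14": {"FABP4": 19000, "UCP1": 0, "CS": 1000},
}

# Normalizing per time point makes libraries of different depth comparable.
normalized = {tp: tags_per_million(c) for tp, c in raw_counts.items()}
print(normalized)
```

Scaling by library size removes differences in sequencing depth between time points, but it does not correct for composition effects, so it should be read as a baseline rather than as the method used in the study.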
Next, we compared the expression of the genes measured
by RT-qPCR in Figure 1, 2 and Additional file 1 to their
expression in the tag-based sequencing data. Expression of ...
and ERRA decreased from day 0 to day 14 in both the
sequencing data and when measured by RT-qPCR (Figure
1, 2, Additional file 1 and 2). In general, there was a relatively high correlation in the expression data obtained by
the two methods.
Figure 3: Identification of the brown adipose phase, transition
phase and white adipose phase. (A) Principal component analysis (PCA) plot for the expression data from seven time points
showing the clustering of time points in the first two components. (B) Heatmap showing the hierarchical clustering based
on Euclidean distances between the time points. (C) Allocation
of the different time points to the three phases and summary of
numbers of induced and repressed genes between phases: 73 genes up-regulated and 97 genes down-regulated from the brown adipose phase to the transition phase; 378 genes up-regulated and 339 genes down-regulated from the transition phase to the white adipose phase.
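The agreement between RT-qPCR and tag-based sequencing can be quantified with a correlation coefficient. The text does not say which measure was used; the sketch below assumes a Pearson correlation on log2-transformed day 14 / day 0 expression ratios. The function name and all ratio values are hypothetical.

```python
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical day 14 / day 0 expression ratios for the same genes.
qpcr_ratio = {"UCP1": 0.02, "DIO2": 0.10, "CS": 0.40, "CYC1": 0.30}
seq_ratio = {"UCP1": 0.05, "DIO2": 0.08, "CS": 0.50, "CYC1": 0.35}

genes = sorted(qpcr_ratio)
r = pearson(
    [log2(qpcr_ratio[g]) for g in genes],
    [log2(seq_ratio[g]) for g in genes],
)
print(r)
```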
The most highly expressed gene at both day 0 and day 14
was fatty acid-binding protein 4 (FABP4), a gene known
to be strongly enriched in adipocytes. Among the 20 genes
with the highest expression level at day 0 and day 14 were
several genes encoding ribosomal proteins, FABP5, the
fatty acid transporter cluster of differentiation 36 (CD36),
the glycolytic enzyme aldolase A (ALDOA) and regulator
of cell cycle (RGCC), a cell cycle regulator and kinase
modulating protein. Genes highly expressed at day 14
included the pentose phosphate pathway enzyme transaldolase (TALDO1) and catalase (CAT). When comparing
the 20 most highly expressed genes, genes related to fatty
acid oxidation, electron transport chain and ATP synthase
activity were more prevalent at day 0 compared to day 14.
Identification of three phases in the brown to white
adipose tissue transformation process
To analyze the distribution of gene expression, a PCA was
performed on the total gene expression data set (Figure
3A). The PCA plot indicated that total gene expression
at the different time points clustered into three groups: a
group including days -2 and 0, a second group including
days 0.5, 1, 2 and 4, and a third group comprising day
14. Hierarchical clustering of the total gene expression
data set clustered the 7 time points into the same three
groups (Figure 3B). We interpreted the three clusters as
distinct phases in the BAT to WAT transition (Figure 3C).
At days -2 and 0 the tissue is in the brown adipose phase.
The tissue is in a transition phase at days 0.5, 1, 2 and
4, where gene expression starts to change, e.g. illustrated
by the decrease in PPARGC1A expression from day 0 to
day 0.5 (see Figure 1B). Day 14 represents the white adipose phase, as was also suggested by tissue morphology,
mitochondrial numbers and function as well as expression
level of UCP1 (Figure 1 and 2).
Table 1: GO enrichment analysis of the 170 genes differentially expressed from the brown adipose phase to the
transition phase.
GO Term
Enrichment from up-regulated genes:
Smooth muscle cell migration
Negative regulation of adaptive immune response
Enrichment from down-regulated genes:
Organic acid metabolic process
Isocitrate metabolic process
... metabolic process
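The grouping of the seven time points by hierarchical clustering on Euclidean distances (Figure 3B) can be illustrated with a small sketch. The linkage criterion is not stated in the text, so average linkage is an assumption here; the function name and the three-gene expression profiles are hypothetical toy data.

```python
from itertools import combinations
from math import dist


def hierarchical_clusters(profiles: dict[str, list[float]], n_clusters: int) -> list[set[str]]:
    """Agglomerative clustering with Euclidean distance and average linkage."""
    clusters = [{name} for name in profiles]

    def linkage(a: set[str], b: set[str]) -> float:
        # Average pairwise Euclidean distance between members of a and b.
        pairs = [(p, q) for p in a for q in b]
        return sum(dist(profiles[p], profiles[q]) for p, q in pairs) / len(pairs)

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


# Hypothetical normalized expression of three genes at each time point.
profiles = {
    "day -2": [10.0, 9.0, 1.0],
    "day 0": [9.5, 9.2, 1.2],
    "day 0.5": [6.0, 6.5, 3.0],
    "day 1": [5.8, 6.0, 3.3],
    "day 2": [5.5, 5.8, 3.5],
    "day 4": [5.0, 5.5, 4.0],
    "day 14": [1.0, 1.5, 9.0],
}

phases = hierarchical_clusters(profiles, n_clusters=3)
print(phases)
```

With these toy profiles the three remaining clusters correspond to days -2 and 0, days 0.5 to 4, and day 14, mirroring the brown adipose, transition and white adipose phases described in the text.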
The expression of 170 genes changed significantly (p-value
< 0.1) between the brown adipose phase and the transition
phase (Additional file 3). Of these, 73 genes were up-regulated and 97 genes were down-regulated (Figure 3C).
A heatmap with Euclidean distances for the 170 genes is
shown in Figure 4A. GO enrichment analysis on the 73
up-regulated genes revealed that they were enriched for
genes related to “negative regulation of adaptive immune
response” and “smooth muscle cell migration”, whereas
the 97 down-regulated genes were enriched for genes related to “organic acid metabolic processes” (Table 1).
Between the transition phase and the white adipose phase,
the expression of 717 genes changed significantly, of which
378 genes were up-regulated and 339 were down-regulated
(Figure 3C and Additional file 4). These differentially expressed genes are presented in a heatmap in Figure 4B.
A GO enrichment analysis demonstrated that the 378 up-regulated genes were enriched for genes related to “cell
death” and “negative regulation of cell death” (Table 2).
The 339 down-regulated genes were enriched for genes
related to “metabolic process”, including “fatty acid beta-oxidation” (Table 2).
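GO enrichment of a regulated gene set is commonly assessed with a one-sided hypergeometric test. The text does not name the test or tool that was used, so the sketch below is an assumption about the general technique, not a description of the authors' pipeline. The background size (13,963 annotated genes) and the size of the down-regulated set (339) are taken from the text; the number of genes annotated to the term and the number of hits are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background: int, n_annotated: int,
                           n_selected: int, n_hits: int) -> float:
    """P(X >= n_hits) for a hypergeometric draw without replacement.

    n_background: all genes that could have been detected
    n_annotated:  background genes carrying the GO term
    n_selected:   size of the regulated gene set
    n_hits:       regulated genes carrying the GO term
    """
    upper = min(n_annotated, n_selected)
    favourable = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_hits, upper + 1)
    )
    return favourable / comb(n_background, n_selected)


# 13,963 and 339 come from the text; 60 annotated genes and 12 hits are hypothetical.
p_value = hypergeom_enrichment_p(13963, 60, 339, 12)
print(p_value)
```

In practice one such p-value is computed for every GO term, followed by a multiple-testing correction across terms.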
The changes in gene expression related to the GO term
“fatty acid beta-oxidation” were investigated in more detail by RT-qPCR (Figure 5). Of note, the white adipose
phase included samples from days 14, 30 and 60 for RT-qPCR measurements, whereas the white adipose phase
for the global gene expression analysis included samples
from day 14 only (see Figure 3). Two key enzymes in β-oxidation are carnitine palmitoyltransferase 1B (CPT1B)
and the hydroxyacyl-CoA dehydrogenase complex (HADH).
The relative mRNA expression levels of both CPT1B and
the catalytic subunit α of HADH (HADHA) decreased from
the brown adipose phase to the transition phase and from
the transition phase to the white adipose phase (Figure
5A). We also measured expression of two genes involved
in fatty acid synthesis by RT-qPCR: acetyl-CoA carboxylase 1 (ACACA) and fatty acid synthase (FASN). Expression of both tended to increase during the postnatal adipose transformation (Figure 5B), suggesting a higher rate
of fatty acid synthesis in WAT compared to BAT.
Figure 4: Gene expression changes in the two phase shifts.
(A) Heatmap of the 170 genes differentially expressed from the
brown adipose phase to the transition phase. (B) Heatmap of the
717 genes differentially expressed from transition to the white
adipose phase.
Table 2: GO enrichment analysis of the 717 genes differentially expressed from transition to the white adipose
phase.
GO Term
Enrichment from up-regulated genes:
Enzyme linked receptor protein signaling pathway
Negative regulation of cell death
Cell death
Enrichment from down-regulated genes:
Generation of precursor metabolites and energy
Cellular respiration
Mitochondrial ATP synthesis coupled electron transport
Metabolic process
Fatty acid beta-oxidation
Lipid modification
Mitochondrion organization
Figure 5 panels (y-axes): Relative CPT1B, HADHA, ACACA, FASN, DGAT1 and DGAT2 mRNA. ANOVA p-values shown in the panels: < 0.0001 (four panels), 0.271 and 0.1096.
Figure 5: Expression of selected metabolic enzymes related to
fatty acid metabolism. Total RNA was isolated from perirenal adipose tissue and used for RT-qPCR analysis. Relative expression was measured for: (A) carnitine palmitoyltransferase 1b (CPT1B) and hydroxyacyl-CoA dehydrogenase
subunit α (HADHA); (B) acetyl-CoA carboxylase (ACACA) and
fatty acid synthase (FASN); (C) diacylglycerol O-acyltransferase
1 (DGAT1) and DGAT2. The mRNA expression levels were normalized to expression of β-actin (ACTB). Data are mean +SEM
(brown, n = 9; transition, n = 20; white, n = 15); *, p<0.05.
The GO term “metabolic process” included 242 genes, a
number of which have been measured by RT-qPCR, including UCP1, CYC1 and CS. The two isoforms of the
TAG synthesis enzymes diacylglycerol O-acyltransferase
1 (DGAT1) and DGAT2 were also among the regulated
metabolic genes. RT-qPCR measurements confirmed a
decreased expression of DGAT1 and DGAT2 in the white
adipose phase compared to the transition phase (Figure 5C).
Of the 170 genes differentially expressed between the brown
adipose and transition phase and 717 genes differentially
expressed between the transition and the white adipose
phase, 38 genes were in common. A Venn diagram of
the 849 regulated genes is shown in Additional file 5. Fifteen of the 38 common genes were down-regulated at both
phase shifts, whereas 9 genes were up-regulated at both
phase shifts. Among the 15 consistently down-regulated
genes were several mitochondrial genes, e.g. the TCA
cycle enzyme isocitrate dehydrogenase 3α (IDH3A), and
the two transcription factors myeloid leukemia factor 1
(MLF1) and autoimmune regulator (AIRE) (Additional
file 6). Among the consistently up-regulated genes were
two receptors involved in cellular lipid uptake: low density lipoprotein receptor-related protein 1 (LRP1) and macrophage
scavenger receptor 1 (MSR1). The 38 genes also included
9 genes transiently up-regulated and 5 genes transiently
down-regulated during the transition phase (Additional file
6). Among the genes up-regulated during the transition
phase were two enzymes involved in TAG synthesis: the
mitochondrial glycerol-3-phosphate acyltransferase (GPAM)
and 1-acylglycerol-3-phosphate O-acyltransferase 9 (AGPAT9).
Transcriptional components regulated between the three
phases of adipose tissue transformation
Expression of 17 transcription factors was significantly
changing between the brown adipose and the transition
phase, of which 7 were up-regulated (Additional file 7).
Between the transition and the white adipose phase, 74
transcription factors were differentially expressed, with 48
being up-regulated (Additional file 7). Four transcription factors exhibited differential expression at both phase
shifts. Two of the 17 transcription factors differentially expressed between the brown adipose and the transition phase
had consensus putative response elements in an enriched
set of the 170 genes displaying altered expression in the
same phase shift. These were nuclear receptor subfamily
1, group H, member 3 (NR1H3, also called LXRA), and
v-myc avian myelocytomatosis viral oncogene homolog
(MYC). Of the 717 differentially expressed genes between
the transition and the white adipose phase, an enriched
set of genes contained consensus putative response elements for six transcription factors that were themselves
regulated in the same phase shift. These transcription factors are NR1H3, MYC, B-cell lymphoma 6 (BCL6), estrogen receptor 1 (ESR1, also called NR3A1 or ESRA), v-rel
reticuloendotheliosis viral oncogene homolog A (RELA)
and krüppel-like factor 4 (KLF4).
The expression levels of these six transcription factors
in the three phases were validated by RT-qPCR (Figure
6). Expression of NR1H3 and MYC was significantly increased and decreased, respectively, in the transition phase
compared to both the brown and the white adipose phase
(Figure 6). Expression of the three transcription factors
ESR1, RELA and KLF4 was up-regulated between the transition and white adipose phase, whereas BCL6 was significantly down-regulated from the transition to the white
adipose phase (Figure 6).
Figure 6 panels (y-axes): Relative RELA, NR1H3, KLF4, ESR1, BCL6 and MYC mRNA. ANOVA p < 0.0025 for every panel.
Figure 6: Validation of expression patterns of the transcriptional components regulated between the three phases of adipose tissue transformation and having a consensus putative response element in an enriched set of regulated genes. Total RNA
was isolated from perirenal adipose tissue and used for RT-qPCR
analysis. Relative expression was measured for nuclear receptor
subfamily 1, group H, member 3 (NR1H3), v-myc avian myelocytomatosis viral oncogene homolog (MYC), B-cell lymphoma
6 (BCL6), estrogen receptor 1 (ESR1), v-rel reticuloendotheliosis viral oncogene homolog A (RELA) and krüppel-like factor 4
(KLF4). The mRNA expression levels were normalized to expression of β-actin (ACTB). Data are mean +SEM (brown, n =
9; transition, n = 20; white, n = 15); *, p<0.05.
Figure 7A lists genes with differential expression between
the brown adipose and transition phase that have a consensus putative response element for either MYC or NR1H3.
Among the up-regulated genes that were potentially regulated by MYC from the brown adipose to the transition
phase were the adhesion protein thrombospondin 2 (THBS2)
and the Rab1 GTPase activator TBC1 domain family member 20 (TBC1D20). Among the down-regulated genes
from the brown adipose phase to the transition phase potentially regulated by MYC were a 9-cis-retinoic acid synthesizing enzyme, aldehyde dehydrogenase 8 family member A1 (ALDH8A1), and the transcription factor basic helix-loop-helix family member E40 (BHLHE40).
Figure 7B depicts the subcellular distribution of the 288
genes potentially regulated by MYC, ESR1, RELA, BCL6,
KLF4 and NR1H3 between the transition and the white
adipose phase. Gene names corresponding to Figure 7B
are presented in Additional file 8. Forty of the regulated
genes were mitochondrial genes, 37 of which were down-regulated. This is in accordance with the decreased activity and amount of mitochondria in the white adipose
phase (see Figure 2 and Additional file 1). Of note,
half of the down-regulated mitochondrial genes have been
described to be regulated by RELA. Nineteen genes encoded secreted proteins, 11 of which were up-regulated,
including two pro-angiogenic factors, vascular endothelial growth factor B (VEGFB) and angiopoietin-related
protein 2 (ANGPTL2), and one anti-angiogenic factor, serpin peptidase inhibitor F1 (SERPINF1) (Additional file 8).
Figure 7A panel titles: Up-regulated genes; Down-regulated genes.
Figure 7: Genes with altered expression that potentially are
controlled by transcription factors regulated between the three
phases of adipose tissue transformation. (A) List of genes regulated from the brown adipose phase to the transition phase,
which have consensus putative response elements for nuclear receptor subfamily 1, group H, member 3 (NR1H3) and v-myc
avian myelocytomatosis viral oncogene homolog (MYC). (B)
Subcellular localization of differentially expressed genes from
the transition phase to the white adipose phase that have consensus putative response elements for NR1H3, MYC, B-cell lymphoma 6 (BCL6), estrogen receptor 1 (ESR1), krüppel-like factor 4 (KLF4) and v-rel reticuloendotheliosis viral oncogene homolog A (RELA). Red nodes indicate down-regulated genes in
the white adipose phase as compared to the transition phase and
green nodes indicate up-regulated genes in the white adipose
phase as compared to the transition phase. The corresponding
gene names are listed in Additional file 8.
Expression of white, brite/beige and brown adipose markers in the three phases of adipose tissue transformation
To study the three phases in the transformation process in
more detail, we measured a number of brown and white
adipose marker genes by RT-qPCR. As evident from Figure 1, we observed a down-regulation of UCP1 between
the transition and white adipose phase, and a stepwise decrease in expression of DIO2 and PPARGC1A through the
three phases of the transformation (Additional file 9). Expression of two transcription factors promoting white adipogenesis, nuclear receptor-interacting protein 1 (NRIP1, also called RIP140) and retinoblastoma 1 (RB1), was increased between the transition and the white adipose phase (Figure 8A). The typical white adipose marker gene leptin (LEP) displayed decreased expression in the transition phase compared to both the brown and white adipose phases (Figure 8A). Expression of the key transcriptional driver of brown adipogenesis, PR domain containing 16 (PRDM16), was not changed significantly between
the phases (Figure 8B). Overall, these measurements supported the brown to white adipose transformation occurring in the sheep perirenal adipose tissue within the first
two weeks after birth.
To address whether the sheep perirenal adipose tissue qualified as being brown, brite/beige or a mixture of brown and brite/beige at birth, and whether this status of the tissue changed over time, we measured a number of recently proposed marker genes selectively expressed in brown versus brite/beige adipose tissue and adipocytes [44, 46, 53, 56, 57].
We determined the expression of the classical brown adipose marker genes solute carrier family 29 member 1 (SLC29A1), LIM homeobox 8 (LHX8), myelin protein zero-like 2 (MPZL2, also called EVA1) and zinc finger protein 1 (ZIC1). SLC29A1 was expressed at birth and showed a stepwise down-regulation through the two adipose phase shifts (Figure 8B). MPZL2 expression did not change significantly. In contrast, LHX8 mRNA increased steadily through the three phases (Figure 8B). ZIC1 was not detected in any of the perirenal adipose samples, but was easily detected in sheep brain (data not shown). We also measured the expression of the three brite/beige marker genes homeobox C8 (HOXC8), HOXC9 and tumor necrosis factor receptor superfamily member 9 (TNFRSF9, also called CD137). The expression of all three genes increased from the transition phase to the white adipose phase (Figure 8C). Of note, HOXC8 and HOXC9 are marker genes for both WAT and brite/beige adipose tissue [53, 57]. However, expression of the brite/beige marker genes transmembrane protein 26 (TMEM26) and T-box protein 1 (TBX1) did not change (Figure 8C).
In summary, three out of four markers of classical BAT were detectable, and two out of these three changed expression over time. All five measured brite/beige markers were detectable; the expression of three increased and two remained unchanged from the transition to the white adipose phase. Thus, markers of both brown and brite/beige adipose tissue were expressed in perirenal adipose tissue from sheep, and most of these markers displayed altered expression over time.
Figure 8: Expression of genetic markers for brown and brite/beige adipose tissue during the three phases of the postnatal perirenal adipose tissue transformation. Gene expression was determined by RT-qPCR. (A) Relative levels of genes associated with white adipocytes: nuclear receptor-interacting protein 1 (NRIP1), retinoblastoma 1 (RB1) and leptin (LEP). (B) Relative levels of the brown adipose associated and marker genes: PR domain containing 16 (PRDM16), solute carrier family 29 member 1 (SLC29A1), LIM homeobox 8 (LHX8) and myelin protein zero-like 2 (MPZL2). Zinc finger protein 1 (ZIC1) was not detectable in any of the adipose samples. (C) Relative levels of the brite/beige adipose markers: homeobox C8 (HOXC8), HOXC9, tumor necrosis factor receptor superfamily member 9 (TNFRSF9), transmembrane protein 26 (TMEM26) and T-box 1 (TBX1). The mRNA expression levels were normalized to expression of β-actin (ACTB). Data are mean + SEM (brown, n = 9; transition, n = 20; white, n = 15); *, p < 0.05.
Plasticity of adipose tissues is important for adaptation to
changing physiological conditions [48]. In response to
prolonged cold exposure, subcutaneous WAT depots of
rodents undergo a transformation process during which
numerous brite/beige adipocytes appear, thereby increasing the overall thermogenic capacity of the animal [23, 28, 48].
In large mammals, a substantial part of the BAT present in
the newborn converts to WAT after birth, which may reflect that the need for endogenous thermogenesis drops
after the early postnatal period. As little molecular insight into this conversion in large mammals is available,
we have in the present study conducted a detailed analysis of the postnatal transformation of perirenal adipose
tissue in sheep. We chose this particular tissue, as it is
the most frequently studied BAT depot in large mammals.
Moreover, we reasoned that the transformation of this depot would be a suitable model of the postnatal brown to white adipose transformation in humans. The postnatal transformation process from BAT to WAT in perirenal adipose
tissue occurred within the first two weeks after birth as
determined by changes in tissue morphology, gene expression and mitochondrial density. Adipocyte morphology changed from being mainly multilocular to unilocular and the amount of mitochondria decreased. The expression of brown adipocyte-selective genes, e.g. UCP1,
DIO2 and PPARGC1A, declined, as did expression of additional genes encoding mitochondrial proteins. To understand the adipose transformation in more detail, we
performed a global gene expression analysis with seven
time points ranging from approx. two days before birth to
two weeks after birth. By two independent analyses of the
gene expression data, we determined that the transformation clustered into three phases: a brown adipose phase, a
transition phase and a white adipose phase.
Regulated transcription factors
Between the brown adipose and the transition phase, 170 genes were differentially expressed, including 17 transcription factors, 10 of which were down-regulated (Additional files 3 and 7). Five of these have chromatin-modifying activity: circadian locomotor output cycles kaput (CLOCK), nuclear receptor co-activator 1 (NCOA1, also called SRC1), proviral insertion site in Moloney murine leukemia virus lymphomagenesis (PIM1) and the SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily members SMARCC2 and SMARCD3. This leaves open
the possibility that extensive remodeling of chromatin is
occurring between the brown adipose and the transition
phase. In accordance with its down-regulation in the perirenal adipose transformation (Additional file 7), NCOA1 has
been reported to promote BAT activity in mice [39]. Between the transition and the white adipose phase, 717 genes were differentially expressed, of which 74 were transcription factors (Additional files 4 and 7). The list of regulated transcription factors included a few transcription factors known to be differentially expressed in mouse WAT
and BAT, e.g. NRIP1 and cell death-inducing DFFA-like
effector a (CIDEA). Among the down-regulated transcription factors from the transition to the white phase were some related to brown adipocyte function: early B-cell factor 2 (EBF2), which has been described to induce expression of brown adipose-specific PPARG target genes [42], leucine-rich PPR motif-containing protein (LRPPRC), a PPARGC1A co-activator playing an important role in BAT differentiation and function [15], and Y box binding protein 1 (YBX1), an inducer of bone morphogenetic protein 7 (BMP7) transcription and brown adipocyte differentiation [38].
Regulated transcription factors with a consensus putative response element in an enriched set of differentially expressed genes
Expression of four transcription factors, NR1H3, MYC, AIRE and MLF1, was regulated at both phase shifts. The former two have a consensus putative response element in an enriched set of genes displaying altered expression at the two phase shifts. Expression of NR1H3 and MYC was transiently increased and decreased, respectively, during the transition phase (Figure 6). Consistent with the opposite regulation during the transition phase, NR1H3 has been reported to suppress MYC expression in colon cancer cells [52] (Additional file 10). MYC is known to inhibit adipogenesis [18, 24], which might explain why it is down-regulated in the transition phase, where the tissue expands. NR1H3 has been described to regulate gene expression linked to several important aspects of both brown and white adipocyte biology, including adipogenesis, energy expenditure, lipolysis and glucose transport [29]. NR1H3 was reported to be present at higher levels in mouse BAT than WAT [49] and to suppress PPARγ-induced UCP1 expression by binding to the UCP1 enhancer together with NRIP1 in mouse adipocytes [54]. Accordingly, NR1H3 can regulate the expression of brown adipocyte-selective genes. However, the increased NR1H3 expression in the transition phase did not correlate with decreased UCP1 expression in our study, which might be explained by the relatively low expression of NRIP1 in the transition phase (Figure 8).
Besides NR1H3 and MYC, four other transcription factors, RELA, KLF4, ESR1 and BCL6, were regulated between the transition and the white adipose phase and found to have consensus putative response elements in an enriched number of genes regulated between these two phases. The former three were up-regulated from the transition to the white adipose phase, whereas BCL6 was down-regulated (Figure 6). Both RELA and ESR1 have been described to stimulate MYC expression [27, 43], which would be consistent with the increased expression of MYC in the white adipose phase (Figure 6 and Additional file 10). Of interest, RELA and ESR1 have opposite effects on adipogenesis, as both knockdown of RELA and activation of ESR1 by estrogen supplementation attenuated adipogenesis [30, 50]. WAT from mice with an adipocyte-specific knockout of RELA has decreased lipid droplet size, increased glucose uptake and reduced expression of adipogenic marker genes such as LEP, PPARG and adiponectin (ADIPOQ). Furthermore, the mice are obesity resistant and less sensitive to cold [50] (Additional file 10). ESR1 knockout mice have increased fat mass caused by adipocyte hyperplasia [25]. In addition, ESR1 can inhibit the transcriptional activity of RELA [19] (Additional file 10). Based on this, we speculate that RELA and ESR1 might contribute to the regulation of TAG accretion during the adipose transformation. RELA might also contribute to the mitochondrial depletion observed, as RELA negatively impacts mitochondrial content in C2C12 myocytes [8]. RELA has putative response elements in the promoter of 20 genes encoding mitochondrial proteins regulated between the transition and the white adipose phase, of which 17 were down-regulated. This is in accordance with RELA functioning both as a transcriptional activator and repressor [10]. RELA has been described to repress BCL6 expression through interferon regulatory factor 4 [45]. This is in accordance with the increased expression of RELA and the decreased expression of BCL6 in the white adipose phase (Figure 6). BCL6 is a transcriptional repressor with the ability to reduce the expression of e.g. MYC [37] (Additional file 10). The expression of BCL6 is strongly down-regulated by growth hormone in 3T3-F442A adipocytes [14]. Apart from this, BCL6 has not been linked to adipocyte or adipose tissue function. KLF4 and RELA have been reported to be functionally intertwined, as they directly interact to induce expression of selected genes [16], but also compete for interaction with a co-activator, thereby inhibiting each other's activity [4] (Additional file 10). KLF4 is important for induction of adipogenesis in vitro and its expression was reported to be induced in pre-adipocytes by cAMP. KLF4 stimulates CCAAT/enhancer-binding protein β (C/EBPB) expression, and C/EBPB in turn down-regulates KLF4 expression, thereby forming a negative feedback loop [9].
In summary, six transcription factors with differential expression during the adipose transformation have consensus putative response elements in an enriched set of the regulated genes, suggesting that they are involved in the control of the overall gene expression changes and thus potentially have an impact on remodeling of the tissue. Moreover, the six factors are mutually functionally linked, leaving open the possibility that they are part of a transcriptional network (Additional file 10).
Brown and brite/beige markers
A number of marker genes selectively expressed in white, brite/beige and brown adipose tissue and adipocytes have been reported [44, 46, 53, 56, 57]. It is being discussed whether human BAT is composed of brown or brite/beige adipocytes or a mixture of these. Moreover, it is not fully established to what extent expression of white, brite/beige and brown adipose marker genes changes in a particular adipose depot during development or remodeling. To elucidate this in sheep, we analyzed the expression of selected marker genes in the brown adipose, transition and white adipose phase of the perirenal adipose depot. Although we only measured marker gene expression in the
perirenal depot, and thus have not compared expression
levels to those in other adipose depots, we could detect
both brown (e.g. LHX8) and brite/beige (e.g. TNFRSF9) adipocyte marker genes in the newborn sheep (Figure 8). Co-expression of marker genes selective for brown and brite/beige adipose tissues has also been observed in a recent study of the human supraclavicular brown adipose tissue [26]. In this human study, expression of UCP1 in supraclavicular biopsies was positively associated with expression of both BAT markers (e.g. ZIC1 and LHX8) and brite/beige adipose markers (e.g. TBX1 and TMEM26). In contrast, expression of the two WAT and brite/beige adipose markers HOXC8 and HOXC9 correlated with low
UCP1 expression [26]. In our time course study, we did
not observe a correlation between high UCP1 expression
and high expression of ZIC1 (which was not detected),
LHX8, TBX1 or TMEM26, but we did detect higher expression of HOXC8 and HOXC9 in the white adipose state,
where UCP1 expression was low (Figure 8 and Additional
file 9). A similar HOXC9 profile in sheep perirenal adipose tissue has been reported by others [40].
BAT, brite/beige adipose and WAT marker gene expression has not previously been studied in detail during adipose tissue remodeling in large mammals. Based on the selective expression profile in mice, our observation that HOXC8 and HOXC9 are induced in the white adipose phase suggests that the perirenal adipose depot converts from brown (not brite/beige) to white (Figure 8). The brown adipose origin of the perirenal depot might be supported by the down-regulation of the BAT marker gene SLC29A1 during whitening. In contrast, the 5-fold increase in expression of LHX8 from the brown to the white adipose phase and the absence of ZIC1 expression were not consistent with this model. In addition to HOXC8 and HOXC9, a number of other brite/beige marker genes were expressed shortly after birth (Figure 8). Of note, the expression of these was either unchanged or up-regulated during whitening. In summary, markers of both brite/beige and brown adipose tissue are expressed in the sheep perirenal adipose tissue at birth and the expression of a number of these changes substantially over time. The latter might be important to consider when analyzing adipose tissue type-selective gene expression.
Model of the transformation process
The postnatal transformation from brown (or brite/beige)
to white in the perirenal adipose tissue can occur through
at least three different mechanisms: 1) through transdifferentiation of brown (or brite/beige) to white adipocytes;
2) through proliferation and differentiation of white adipogenic precursor cells and death of mature brown (or
brite/beige) adipocytes; 3) through a combination of the
two. In mice, brite/beige adipocytes can transdifferentiate into white adipocytes [44], but whether this observation extends to large mammals is unclear. If the white
adipocytes arise exclusively from proliferation and differentiation of precursor cells, it would require an enormous
cell turnover during the transition phase of the transformation, including extensive death of mature brown (or
brite/beige) adipocytes. We would expect this to result in
induction of cell cycle genes, but we did not observe this
in the GO term analysis. Moreover, the expression profile of key cyclins (CCNA, CCNB, CCND and CCNE)
was not consistent with massive cell cycling during the
transition phase (Additional file 2). In the transition to
white adipose phase shift, we did observe an up-regulation
of genes related to cell death, but nearly half of the up-regulated cell death-associated genes are negative regulators of cell death (Table 2). Thus, it is not obvious from
the gene expression data whether cell death is increased or not.
Of note, we did not observe evidence for significant cell
death in any of the tissue sections analyzed. Consistently,
Lomax et al. [33] failed to detect apoptotic cells during the
transformation of the perirenal adipose tissue in sheep.
Based on this, we find it plausible that transdifferentiation of brown (or brite/beige) to white adipocytes is a significant component of the postnatal transformation of the
perirenal adipose depot in sheep. Clearly, additional studies are required to validate this, including a time course
with more time points and a dedicated search for evidence
of cell proliferation and death.
By global gene expression profiling, we provide novel information on the postnatal BAT to WAT transformation
in sheep. This transformation process is poorly understood in molecular terms, but is of significant interest, as a
similar transformation occurs in human infants after birth.
An improved understanding of this tissue remodeling increases insights into adipose plasticity and might allow
identification of targets suitable for interfering with the
balance between energy-storing and energy-dissipating adipose tissue. Our results reveal novel transcription factors linked to the adipose transformation process in large
mammals. Clearly, validation of their actual relevance in
adipose function will require dedicated functional studies.
Finally, we show that expression of adipose tissue-type selective marker genes changes substantially over time, which
might be an underappreciated variable in such analyses.
Material and methods
Animals and tissues
Experimental procedures were in compliance with guidelines laid down by the Danish Inspectorate of Animal Experimentation. Lambs from cross-bred ewes (Texel x Gotland) in their second or third parturition sired by a purebred
Texel ram, born and raised at a commercial farm in Denmark, were used. During gestation ewes were fed hay ad
libitum, 200 g barley and 200 g commercial concentrate
per day. Ewes were housed in groups of 40 until lambing.
After lambing they were housed individually for 2 days
and subsequently housed in groups of 20 until they were
transferred to pasture approx. one week after lambing.
The ewe-reared lambs were kept on pasture consisting of a mixture of 70 % ryegrass and 30 % white clover. Lambs were
killed by bolt pistol and bled by licensed staff. Perirenal
adipose tissue was carefully dissected and frozen in liquid
nitrogen for biochemical or molecular analyses or fixed
for histology as described below. Lambs at the following
ages (day relative to the time of birth) were used: -2 (n =
4), 0 (n = 5), 0.5 (n = 5), 1 (n = 5), 2 (n = 5), 4 (n = 5),
14 (n = 5), 30 (n = 5) and 60 (n = 5). Live weights of the
lambs were kept similar within groups.
Hematoxylin-eosin (HE) staining
Samples were fixed in 4 % neutral buffered formaldehyde (pH 7.4) at room temperature for 24 h and subsequently at 4 °C until preparation. The tissue was processed to paraffin and sectioned in 4 µm sections. HE staining was performed according to standard procedures.
Transmission electron microscopy (TEM)
Samples were fixed in Karnovsky's fixative (2 % paraformaldehyde and 2.5 % glutaraldehyde in 0.08 M cacodylate buffer, pH 7.4) for 3-5 days at room temperature and subsequently stored in 0.08 M cacodylate buffer at 4 °C until further processing. The samples were rinsed three times in 0.15 M Sørensen's phosphate buffer (pH 7.4) and subsequently postfixed in 1 % OsO4 in 0.12 M sodium cacodylate buffer (pH 7.4) for 2 h. The specimens were dehydrated in graded series of ethanol, transferred to propylene oxide and embedded in Epon according to standard procedures. Ultrathin sections were cut with a Reichert-Jung Ultracut E microtome and collected on single slot copper grids with Formvar supporting membranes. Sections were stained with uranyl acetate and lead citrate and examined with a Philips CM-100 transmission electron microscope, operated at an accelerating voltage of 80 kV. Digital images were recorded with a SIS MegaView2 camera and the analySIS software package.
Reverse transcription-quantitative polymerase chain reaction (RT-qPCR)
Tissues were homogenized in TRIzol (Life Technologies) using a Dispomix (Xiril) and total RNA was purified. Reverse transcriptions were performed in 25 µl reactions containing 1st Strand Buffer (Life Technologies), 2 µg random hexamers (Bioline), 0.9 mM of each dNTP (Sigma-Aldrich), 20 units of RNaseOUT (Life Technologies), 1 µg of total RQ1 DNase (Promega)-treated RNA and 200 units of Moloney murine leukemia virus reverse transcriptase (Life Technologies). Reactions were left for 10 min at room temperature, followed by incubation at 37 °C for 1 h. After cDNA synthesis, reactions were diluted with 50 µl of water and frozen at -80 °C. The cDNA was analyzed by RT-qPCR using the Stratagene Mx3000P QPCR System. Each PCR mixture contained, in a final volume of 20 µl, 1.5 µl of 1st strand cDNA, 10 µl of SensiFAST™ SYBR Lo-ROX Kit (Bioline) and 2 pmol of each primer (Additional file 11). All reactions were run using the following cycling conditions: 95 °C for 10 min, then 40 cycles of 95 °C for 15 s, 55 °C for 30 s and 72 °C for 15 s. PCR was carried out in 96-well plates and each sample was run in duplicate. Target gene mRNA expression was normalized to expression of β-actin (ACTB) mRNA.
Protein extracts and immunoblotting
Tissues were homogenized in a GG-buffer (pH 7.5) containing 25 mM glycyl-glycin, 150 mM KCl, 5 mM MgSO4 and 5 mM ethylenediaminetetraacetic acid (EDTA) as well as freshly added dithiothreitol (1 mM), bovine serum albumin (0.02 %) and Triton X-100 (0.1 %). Homogenization was performed with a TissueLyser (QIAGEN) using 5 mm stainless steel beads, and homogenates were subsequently frozen in liquid nitrogen. Protein concentrations were determined by the Lowry method [34] and equal amounts of protein from each animal were pooled according to age and diluted in a buffer containing 2.5 % SDS and 10 % glycerol. Proteins were separated on 4-12 % Bis-Tris gradient gels (NuPAGE, Life Technologies), blotted onto Immobilon PVDF membranes (Millipore) and stained with Amido Black 10B (Sigma-Aldrich). Membranes were blocked in Tris-buffered saline (pH 7.4) or phosphate-buffered saline (pH 9.0) with 5 % nonfat dry milk and 0.1 % Tween 20 (Sigma-Aldrich) and then probed with antibodies. Primary antibodies used were against transcription factor IIB (TFIIB) (sc-225) (Santa Cruz Biotechnology), ATP synthase β (ATP5B) (ab14730) (Abcam) and UCP1 (ab10983) (Abcam). Secondary antibodies were horseradish peroxidase-conjugated (Dako). Enhanced chemiluminescence (Biological Industries) was used for detection.
Quantification of relative mitochondrial DNA (mtDNA) copy numbers
Relative mtDNA amount (copy number) was measured as the ratio between mtDNA and nuclear DNA (nDNA). Tissues were homogenized using a TissueLyser (QIAGEN) in lysis buffer containing 100 mM Tris-base (pH 8.0), 5 mM EDTA (pH 8.0), 0.2 % sodium dodecyl sulphate, 200 mM NaCl and 100 mg/ml proteinase K and incubated overnight at 55 °C with rotation. DNA was precipitated with two volumes of 99 % ethanol and isolated with inoculation loops, washed in 70 % ethanol and dissolved in Tris-EDTA buffer containing 10 mg/ml RNase A at 55 °C overnight. DNA concentrations were determined on
the Eppendorf BioPhotometer at 260 nm and 50 ng DNA
was used for qPCR. PCR reactions and cycling conditions
were as described above, and primers were against cytochrome c oxidase I (MT-CO1) (mtDNA) and suppression of tumorigenicity 7 (ST7) (nDNA) (Additional file
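As an illustration of the ratio described above, the relative mtDNA amount can be derived from qPCR threshold cycles (Ct). The following Python sketch is not the calculation reported here. It assumes a comparative Ct approach with equal amplification efficiency for the MT-CO1 and ST7 primer pairs; the function name, the default efficiency of 2 and the example Ct values are hypothetical.

```python
from statistics import mean


def relative_mtdna(ct_mtdna, ct_ndna, efficiency=2.0):
    """Relative mtDNA amount as the ratio of mtDNA to nDNA template.

    ct_mtdna, ct_ndna: replicate Ct values for the mtDNA and nDNA amplicons.
    efficiency: assumed fold amplification per cycle (2.0 = perfect doubling).
    """
    # A lower Ct means more starting template, so the ratio is
    # efficiency ** (Ct_nDNA - Ct_mtDNA), using the mean of the replicates.
    return efficiency ** (mean(ct_ndna) - mean(ct_mtdna))


# Hypothetical duplicate measurements for one sample
ratio = relative_mtdna(ct_mtdna=[15.0, 15.0], ct_ndna=[25.0, 25.0])
print(ratio)
```

If the two primer pairs amplify with different efficiencies, a single shared `efficiency` value no longer applies and a standard-curve based quantification would be the safer choice.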
Citrate synthase (CS) activity
Tissue homogenates (10 %) were generated in GG-buffer
(pH 7.5) as described above. Homogenates were thawed
on ice and centrifuged at 4 °C at 20,000 g for 2 min. Supernatants were used for activity measurements. CS activity was measured spectrophotometrically at 25 °C and 412 nm in CS buffer containing 100 mM Tris-base (pH 8.0), 10 mM 5,5′-dithiobis(2-nitrobenzoic acid), 5 mM
acetyl-CoA and 50 mM oxaloacetic acid and activity was
measured as described [47]. Each sample was measured
in duplicate and the mean was used for subsequent calculations. Activities were normalized to the amount of total
protein determined by the Lowry method [34].
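The conversion from an absorbance change at 412 nm to a protein-normalized activity can be sketched as follows. This is an illustrative Python example and not the procedure of reference [47]: it assumes a Beer-Lambert conversion with an extinction coefficient of 13.6 mM⁻¹ cm⁻¹ for the 412 nm reaction product and a 1 cm light path, and all names and example numbers are hypothetical.

```python
from statistics import mean


def cs_specific_activity(delta_a412_per_min, assay_volume_ml, sample_volume_ml,
                         protein_mg_per_ml, extinction_mm=13.6, path_cm=1.0):
    """Return CS activity in µmol/min per mg protein (assumed constants)."""
    # Beer-Lambert: product formation rate in mM/min, i.e. µmol/ml/min
    rate_umol_per_ml_min = delta_a412_per_min / (extinction_mm * path_cm)
    # Total product formed per minute in the assay volume
    umol_per_min = rate_umol_per_ml_min * assay_volume_ml
    # Amount of protein added with the supernatant
    protein_mg = protein_mg_per_ml * sample_volume_ml
    return umol_per_min / protein_mg


# Hypothetical duplicate slopes (absorbance units per minute); the mean is used
slope = mean([0.130, 0.142])
activity = cs_specific_activity(slope, assay_volume_ml=1.0,
                                sample_volume_ml=0.01, protein_mg_per_ml=2.0)
print(round(activity, 3))
```

Averaging the duplicate slopes before the conversion mirrors the statement that the mean of duplicates was used for subsequent calculations.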
Statistical analyses of qPCR data
The time course study was analyzed for statistical significance using one-way ANOVA and Student's t-test with
Bonferroni correction for multiple testing as post hoc test.
A p-value < 0.05 was considered statistically significant.
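The two steps of this analysis can be illustrated with a minimal Python sketch. It computes the one-way ANOVA F statistic from first principles and applies a Bonferroni adjustment to a set of post hoc p-values. The p-values themselves would come from a statistics package (they require the t distribution), and the function names and example data below are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical relative mRNA levels in the three phases
brown, transition, white = [1, 2, 3], [2, 3, 4], [5, 6, 7]
f_stat = one_way_anova_f([brown, transition, white])

# Hypothetical unadjusted p-values from the three pairwise t-tests
adjusted = bonferroni([0.01, 0.02, 0.5])
significant = [p < 0.05 for p in adjusted]
print(f_stat, adjusted, significant)
```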
Targeted RNA-sequencing and data analysis
Isolation of mRNA and synthesis of first strand cDNA:
Equal amounts of total RNA from perirenal adipose tissue
from lambs at the same age (days -2, 0, 0.5, 1, 2, 4, 14)
were pooled. mRNA was isolated from 4 µg of total RNA
by magnetic oligo(dT) beads, which was used to synthesize bead-bound cDNA, according to the instructions of
the manufacturer (Illumina).
Tag library construction: The library for digital gene expression analysis was constructed according to the instructions of the manufacturer (Illumina). Bead-bound cDNA
was digested with NlaIII, followed by ligation of the GEX
adapter 1 to the bead-bound NlaIII-digested cDNA. This
was then digested with MmeI, releasing the GEX adapter 1 linked to 17 bp cDNA from the beads. The released
fragment was ligated to GEX adapter 2. The 17 bp tags of
cDNA were PCR amplified using two primers that anneal
to the two adapters. The resultant tag library was used for
Illumina sequencing.
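The tag that such a library is expected to yield for a given transcript can be sketched in silico. The Python example below is illustrative only: it assumes that the bead-bound cDNA is anchored at its 3′ end by the oligo(dT) beads, so that the 3′-most NlaIII site (CATG) defines the retained fragment, and that the tag is the 17 bp immediately downstream of that site. The function name and the example sequence are hypothetical.

```python
def extract_dge_tag(cdna, site="CATG", tag_len=17):
    """Return the tag downstream of the 3'-most NlaIII site, or None."""
    seq = cdna.upper()
    # The 3'-most site is the last occurrence in the 5'->3' sequence
    pos = seq.rfind(site)
    if pos == -1:
        return None  # transcript without an NlaIII site yields no tag
    start = pos + len(site)
    tag = seq[start:start + tag_len]
    # Discard tags that run off the end of the sequence
    return tag if len(tag) == tag_len else None


# Hypothetical cDNA with two CATG sites; only the 3'-most one is used
cdna = "TTCATGAAAA" + "CATG" + "ACGTACGTACGTACGTA" + "AAAAAA"
print(extract_dge_tag(cdna))
```

Because every tag is preceded by CATG in the transcript, restoring these four bases in front of a sequenced tag yields a 21-base sequence for mapping.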
Data analysis for RNA-seq data: Quality control, trimming and adapter removal were done using FastQC [2] and fastx_clipper from the FASTX-Toolkit [1]. The 4 bases CATG were added to the 5′ end of the reads to increase the specificity of mapping. BWA [31] was employed for the alignment and mapping of the reads to the Sheep genome v3.1. Mapped reads were sorted and indexed with SAMtools [32]. HTSeq [6] was used for counting mapped
reads per annotated gene. DESeq [7] and R [51] were
used for the post processing and statistical analysis of the
count data from the mapped reads.
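The step of adding CATG to the 5′ end of each read can be sketched in Python as below. This is a minimal illustration, not the script used in the study: it assumes uncompressed four-line FASTQ records and pads the quality string with a placeholder character ("I", an assumption) so that sequence and quality lengths stay equal.

```python
def prepend_site(fastq_lines, site="CATG", qual_char="I"):
    """Yield FASTQ lines with `site` added to the 5' end of every read."""
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 1:
            # Sequence line: restore the NlaIII site in front of the tag
            yield site + line
        elif i % 4 == 3:
            # Quality line: pad so it matches the new sequence length
            yield qual_char * len(site) + line
        else:
            # Header and separator lines are passed through unchanged
            yield line


# Hypothetical single-read example with a 17-base tag
record = ["@read1", "ACGTACGTACGTACGTA", "+", "F" * 17]
for out_line in prepend_site(record):
    print(out_line)
```

The generator works on any iterable of lines, so it can be applied directly to an open file handle without loading the whole file into memory.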
Principal component analysis (PCA) and hierarchical clustering of time points: A two-dimensional PCA plot was
employed to visualize the overall effect of experimental
co-variates. Hierarchical clustering of the total gene expression was performed using a distance matrix to assess
the relationship between the samples and identify clusters
amongst the time points.
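The distance matrix underlying such a clustering can be illustrated with a short Python sketch. The distance metric and linkage used in the study are not stated, so the example assumes Euclidean distances on log2-transformed counts and only reports the closest pair of samples, which is the first merge of an agglomerative clustering. The sample names and counts are hypothetical.

```python
import math
from itertools import combinations


def sample_distances(counts):
    """Pairwise Euclidean distances between samples on log2(count + 1)."""
    logged = {s: [math.log2(c + 1) for c in v] for s, v in counts.items()}
    return {(a, b): math.dist(logged[a], logged[b])
            for a, b in combinations(sorted(logged), 2)}


def closest_pair(distances):
    """Return the pair of samples that would be merged first."""
    return min(distances, key=distances.get)


# Hypothetical read counts for three genes at three time points
counts = {
    "day-2": [100, 10, 0],
    "day0": [110, 12, 0],
    "day14": [5, 200, 50],
}
distances = sample_distances(counts)
print(closest_pair(distances))
```

In this toy example the two perinatal samples are closest to each other, which is the kind of relationship that motivates treating neighbouring time points as replicates of one phase.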
Grouping of time points: Based on the PCA and hierarchical clustering of the total gene expression, days -2 and 0 were used as replicates of the "brown adipose state", days 0.5, 1, 2 and 4 as replicates of the "transition state" and day 14 as the "white adipose state". Using the DESeq package of Bioconductor [22], differentially expressed genes were found between the brown adipose and the transition state as well as between the transition and the white adipose state. Heatmaps representing clustering for the differentially expressed genes were created using the ggplots [55] package in R. The sheep proteins were queried against the human non-redundant protein database using BLAST [5] to find human homologous genes for further functional analysis. Reciprocal BLAST, a computational method used to cross-check the BLAST results, was employed to filter the mapping between the sheep and human proteins. The BioMart tool on Ensembl version 72 [17] was used for gene identifier conversion, including obtaining Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) approved gene names for human proteins. UniProt [3] was used to annotate the proteins for function, transcriptional activity and subcellular localization. Enrichr [13] was used to find transcription factors and enrichment of targets for the differentially expressed transcription factors from TRANSFAC [35] and JASPAR [41] in the two sets of
differentially expressed genes.
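The reciprocal BLAST filter mentioned above can be sketched in Python as a reciprocal best hit selection. This is an assumption about how the filtering could be done, not the exact procedure of the study: best hits are chosen by bit score, and the input is a list of (query, subject, bit score) tuples such as could be parsed from tabular BLAST output. All identifiers below are hypothetical.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Keep sheep-human pairs that are each other's best hit."""
    forward = best_hits(forward_rows)   # sheep protein -> human protein
    reverse = best_hits(reverse_rows)   # human protein -> sheep protein
    return {q: s for q, s in forward.items() if reverse.get(s) == q}


# Hypothetical hits: sheep queried against human, and human against sheep
sheep_vs_human = [("sheepA", "humanX", 250.0), ("sheepA", "humanY", 90.0),
                  ("sheepB", "humanX", 120.0)]
human_vs_sheep = [("humanX", "sheepA", 245.0), ("humanY", "sheepB", 80.0)]
pairs = reciprocal_best_hits(sheep_vs_human, human_vs_sheep)
print(pairs)
```

Here sheepB is dropped because its best human hit points back to a different sheep protein, which is the kind of ambiguous mapping the reciprocal check is meant to remove.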
ACACA, acetyl-CoA carboxylase 1; AGPAT9, 1-acylglycerol-3-phosphate O-acyltransferase 9; AIRE, autoimmune regulator; ALDH8A1, aldehyde dehydrogenase 8 family member A1; ALDOA, glycolytic enzyme aldolase A; ANGPTL2, angiopoietin-related protein 2; ATP5B, ATP synthase β; BAT, brown adipose tissue; BCL6, B-cell lymphoma 6; BHLHE40, basic helix-loop-helix family member E40; BMP7, bone morphogenetic protein 7; C/EBPB, CCAAT/enhancer-binding protein β; CAT, catalase; CCN, cyclin; CD36, cluster of differentiation 36; CIDEA, cell death-inducing DFFA-like effector a; CLOCK, circadian locomotor output cycles kaput; CPT1B, carnitine palmitoyltransferase 1B; CS, citrate synthase; CYC1, cytochrome c1; DGAT, diacylglycerol O-acyltransferase; DIO2, type II iodothyronine deiodinase; EBF2, early B-cell factor 2; EDTA, ethylenediaminetetraacetic acid; EPAS1, endothelial PAS domain-containing protein 1; ERRA, estrogen-related receptor α; ESR1, estrogen receptor 1; FABP4, fatty acid-binding protein 4; FASN, fatty acid synthase; GPAM, mitochondrial glycerol-3-phosphate acyltransferase; HADH, hydroxyacyl-CoA dehydrogenase complex; HADHA, HADH catalytic subunit α; HGNC, HUGO Gene Nomenclature Committee; HE, hematoxylin-eosin; HOXC8, homeobox C8; HUGO, Human Genome Organization; IDH3A, isocitrate dehydrogenase 3α; KLF4, krüppel-like factor 4; LEP, leptin; LHX8, LIM homeobox 8; LRP1, low density lipoprotein receptor-related protein 1; LRPPRC, leucine-rich PPR motif-containing protein; MLF1, myeloid leukemia factor 1; MPZL2, myelin protein zero-like 2; MSR1, macrophage scavenger receptor 1; MT-CO1, mitochondrially encoded cytochrome c oxidase I; mtDNA, mitochondrial DNA; MYC, v-myc avian myelocytomatosis viral oncogene homolog; NCOA1, nuclear receptor co-activator 1; nDNA, nuclear DNA; NRF1, nuclear respiratory factor 1; NRIP1, nuclear receptor-interacting protein 1; PCA, principal component analysis; PIM1, proviral insertion site in Moloney murine leukemia virus lymphomagenesis; PPARG, peroxisome proliferator-activated receptor γ; PPARGC1A, PPARG co-activator 1α; PRDM16, PR domain containing 16; RB1, retinoblastoma 1; RELA, v-rel reticuloendotheliosis viral oncogene homolog A; RGCC, regulator of cell cycle; SERPINF1, serpin peptidase inhibitor F1; SLC29A1, solute carrier family 29 member 1; SMARCC2, SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily member C2; ST7, suppression of tumorigenicity 7; TAG, triacylglycerol; TALDO1, transaldolase; TBX1, T-box protein 1; TCA, tricarboxylic acid; TEM, transmission electron microscopy; TFIIB, transcription factor IIB; TMEM26, transmembrane protein 26; TNFRSF9, tumor necrosis factor receptor superfamily member 9; UCP1, uncoupling protein 1; VEGFB, vascular endothelial growth factor B; WAT, white adipose tissue; YBX1, Y box binding protein 1; ZIC1, zinc finger protein 1.
[1] “FASTXToolkit []”.
[2] “”.
[3] “Update on activities at the Universal Protein Resource
(UniProt) in 2013”,
Nucleic Acids Res, Vol. 41,
No. Database issue, pp. D43–7, 2013.
[4] K. L. Allen, A. Hamik, M. K. Jain, and K. R. McCrae,
“Endothelial cell activation by antiphospholipid antibodies is modulated by Kruppel-like transcription factors”,
Blood, Vol. 117, No. 23, pp. 6383–91, 2011.
[5] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J.
Lipman, “Basic local alignment search tool”, J Mol Biol,
Vol. 215, No. 3, pp. 403–10, 1990.
[6] S. Anders, “HTSeq: Analysing high-throughput sequencing data with Python”.
[7] S. Anders and W. Huber, “Differential expression analysis
for sequence count data”, Genome Biol, Vol. 11, No. 10,
p. R106, 2010.
[8] N. Bakkar, J. Wang, K. J. Ladner, H. Wang, J. M.
Dahlman, M. Carathers, S. Acharyya, M. A. Rudnicki,
A. D. Hollenbach, and D. C. Guttridge, “IKK/NF-kappaB
regulates skeletal myogenesis via a signaling switch to inhibit differentiation and promote mitochondrial biogenesis”, J Cell Biol, Vol. 180, No. 4, pp. 787–802, 2008.
[9] K. Birsoy, Z. Chen, and J. Friedman, “Transcriptional regulation of adipogenesis by KLF4”, Cell Metab, Vol. 7,
No. 4, pp. 339–47, 2008.
[10] K. J. Campbell, S. Rocha, and N. D. Perkins, “Active
repression of antiapoptotic gene expression by RelA(p65)
NF-kappa B”, Molecular Cell, Vol. 13, No. 6, pp. 853–65,
[11] B. Cannon and J. Nedergaard, “Brown adipose tissue:
function and physiological significance”, Physiol Rev,
Vol. 84, No. 1, pp. 277–359, 2004.
[12] L. Casteilla, O. Champigny, F. Bouillaud, J. Robelin, and
D. Ricquier, “Sequential changes in the expression of
mitochondrial protein mRNA during the development of
brown adipose tissue in bovine and ovine species. Sudden
occurrence of uncoupling protein mRNA during embryogenesis and its disappearance after birth”, Biochemical
Journal, Vol. 257, No. 3, pp. 665–71, 1989.
[13] E. Y. Chen, C. M. Tan, Y. Kou, Q. Duan, Z. Wang, G. V.
Meirelles, N. R. Clark, and A. Ma’ayan, “Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool”, BMC Bioinformatics, Vol. 14, p. 128, 2013.
[14] Y. Chen, G. Lin, J. S. Huo, D. Barney, Z. Wang, T. Livshiz,
D. J. States, Z. S. Qin, and J. Schwartz, “Computational and functional analysis of growth hormone (GH)-regulated genes identifies the transcriptional repressor B-cell lymphoma 6 (Bcl6) as a participant in GH-regulated
transcription”, Endocrinology, Vol. 150, No. 8, pp. 3645–54, 2009.
[15] M. P. Cooper, M. Uldry, S. Kajimura, Z. Arany, and B. M.
Spiegelman, “Modulation of PGC-1 coactivator pathways
in brown fat differentiation through LRP130”, Journal
of Biological Chemistry, Vol. 283, No. 46, pp. 31960–7,
[16] M. W. Feinberg, Z. Cao, A. K. Wara, M. A. Lebedeva,
S. Senbanerjee, and M. K. Jain, “Kruppel-like factor 4 is a
mediator of proinflammatory signaling in macrophages”,
Journal of Biological Chemistry, Vol. 280, No. 46, pp.
38247–58, 2005.
[17] P. Flicek, I. Ahmed, M. R. Amode, D. Barrell, K. Beal,
S. Brent, D. Carvalho-Silva, P. Clapham, G. Coates,
S. Fairley, S. Fitzgerald, L. Gil, C. Garcia-Giron, L. Gordon, T. Hourlier, S. Hunt, T. Juettemann, A. K. Kahari, S. Keenan, M. Komorowska, E. Kulesha, I. Longden, T. Maurel, W. M. McLaren, M. Muffato, R. Nag,
B. Overduin, M. Pignatelli, B. Pritchard, E. Pritchard,
H. S. Riat, G. R. Ritchie, M. Ruffier, M. Schuster, D. Sheppard, D. Sobral, K. Taylor, A. Thormann, S. Trevanion,
S. White, S. P. Wilder, B. L. Aken, E. Birney, F. Cunningham, I. Dunham, J. Harrow, J. Herrero, T. J. Hubbard,
N. Johnson, R. Kinsella, A. Parker, G. Spudich, A. Yates,
A. Zadissa, and S. M. Searle, “Ensembl 2013”, Nucleic
Acids Res, Vol. 41, No. Database issue, pp. D48–55, 2013.
[18] S. O. Freytag and T. J. Geddes, “Reciprocal regulation
of adipogenesis by Myc and C/EBP alpha”, Science, Vol.
256, No. 5055, pp. 379–82, 1992.
[19] R. Galien and T. Garcia, “Estrogen receptor impairs
interleukin-6 expression by preventing protein binding on
the NF-kappaB site”, Nucleic Acids Res, Vol. 25, No. 12,
pp. 2424–9, 1997.
[20] R. T. Gemmell and G. Alexander, “Ultrastructural development of adipose tissue in foetal sheep”, Aust J Biol Sci,
Vol. 31, No. 5, pp. 505–15, 1978.
[21] R. T. Gemmell, A. W. Bell, and G. Alexander, “Morphology of adipose cells in lambs at birth and during subsequent transition of brown to white adipose tissue in cold
and in warm conditions”, Am J Anat, Vol. 133, No. 2, pp.
143–64, 1972.
[22] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad,
M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry,
F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki,
C. Smith, G. Smyth, L. Tierney, J. Y. Yang, and J. Zhang,
“Bioconductor: open software development for computational biology and bioinformatics”, Genome Biol, Vol. 5,
No. 10, p. R80, 2004.
[23] M. Harms and P. Seale, “Brown and beige fat: development, function and therapeutic potential”, Nat Med, Vol.
19, No. 10, pp. 1252–63, 2013.
[24] V. J. Heath, D. A. Gillespie, and D. H. Crouch, “Inhibition of the terminal stages of adipocyte differentiation by
cMyc”, Exp Cell Res, Vol. 254, No. 1, pp. 91–8, 2000.
[25] P. A. Heine, J. A. Taylor, G. A. Iwamoto, D. B. Lubahn,
and P. S. Cooke, “Increased adipose tissue in male and
female estrogen receptor-alpha knockout mice”, Proc Natl
Acad Sci U S A, Vol. 97, No. 23, pp. 12729–34, 2000.
[26] N. Z. Jespersen, T. J. Larsen, L. Peijs, S. Daugaard, P. Homoe, A. Loft, J. de Jong, N. Mathur, B. Cannon, J. Nedergaard, B. K. Pedersen, K. Moller, and C. Scheele, “A classical brown adipose tissue mRNA signature partly overlaps with brite in the supraclavicular region of adult humans”, Cell Metabolism, Vol. 17, No. 5, pp. 798–805,
[27] C. Jiang, M. Ito, V. Piening, K. Bruck, R. G. Roeder, and
H. Xiao, “TIP30 interacts with an estrogen receptor alpha-interacting coactivator CIA and regulates c-myc transcription”, Journal of Biological Chemistry, Vol. 279, No. 26,
pp. 27781–9, 2004.
[28] S. Kajimura and M. Saito, “A New Era in Brown Adipose
Tissue Biology: Molecular Control of Brown Fat Development and Energy Homeostasis”, Annu Rev Physiol, 2013.
[29] J. Laurencikiene and M. Ryden, “Liver X receptors and
fat cell metabolism”, Int J Obes (Lond), Vol. 36, No. 12,
pp. 1494–502, 2012.
[30] Y. R. Lea-Currie, D. Monroe, and M. K. McIntosh, “Dehydroepiandrosterone and related steroids alter 3T3-L1
preadipocyte proliferation and differentiation”, Comp
Biochem Physiol C Pharmacol Toxicol Endocrinol, Vol.
123, No. 1, pp. 17–25, 1999.
[31] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform”, Bioinformatics,
Vol. 25, No. 14, pp. 1754–60, 2009.
[32] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan,
N. Homer, G. Marth, G. Abecasis, and R. Durbin, “The
Sequence Alignment/Map format and SAMtools”, Bioinformatics, Vol. 25, No. 16, pp. 2078–9, 2009.
[33] M. A. Lomax, F. Sadiq, G. Karamanlidis, A. Karamitri,
P. Trayhurn, and D. G. Hazlerigg, “Ontogenic loss of
brown adipose tissue sensitivity to beta-adrenergic stimulation in the ovine”, Endocrinology, Vol. 148, No. 1, pp.
461–8, 2007.
[34] O. H. Lowry, N. J. Rosebrough, A. L. Farr, and R. J.
Randall, “Protein measurement with the Folin phenol
reagent”, Journal of Biological Chemistry, Vol. 193,
No. 1, pp. 265–75, 1951.
[35] V. Matys, O. V. Kel-Margoulis, E. Fricke, I. Liebich,
S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev,
M. Krull, K. Hornischer, N. Voss, P. Stegmaier,
B. Lewicki-Potapov, H. Saxel, A. E. Kel, and E. Wingender, “TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes”, Nucleic Acids
Res, Vol. 34, No. Database issue, pp. D108–10, 2006.
[36] A. Mostyn, S. Pearce, T. Stephenson, and M. E. Symonds,
“Hormonal and nutritional regulation of adipose tissue mitochondrial development and function in the newborn”,
Exp Clin Endocrinol Diabetes, Vol. 112, No. 1, pp. 2–9,
[37] R. Nahar, P. Ramezani-Rad, M. Mossner, C. Duy, L. Cerchietti, H. Geng, S. Dovat, H. Jumaa, B. H. Ye, A. Melnick, and M. Muschen, “Pre-B cell receptor-mediated activation of BCL6 induces pre-B cell quiescence through
transcriptional repression of MYC”, Blood, Vol. 118,
No. 15, pp. 4174–8, 2011.
[38] J. H. Park, H. J. Kang, S. I. Kang, J. E. Lee, J. Hur, K. Ge,
E. Mueller, H. Li, B. C. Lee, and S. B. Lee, “A multifunctional protein, EWS, is essential for early brown fat
lineage determination”, Dev Cell, Vol. 26, No. 4, pp.
393–404, 2013.
[39] F. Picard, M. Gehin, J. Annicotte, S. Rocchi, M. F.
Champy, B. W. O’Malley, P. Chambon, and J. Auwerx,
“SRC-1 and TIF2 control energy balance between white
and brown adipose tissues”, Cell, Vol. 111, No. 7, pp.
931–41, 2002.
[40] M. Pope, H. Budge, and M. E. Symonds, “The developmental transition of ovine adipose tissue through early
life”, Acta Physiol (Oxf), Vol. 210, No. 1, pp. 20–30, 2014.
[41] E. Portales-Casamar, S. Thongjuea, A. T. Kwon, D. Arenillas, X. Zhao, E. Valen, D. Yusuf, B. Lenhard, W. W.
Wasserman, and A. Sandelin, “JASPAR 2010: the
greatly expanded open-access database of transcription
factor binding profiles”, Nucleic Acids Res, Vol. 38,
No. Database issue, pp. D105–10, 2010.
[42] S. Rajakumari, J. Wu, J. Ishibashi, H. W. Lim, A. H. Giang, K. J. Won, R. R. Reed, and P. Seale, “EBF2 determines and maintains brown adipocyte identity”, Cell
Metabolism, Vol. 17, No. 4, pp. 562–74, 2013.
[43] J. A. Romashkova and S. S. Makarov, “NF-kappaB is a
target of AKT in anti-apoptotic PDGF signalling”, Nature,
Vol. 401, No. 6748, pp. 86–90, 1999.
[44] M. Rosenwald, A. Perdikari, T. Rulicke, and C. Wolfrum, “Bi-directional interconversion of brite and white
adipocytes”, Nat Cell Biol, Vol. 15, No. 6, pp. 659–67,
[45] M. Saito, J. Gao, K. Basso, Y. Kitagawa, P. M. Smith,
G. Bhagat, A. Pernis, L. Pasqualucci, and R. Dalla-Favera,
“A signaling pathway mediating downregulation of BCL6
in germinal center B cells is blocked by BCL6 gene alterations in B cell lymphoma”, Cancer Cell, Vol. 12, No. 3,
pp. 280–92, 2007.
[46] L. Z. Sharp, K. Shinoda, H. Ohno, D. W. Scheel, E. Tomoda, L. Ruiz, H. Hu, L. Wang, Z. Pavlova, V. Gilsanz,
and S. Kajimura, “Human BAT possesses molecular signatures that resemble beige/brite cells”, Plos One, Vol. 7,
No. 11, p. e49452, 2012.
[47] D. Shepherd and P. B. Garland, “The kinetic properties of
citrate synthase from rat liver mitochondria”, Biochemical
Journal, Vol. 114, No. 3, pp. 597–610, 1969.
[48] A. Smorlesi, A. Frontini, A. Giordano, and S. Cinti,
“The adipose organ: white-brown adipocyte plasticity and
metabolic inflammation”, Obes Rev, Vol. 13 Suppl 2, pp.
83–96, 2012.
[49] K. R. Steffensen, M. Nilsson, G. U. Schuster, T. M. Stulnig, K. Dahlman-Wright, and J. A. Gustafsson, “Gene
expression profiling in adipose tissue indicates different
transcriptional mechanisms of liver X receptors alpha and
beta, respectively”, Biochem Biophys Res Commun, Vol.
310, No. 2, pp. 589–93, 2003.
[50] T. Tang, J. Zhang, J. Yin, J. Staszkiewicz, B. Gawronska-Kozak, D. Y. Jung, H. J. Ko, H. Ong, J. K. Kim, R. Mynatt, R. J. Martin, M. Keenan, Z. Gao, and J. Ye, “Uncoupling of inflammation and insulin resistance by NF-kappaB in transgenic mice through elevated energy expenditure”, Journal of Biological Chemistry, Vol. 285, No. 7,
pp. 4637–44, 2010.
[51] R Core Team, “R: A Language and Environment for Statistical
Computing”, 2013.
[52] S. Uno, K. Endo, Y. Jeong, K. Kawana, H. Miyachi,
Y. Hashimoto, and M. Makishima, “Suppression of beta-catenin signaling by liver X receptor ligands”, Biochem
Pharmacol, Vol. 77, No. 2, pp. 186–95, 2009.
[53] T. B. Walden, I. R. Hansen, J. A. Timmons, B. Cannon,
and J. Nedergaard, “Recruited vs. nonrecruited molecular
signatures of brown, "brite," and white adipose tissues”,
Am J Physiol Endocrinol Metab, Vol. 302, No. 1, pp. E19–
31, 2012.
[54] H. Wang, Y. Zhang, E. Yehuda-Shnaidman, A. V.
Medvedev, N. Kumar, K. W. Daniel, J. Robidoux, M. P.
Czech, D. J. Mangelsdorf, and S. Collins, “Liver X receptor alpha is a transcriptional repressor of the uncoupling
protein 1 gene and the brown fat phenotype”, Molecular
and Cellular Biology, Vol. 28, No. 7, pp. 2187–200, 2008.
[55] H. Wickham, “ggplot2: elegant graphics for data analysis”.
[56] J. Wu, P. Bostrom, L. M. Sparks, L. Ye, J. H. Choi,
A. H. Giang, M. Khandekar, K. A. Virtanen, P. Nuutila,
G. Schaart, K. Huang, H. Tu, W. D. van Marken Lichtenbelt, J. Hoeks, S. Enerback, P. Schrauwen, and B. M.
Spiegelman, “Beige adipocytes are a distinct type of thermogenic fat cell in mouse and human”, Cell, Vol. 150,
No. 2, pp. 366–76, 2012.
[57] Y. Yamamoto, S. Gesta, K. Y. Lee, T. T. Tran, P. Saadatirad, and C. R. Kahn, “Adipose depots possess unique
developmental gene signatures”, Obesity (Silver Spring),
Vol. 18, No. 5, pp. 872–8, 2010.
Chapter 6
Paper IV - Epigenetic changes in obesity
DNA methylation is a marker of metabolic memory, so it is important to examine its status in obesity, a known metabolic disorder. The following chapter presents an effort to capture epigenetic changes associated with altered gene expression in obese mice as compared to lean mice. The hypothesis tested in this chapter is: “Does the DNA methylation of adipose tissue differ between lean and obese mice and between adipose depots?” On the one hand, ob/ob obese mice were compared against wild type lean mice; on the other hand, diet-induced obese mice were compared against mice fed a regular diet. MeDIP-seq was performed on mature adipocytes from lean and obese mice. This led to the discovery of differentially methylated regions (DMRs) between the lean and obese mice in the respective tissues and models.
Genetic obesity is different from diet-induced obesity, as the latter is more lifestyle dependent and guided by external stimuli. The DMRs obtained from diet-induced obesity and genetic obesity were therefore compared. The different depots of adipose tissue in the body differ in their metabolic activity, gene expression and response to obesity. Thus, the comparison of the effects of obesity on DNA methylation between different adipose depots is also of interest. The DNA methylation changes were combined with gene expression changes to find the mechanisms of obesity in inguinal and epididymal tissue.
Adipose-depot specific gene regulation by DNA methylation in obesity
Rachita Yadav1, Si Brask Sonne2,3, Yin Guangliang4, Lise Madsen7, Ramneek Gupta1, Jun Wang4,5,6, Karsten Kristiansen3 and Shingo Kajimura∗2
1 Center for Biological Sequence Analysis, The Technical University of Denmark, Copenhagen, Denmark
2 UCSF Diabetes Center and Department of Cell and Tissue Biology, University of California San Francisco, San Francisco, California, USA
3 Department of Biology, University of Copenhagen, DK-2100 Copenhagen, Denmark
4 BGI-Shenzhen, Shenzhen, China
5 Princess Al Jawhara Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Jeddah, Saudi Arabia
6 Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau, China
7 National Institute of Nutrition and Seafood Research, Bergen, Norway
The study aims to examine global DNA methylation, the metabolic memory marker of adipocytes, in obesity and its effects on the gene expression of adipose tissue. Additionally, the study aims to find differences between genetic and diet-induced obesity and between visceral and subcutaneous depots of adipose tissue in the obese state. Genome-wide DNA hypomethylation occurred more commonly than hypermethylation in both depots of adipose tissue under study as well as in both types of obesity. We observe large differences between the two models and between epididymal and inguinal tissue. Common to all was the hypomethylation that followed obesity. This report presents the first global methylation study in mature adipocytes in genetic and diet-induced obesity in male mice, combining global methylation and gene expression data.
Obesity, mature adipocytes, DNA methylation, microarray, epididymal adipose tissue,
inguinal adipose tissue
Obesity is a growing problem worldwide [7] due to easier access to food and a more sedentary lifestyle. When energy intake exceeds energy expenditure, excess energy is stored in adipose tissues. The amount of adipose tissue in the body defines the physiological state of the organism. An increase in adipose tissue is associated with disorders like insulin resistance, hyperglycemia, dyslipidemia, hypertension and inflammation [27].
Adipose tissue is a complex tissue with high metabolic and endocrine activity. It is made up of multiple cell types, i.e. adipocytes, vascular endothelial cells, fibroblasts, and macrophages. Adipose tissues are classified as either subcutaneous or visceral, depending on their localization in the body. Generally, visceral fat, which surrounds the organs in the body cavity, is more associated with metabolic disorders than the subcutaneous fat depot, which is found right below the skin and can be especially abundant on hips and thighs [41]. A high waist-to-hip ratio rather than BMI is predictive of insulin resistance and cardiovascular complications [32]. Increasing amounts of visceral fat are associated with increased inflammation, release of inflammatory cytokines such as TNF-alpha and IL-6, and increasing blood levels of free fatty acids (FFA) [34]. The preadipocytes from the two depots have specific gene expression signatures that persist in mature adipocytes. Adipose tissues in different parts of the body are not the same: they differ in metabolic properties [70], gene expression [69], protein secretion [23] and depot-specific angiogenesis [3].
∗ Corresponding author: Shingo Kajimura
One of the main features of adipose tissue is the ability to change its size. Adipose tissue grows by increasing the size of adipocytes (hypertrophy) as well as by increasing the number of cells (hyperplasia). In obesity, hypertrophy precedes hyperplasia in order to meet the requirement for excess energy storage [48]. Overnutrition causes an energy imbalance in the body, which in turn leads adipose tissue first to increase in size and then to recruit more adipocytes, which initiates obesity. Along with nutrition, the genetic makeup of the organism is also responsible for obesity. Studies in model organisms found that the mechanisms of excess fat mass accumulation differ substantially between genetic obesity and diet-induced obesity [74]. Multiple genes and loci have been found to be associated with weight gain and obesity-related phenotypes. Along with genes, epigenetics has been found to be associated with metabolic disorders including diabetes and obesity [49]. There are multiple epigenetic factors that predispose individuals towards certain phenotypes. Amongst the epigenetic features, DNA methylation has been associated with adipogenesis [45], appetite regulation [66] and body weight homeostasis [64]. Diabetic patients who undergo therapy to normalize their blood glucose levels still have cardiovascular problems [11]. This phenomenon is called metabolic memory and has been suggested to be associated with epigenetic changes [50]. We speculate that such a metabolic memory may also be present in fat cells and may contribute to the difficulty in sustaining weight loss.
It is well known that exposure to certain external factors causes cell type-dependent epigenetic changes. Different chemical and environmental toxins induce changes to DNA methylation patterns leading to epimutations, which are associated with phenotypes [24]. DNA methylation depends on the activity of enzymes called DNA methyltransferases (Dnmt1, Dnmt2 and Dnmt3), which transfer the available methyl groups to DNA. The availability or scarcity of methyl groups also plays a vital role in DNA methylation. In humans, the major dietary sources of methyl groups are methionine, one-carbon metabolism via methylfolate, and choline [2]. An experiment in mice showed that an increase in folic acid intake leads to increased DNA methylation of the agouti locus, causing a change in phenotype [73].
With this study, we aimed to find the methylation differences between the visceral and subcutaneous fat depots, along with the implications of these differences in obesity. To infer the effect of genetics and diet on methylation differences in obesity, we tested the variation in methylation in a genetic as well as a diet model. As a genetic model of obesity, we chose the leptin-deficient ob/ob mice, which gain weight due to excessive food intake. To find the effect of diet on the methylation of obese tissue, we compared mice fed a high fat diet with mice fed a regular diet for 15 weeks. Fifteen weeks of high fat diet leads to a dramatically increased body weight, accompanied by a reduction of central leptin sensitivity [46] and incipient insulin resistance/glucose intolerance [14].
Methods and material
Experiment design (Animal model and tissue
In the genetic obesity model, nine-week-old male wild type (wt) and ob/ob mice were obtained from the Jackson Laboratory. DNA was isolated from mature adipocytes in epididymal and inguinal adipose tissue (n=4, obese and lean). These mice were fed a chow diet corresponding to the regular diet used as a control in the diet-induced model. For the diet-induced obesity model, four-week-old male C57BL/6J mice were obtained from the Jackson Laboratory. Mice were fed a regular diet (RD: 10% kcal fat, D12450B, Research Diets Inc.) or a high fat diet (HFD: 60% kcal fat, D12492, Research Diets Inc.) for fifteen weeks. DNA was isolated from mature adipocytes in epididymal and inguinal adipose tissue (n=3, obese and lean). For both models, mature fat cells were isolated from the epididymal and inguinal adipose tissues by digestion with collagenase D and dispase II. DNA was isolated from the mature fat cells using a commercially available kit (DNeasy, Qiagen).
Library preparation
A library was prepared from 5 µg of original DNA as previously described [43]. Briefly, DNA was fragmented. End repair, 'A' base addition and adaptor ligation steps were performed using Illumina's Paired-End DNA Sample Prep kit following the manufacturer's instructions. Adaptor-ligated DNA was immunoprecipitated with an anti-5mC antibody, and MeDIP products were validated by qPCR using SYBR green mastermix (Applied Biosystems) and primers for positive and negative control regions supplied in the MeDIP kit (Diagenode). The qPCR cycling conditions were 95 °C for 5 min, followed by 40 cycles of 95 °C for 15 s and 60 °C for 1 min. MeDIP DNA was purified with a ZYMO DNA Clean & Concentrator-5 column following the manufacturer's instructions and amplified by adaptor-mediated PCR in a final reaction volume of 50 µl. After excising amplified DNA between 220 and 320 bp on a 2% agarose gel, amplification quality and quantity were evaluated using an Agilent 2100 Bioanalyzer and DNA 1000 chips. Paired-end sequencing was performed on an Illumina platform.
MeDIP-Seq data analysis
Raw paired-end reads from the MeDIP sequencing were checked for quality using FastQC [1]. Clean reads of 49 nucleotides were mapped to the mouse reference genome build mm9 using Bowtie2 [40] for each sample independently. Mapped reads were filtered for a mapping quality of 30 and sorted using Picard and samtools [42], whereas duplicates were removed using Picard MarkDuplicates. After alignment, reads were filtered for poor alignment quality, PCR duplicates and a missing mate in the alignment. Using the mapped reads, the correlation between replicates was checked using the Spearman correlation coefficient. Mapped data in BAM format were further analysed for differentially methylated regions between ob/ob vs wt and HFD vs RD in epididymal and inguinal fat. In the MEDIPS package of R, reads (49 nucleotides each) mapped to the genome were extended to 300 nucleotides to capture all CpGs in a region. The genome was divided into non-overlapping bins of 250 nucleotides during this analysis and the reads mapped per bin were counted. Relative methylation scores (rms) were calculated for each bin of the genome by counting the number of mapped reads. Normalization was applied to this count data to convert it to reads per million (rpm). MEDIPS internally uses the edgeR package to identify differentially methylated regions within the genome. edgeR uses the negative binomial distribution (especially useful for discrete data) to find the DMRs between the two states under comparison and thus calculates mean methylation values (rpm, rms, ams), log fold changes, variances and p-values comparing the two sample sets. The DMRs are mapped to a gene if they lie within the gene body or within 10 kb upstream or downstream of the gene body.
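The read extension, binning and reads-per-million steps described above can be sketched in a few lines. This is an illustration only and not the MEDIPS implementation: the tuple layout of a read, the half-open coordinates and the choice to count a read in every bin that its extended interval overlaps are assumptions of this sketch.

```python
from collections import Counter

BIN_SIZE = 250    # non-overlapping bin width (from the text)
EXTEND_TO = 300   # length to which each mapped read is extended (from the text)


def bin_counts(reads, bin_size=BIN_SIZE, extend_to=EXTEND_TO):
    """Count extended reads per (chromosome, bin index).

    `reads` holds (chrom, start, end, strand) tuples with 0-based,
    half-open coordinates. Reads are extended in their 3' direction.
    """
    counts = Counter()
    for chrom, start, end, strand in reads:
        if strand == "+":
            ext_start, ext_end = start, start + extend_to
        else:
            ext_start, ext_end = max(0, end - extend_to), end
        first_bin = ext_start // bin_size
        last_bin = (ext_end - 1) // bin_size
        for b in range(first_bin, last_bin + 1):
            counts[(chrom, b)] += 1
    return counts


def reads_per_million(counts, total_reads):
    """Scale raw bin counts to reads per million mapped reads."""
    scale = 1_000_000 / total_reads
    return {key: n * scale for key, n in counts.items()}


# Toy reads of 49 nucleotides (hypothetical positions)
reads = [
    ("chr1", 100, 149, "+"),
    ("chr1", 600, 649, "-"),
    ("chr1", 10, 59, "+"),
]
counts = bin_counts(reads)
rpm = reads_per_million(counts, total_reads=len(reads))
print(dict(counts))
```

A per-bin count table of this kind, one column per sample, is the sort of input on which a negative binomial test such as the one in edgeR operates.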
Microarray analysis
For the microarray analysis, 20 four-week-old male C57BL/6J mice were obtained from the Jackson Laboratory and fed either RD (n=10) or HFD as described above. Total RNA was extracted from mature adipocytes using Trizol LS (Invitrogen), DNase treated (Qiagen) and LiCl precipitated. The quality and quantity of RNA were determined using a Bioanalyzer nano kit (Agilent Technologies, Santa Clara, California, US) and a Qubit RNA BR Assay (Life Technologies, Waltham, Massachusetts, US), respectively. Of these, the 5 mice with the highest RIN values were chosen for subsequent analysis. Gene expression profiles were determined using the Mouse Agilent 4X44 v2 gene expression arrays.
Data analysis of microarray
The single colour data from the 20 samples was analysed using the limma package [61]. Background correction and normalisation of the data were done based on the negative controls. After normalisation, the differentially expressed genes between RD and HFD were identified in the inguinal and epididymal data independently using the empirical Bayes method. Pathway enrichment for Reactome database pathways was carried out on the up and down regulated genes in the two tissues using the GSEA [62] pre-ranked module with the Molecular Signatures Database (MSigDB) [44].
Integrative methods
The regions with differential methylation (DMRs) between RD and HFD were mapped to genes with differential gene expression (DGE). If a DMR falls within the gene boundary or within +/- 10 kb of the gene boundary, the DMR is taken to affect the gene expression. Genes with DMRs and DGE are only considered for further analysis if the log2 fold change between the two conditions is >= 1.
Functional analysis
Genes found to be differentially expressed and to have a methylation effect were analysed for their functions in relation to adipogenesis and their impact on obesity. For the differentially methylated regions, a motif search was carried out to find transcription factor binding sites using the MEME suite with the UniPROBE database. To utilise previous knowledge, transcription factors controlling the methylation-controlled differentially expressed genes were queried from the public ChIP-X data analysed and stored in the ChEA database [39] using the Enrichr [12] tool. Upstream analysis for the two sets of genes using the signalling molecules was carried out using the key node functionality of Explain [36] from BIOBASE Corporation. The maximum distance to search was allowed to be six and FDR < 0.05 was used as the cut-off.
We wanted to compare mature adipocytes from diet-induced obesity and obesity caused by leptin deficiency (genetic obesity). For the genetic obesity, ob/ob mice were significantly heavier than the wild type mice when the mice were sacrificed at 9 weeks of age (46.1 +/- 2.9 g and 24.8 +/- 2.1 g, respectively) (p-value<0.0001). In the diet-induced model, after 15 weeks on RD vs HFD, average body weight was significantly higher in mice fed the HFD (28.7 +/- 1.5 g and 47.2 +/- 1.3 g, respectively).
Figure 1: Volcano plots showing mean methylation differences between the tissue from obese mice and lean mice in inguinal and epididymal fat of the two models. Panels: inguinal genetic model, inguinal diet model, epididymal diet model and epididymal genetic model; x-axis: log2 fold change; y-axis: -log10 adjusted p-value.
MeDIP-seq reveals extensive hypomethylation in obesity
Sequencing was carried out for 28 samples from two tissues in four different conditions: wt, ob/ob, RD and HFD. This resulted in 4,898,872,318 paired-end reads of 49 nucleotides. The cleaned data showed no adapter or sequencing primer contamination, and the base quality was good and consistent throughout the read length. MeDIP-sequencing read counts and uniquely mapped reads are presented in Supplementary Table ST1. On average, 170 million paired-end reads per sample were obtained from sequencing. After removing PCR duplicates, reads with a missing mate and poorly mapped reads, approx. 35% of the reads from each sample were used in the later analysis. MeDIP-seq yields short reads, and relatively few of them map to the reference genome because the CpG islands lie in the repeat regions of the genome, which are hard to map using short reads. A high correlation of R2 > 0.90 was found between the replicate samples collected for each tissue under the respective experimental conditions.
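The replicate check can be illustrated with a small, dependency-free Spearman correlation. This is a sketch under stated assumptions: the inputs are per-bin read counts for two replicates, the helper names are hypothetical, and comparing the squared coefficient with the reported R2 threshold is an assumption, since the text does not say how R2 was derived.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-bin counts for two replicates (hypothetical values)
rep1 = [10, 0, 35, 7, 120]
rep2 = [12, 1, 30, 9, 100]
rho = spearman(rep1, rep2)
print(rho, rho ** 2 > 0.90)
```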
Each 250-base region that is differentially methylated in obese mice compared to lean mice is represented as one point in each volcano plot, with regions at p-value < 0.001 marked in red (Figure 1). To find the distribution of the differentially methylated regions (DMRs), the genome was partitioned into promoters, exons, introns and intergenic regions. The distribution of DMRs varies between the inguinal and epididymal tissue in diet-induced and genetic obesity (Figure 2). Both tissues in the diet model have more DMRs in exons than in the other genomic regions. Only the inguinal tissue in the genetic model has more DMRs in promoters than in other regions of the genome. The counts of DMRs in the two models and two tissues are summarised in Table 1. There are more hypomethylated regions in the obese mice than in the corresponding lean mice.
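As a minimal sketch of the counting described above, each 250-base window can be labelled from its fold change and p-value. The strict cutoff p < 0.001 follows the volcano plot description (the Table 1 caption uses <= 0.001), the sign convention (negative log2 fold change means less methylation in obese mice) is an assumption, and the windows below are hypothetical.

```python
def classify_window(log2_fc, p_value, p_cutoff=0.001):
    """Label one 250-base window by its obese-vs-lean methylation change."""
    if p_value >= p_cutoff or log2_fc == 0:
        return "not significant"
    # Assumed convention: negative fold change = lower methylation in obese mice.
    return "hypomethylated" if log2_fc < 0 else "hypermethylated"

def count_dmrs(windows, p_cutoff=0.001):
    """Count hypo- and hypermethylated DMRs among (log2_fc, p_value) pairs."""
    counts = {"hypomethylated": 0, "hypermethylated": 0}
    for log2_fc, p_value in windows:
        label = classify_window(log2_fc, p_value, p_cutoff)
        if label in counts:
            counts[label] += 1
    return counts

# Hypothetical windows: (log2 fold change, p-value).
toy_windows = [(-1.8, 0.0002), (0.9, 0.0004), (-0.3, 0.2), (-2.1, 0.00001)]
summary = count_dmrs(toy_windows)
```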
Figure 2: DMR partition distribution with the genomic region based on functional properties. Panels: inguinal tissue in genetic model, inguinal tissue in diet model, epididymal tissue in diet model and epididymal tissue in genetic model; axis: fold change relative to whole genome.
Experiment Name
Inguinal High Fat Vs Regular diet
Epididymal High Fat Vs Regular diet
Inguinal Obese Vs Wild type
Epididymal Obese Vs Wild type
Table 1: Differentially methylated regions of 250 bases in epididymal and inguinal tissues in two different models
at p-value <= 0.001. Region in each comparison is divided into hypomethylated and hypermethylated regions based on
their state in obese as compared to corresponding lean mice. The fifth column shows the number of mouse genes mapped
by the differentially methylated regions.
Epididymal and inguinal tissues are more similar in the diet model
The DMRs are mapped to a gene if they are located within the gene body or within 10 kb upstream or downstream of it. The numbers of genes mapped by DMRs in the two models in inguinal and epididymal tissue are documented in the last column of Table 1. Some genes are affected by methylation in more than one tissue or model. The Venn diagram (Figure 3) shows an overlap of six genes between inguinal and epididymal tissues in the genetic model. These six genes are Sntg1, Galntl6, 2210408I21Rik, Park2, Rn45s and Mid1. In contrast, the diet model has 2004 genes that are common between inguinal and epididymal tissues. Out of these 2004 genes, 1969 are hypomethylated in the HFD mice in both inguinal and epididymal tissues. The remaining 35 genes are hypermethylated in the epididymal tissue while hypomethylated in the inguinal tissue. Thus, the diet model is more consistent between the two tissues than the genetic model.
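The mapping rule above (gene body plus 10 kb on either side) and the tissue overlap can be sketched as follows. This is an illustration under simple assumptions, not the pipeline used in the study: coordinates are treated as closed intervals, and the gene names and positions are hypothetical.

```python
FLANK = 10_000  # 10 kb upstream and downstream of the gene body

def genes_hit_by_dmrs(dmrs, genes, flank=FLANK):
    """Return the genes whose body, extended by `flank`, overlaps any DMR.

    dmrs  -- list of (chrom, start, end)
    genes -- dict mapping gene name to (chrom, start, end)
    """
    hit = set()
    for name, (g_chrom, g_start, g_end) in genes.items():
        lo, hi = g_start - flank, g_end + flank
        for d_chrom, d_start, d_end in dmrs:
            if d_chrom == g_chrom and d_start <= hi and d_end >= lo:
                hit.add(name)
                break
    return hit

# Hypothetical annotation and DMR calls (250-base windows).
genes = {
    "geneA": ("chr1", 50_000, 60_000),
    "geneB": ("chr1", 200_000, 210_000),
    "geneC": ("chr2", 5_000, 9_000),
}
inguinal_dmrs = [("chr1", 41_000, 41_250), ("chr2", 30_000, 30_250)]
epididymal_dmrs = [("chr1", 65_000, 65_250), ("chr1", 215_000, 215_250)]

inguinal_hits = genes_hit_by_dmrs(inguinal_dmrs, genes)
epididymal_hits = genes_hit_by_dmrs(epididymal_dmrs, genes)
shared = inguinal_hits & epididymal_hits  # the kind of overlap counted in Figure 3
```

The reported diet-model numbers are consistent with this kind of set arithmetic: 1969 genes hypomethylated in both tissues plus 35 genes with opposite directions give the 2004 shared genes.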
More genes are affected by diet-induced obesity in epididymal fat tissue
The HFD feeding was repeated to obtain samples for gene expression analysis. After 15 weeks of feeding RD or HFD, there was a significant difference in the weights of the mice used for the microarray samples. The weights of the RD and HFD mice at the 19th week were 31.76 +/- 1.8 g and 46.03 +/- 4.8 g, respectively. High correlation was observed between the biological replicates used for the gene expression analysis. The expression levels of the different probes on the arrays were transformed to gene-level expression using the median of the expression values of the multiple probes mapped to each gene. At a log2 fold change of >= 1 between regular diet and high fat diet, 411 genes were differentially expressed in inguinal tissue and 1135 genes were differentially expressed in epididymal tissue. Thus, more than twice as many genes change in epididymal as in inguinal tissue in obese mice. The heatmaps show that approximately 75 % of genes in inguinal tissue have higher expression in the HFD, in line with the hypomethylation of most DMRs in inguinal HFD. In epididymal tissue, there is a 2:1 ratio between up-regulated and down-regulated genes (Figure 4). Epididymal diet model had the maximum number of hypermethylated regions in HFD amongst the four tested
Figure 3: Overlap of genes with differentially methylated regions. Genes harbouring differentially methylated regions in the inguinal and epididymal tissue in the genetic and diet models. The highest overlap of 2004 genes is found between the inguinal diet model and the epididymal diet model. Sets: inguinal tissue diet model, epididymal tissue diet model, inguinal tissue genetic model and epididymal tissue genetic model.
Figure 4: Heatmaps for differentially expressed genes in the diet models for (A) inguinal and (B) epididymal adipocytes. Panel titles: gene expression changes in inguinal tissue in diet model and gene expression changes in epididymal tissue in diet model; each panel has a colour key and histogram.
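The probe-to-gene summarisation and the fold-change threshold described in this section can be sketched as below. The sketch assumes that expression values are already on a log2 scale and that the threshold applies to the absolute log2 fold change; probe and gene names are hypothetical, and the significance testing used in the study is not reproduced here.

```python
from statistics import median

def gene_level(probe_values, probe_to_gene):
    """Collapse probe-level log2 expression to one value per gene (median)."""
    by_gene = {}
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene.setdefault(gene, []).append(value)
    return {gene: median(values) for gene, values in by_gene.items()}

def changed_genes(rd, hfd, threshold=1.0):
    """Genes with |log2 fold change| >= threshold between HFD and RD."""
    out = {}
    for gene in rd.keys() & hfd.keys():
        diff = hfd[gene] - rd[gene]  # log2(HFD / RD)
        if abs(diff) >= threshold:
            out[gene] = "up" if diff > 0 else "down"
    return out

# Hypothetical probes and log2 expression values.
probe_to_gene = {"p1": "geneA", "p2": "geneA", "p3": "geneA",
                 "p4": "geneB", "p5": "geneB"}
rd_probes = {"p1": 6.0, "p2": 6.4, "p3": 9.0, "p4": 8.0, "p5": 8.2}
hfd_probes = {"p1": 7.5, "p2": 7.9, "p3": 9.1, "p4": 8.3, "p5": 8.1}

changed = changed_genes(gene_level(rd_probes, probe_to_gene),
                        gene_level(hfd_probes, probe_to_gene))
```

Using the median rather than the mean keeps a single outlying probe (p3 in the toy data) from dominating the gene-level value.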
Gene ontology enrichment for differentially expressed genes
Out of the 411 genes differentially expressed in inguinal tissue between HFD and RD, 304 genes were up-regulated in HFD whereas 107 genes were down-regulated. At FDR < 0.01, the up-regulated genes are enriched in biological functions such as angiogenesis, vascular system development, immune system, cell migration and motility, and response to stimuli and stress. In contrast, the down-regulated genes from the inguinal tissue are involved in lipid metabolic process, response to chemical stimulus, fat cell differentiation, mitochondrion, response to hormone stimulus and response to organic substance. In the epididymal tissue, among the 1135 differentially expressed genes, 757 were up-regulated while 378 were down-regulated. The prominent results of the gene ontology enrichment show that the up-regulated genes are involved in immune system process, response to other organism, leukocyte activation, cytokine production, hemopoiesis, response to stress and stimuli, angiogenesis, phagocytosis and membrane organization, as well as motility, adhesion, differentiation and death of cells. On the other hand, the genes with decreased expression in HFD are metabolism-related genes involved in fatty acid oxidation and lipid metabolism, along with fat cell differentiation and response to nutrient levels, which are all adipose tissue related. Other classes are response to chemical stimulus and response to peptide hormone stimulus. A large portion of these genes (53) are localised to the mitochondrion.
Pathway enrichment for differentially expressed genes
Enrichment for Reactome database pathways is observed for both up-regulated genes (positive enrichment score) and down-regulated genes (enrichment score with a negative sign) in the inguinal and epididymal tissue (Figure 7). In the inguinal adipocytes from diet-induced obese mice, innate immunity genes and fatty acid and other metabolism-related genes are down-regulated, whereas the adaptive immune response, cell cycle and organ development classes are up-regulated in HFD. The pathways suggest that the cells are storing fatty acids and growing in size while being invaded by macrophages. In the epididymal tissue, as in inguinal tissue, the fatty acid and other metabolism pathways are down-regulated; in addition, insulin signalling genes are among the down-regulated genes. The up-regulated genes are active in transcription, nervous tissue development and adaptive response. These classes indicate that epididymal fat reacts more to insulin activity and that more neurovascular tissue development occurs in epididymal tissue under obesity.
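The text does not state how the signed enrichment score was computed. As one possible illustration only, the sketch below uses a hypergeometric over-representation test as a stand-in and attaches a sign according to the direction of the gene list; the function names and the toy numbers are hypothetical.

```python
from math import comb, log10

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n are in the pathway."""
    upper = min(n, N)
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total

def signed_score(k, M, n, N, direction):
    """-log10(p) for up-regulated lists, the same value negated for down-regulated ones."""
    p = hypergeom_sf(k, M, n, N)
    if p == 0.0:
        raise ValueError("overlap larger than the pathway or the gene list")
    score = -log10(p)
    return score if direction == "up" else -score

# Toy example: 10 genes in total, 3 in the pathway, 4 in the list, 3 shared.
p_toy = hypergeom_sf(3, 10, 3, 4)
score_down = signed_score(3, 10, 3, 4, "down")
```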
Methylation driven changes in gene expression differ between inguinal and epididymal
In mature adipocytes from inguinal tissue, the expression of 39 genes is affected by DMRs, while in epididymal tissue 103 genes show a change in expression under methylation control. Of the 39 inguinal genes, 33 are up-regulated with lower methylation levels in HFD, and six are down-regulated and hypermethylated in HFD (figure ). In the epididymal tissue, 57 genes were up-regulated with lower methylation levels in HFD. Thirty genes were hypomethylated in HFD but down-regulated at the gene expression level. Ten genes were hypermethylated in HFD and up-regulated, whereas 6 genes were hypermethylated in HFD and down-regulated (Figure 5).
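The four combinations of methylation and expression change counted above can be tallied with a short sketch. It assumes signed log2 fold changes (HFD versus RD) per gene for both data types; the toy gene names and values are hypothetical, while the totals checked at the end are the counts reported in the text.

```python
from collections import Counter

def concordance_classes(methylation, expression):
    """Count genes per (methylation, expression) direction.

    Both arguments map gene name to a signed log2 fold change (HFD vs RD).
    """
    classes = Counter()
    for gene in methylation.keys() & expression.keys():
        m, e = methylation[gene], expression[gene]
        if m == 0 or e == 0:
            continue  # no direction to assign
        meth = "hypo" if m < 0 else "hyper"
        expr = "up" if e > 0 else "down"
        classes[(meth, expr)] += 1
    return classes

# Hypothetical fold changes; geneD and geneE lack one of the two measurements.
toy_methylation = {"geneA": -1.2, "geneB": -0.8, "geneC": 0.9, "geneD": 1.1}
toy_expression = {"geneA": 1.5, "geneB": -1.3, "geneC": 1.1, "geneE": 2.0}
toy_classes = concordance_classes(toy_methylation, toy_expression)

# Counts reported in the text for the diet model.
epididymal_reported = {("hypo", "up"): 57, ("hypo", "down"): 30,
                       ("hyper", "up"): 10, ("hyper", "down"): 6}
inguinal_reported = {("hypo", "up"): 33, ("hyper", "down"): 6}
```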
Twenty-four genes (Nrp2, Notch3, S100a8, Ptpn18, Lgals3, Icam1, Cd300lg, Ptprc, Tspan2, Ncf4, Tnfaip8l1, Myh9, Fbxl7, Dusp7, Mpzl1, Mmp11, Prcp, Rcsd1, Gmfg, Lipe, 4930519F09Rik, C4a, Aacs, Mup5) are common between inguinal and epididymal fat in diet-induced obese mice (Figure 6), and all but one (Mup5) show the same direction of methylation and gene expression change in both tissues. Nineteen of these genes are hypomethylated and up-regulated, while 4 genes are hypermethylated but increase in expression in HFD.
In the two tissues, some common mechanisms are affected by the HFD through shared genes. The up-regulation in HFD, under hypomethylation control, of genes like S100 calcium binding protein A8 (S100a8) and Lectin, Galactoside-Binding, Soluble, 3 (Lgals3/Gal-3) shows that these cells have high immunological activity. Lgals3 is a proinflammatory mediator that is up-regulated in both adipose tissues of the diet model. The down-regulation of the lipid metabolism genes Acetoacetyl-CoA synthetase (Aacs) and Hormone-sensitive lipase (Lipe/Hsl) shows that metabolic activities in these cells are highly reduced. Prolylcarboxypeptidase (Prcp) is a serine protease that is expressed in multiple peripheral organs, white blood cells, fibroblasts and endothelial cells, where it is localised to the membrane. Neuropilin-2 (Nrp2), a lymphatic vessel development gene, is up-regulated in both inguinal and epididymal adipocytes.
Other genes differ in methylation and expression between the two tissues. We see functional differences between the genes regulated by HFD in these tissues, and thus there are physiological differences in the obesity phenotype of these two tissues [8]. In the inguinal tissue, Carnitine palmitoyltransferase I (Cpt1) and Protein kinase (cAMP-dependent, catalytic) inhibitor gamma (Pkig), both of which have functions related to obesity, are up-regulated in HFD. In the epididymal tissue, there are tissue-specific genes that are differentially expressed and have DMRs within the effective boundaries. Some of them with functions in obesity are High mobility group A1 (Hmga1), Complement component 3a receptor 1 (C3ar1), BTB and CNC homology 1 (Bach1) and Minichromosome maintenance complex component 10 (Mcm10). All of these are hypomethylated and up-regulated. Zinc-finger nuclear protein (Zfp521) is a key regulator of adipose commitment and differentiation and acts as a repressor of adipogenesis [35]. We find it up-regulated in the epididymal dataset, which is not a typical property of adipose tissue.
Published ChIP-seq data from the ChEA database, queried using Enrichr, revealed thirteen transcription factors binding (FDR < 0.01) to the 34 genes with differential gene expression and differential methylation in inguinal tissue (Table 2). In the epididymal tissue, thirteen transcription factors (FDR < 0.01) have a binding site in 93 genes with differential gene expression and differential methylation (Table 3). Five transcription factors are common between the two tissues: Friend leukemia integration 1 (Fli1), Krueppel-like factor 4 (Klf4), SWI/SNF Related, Matrix Associated, Actin Dependent Regulator Of Chromatin, Subfamily A, Member 4 (Transcription activator BRG1/Smarca4), Wilms tumor protein (Wt1) and T-cell acute lymphocytic leukemia 1 (Tal1/Scl).
Figure 5: Boxplot showing the mean methylation fold changes in the three classes of gene expression (up-regulated, down-regulated and genes not changing in expression) in the two tissues. Panels: inguinal diet model and epididymal diet model; axes: methylation level and gene expression class.
Figure 6: Venn diagram representing the overlap between genes with differential gene expression along with differential methylation between inguinal and epididymal diet models.
Obesity is known to be polygenic, and the environment, especially diet, plays an important role in the regulation of gene functions in obesity. In this study, we initially included two separate obesity models, diet-induced obesity and genetic obesity (ob/ob), to identify common methylation patterns in these two settings. Interestingly, the methylation data consistently revealed more hypomethylated regions in obese compared to lean mice. This was the case in both inguinal and epididymal fat from diet-induced obese mice and ob/ob mice. Hypomethylation due to decreased methionine levels [75] or mutations in the MTHFR gene [67] is associated with an increased risk of Type 2 diabetes, and feeding pregnant mice a high fat diet has been shown to lead to hypomethylation in the offspring [9, 21]. Also, multiple micronutrient deficiencies are observed in obesity, including zinc, selenium, vitamin B1/B12 and folate [19]. Folate is important for multiple biological processes as a 1-carbon source for the methylation of different molecules, including DNA, RNA and proteins.
Anaemia and obesity were suggested causes of folate deficiency in females [10]. Also, serum folate concentrations were lower in obese patients with non-alcoholic fatty liver [30] as well as in individuals with high BMI [38]. The lower folate concentration in serum and low folate availability to the cells is associated with an increased urinary 8-hydroxy-2'-deoxyguanosine and may further promote DNA strand breaks and global DNA hypomethylation [17, 71].
During obesity, the adipose tissue is a fast-growing mass of cells with endocrine activity, with adipocytes undergoing hypertrophy and hyperplasia and neovascularisation occurring within the tissue. Adipose tissue in obesity shares these properties with cancerous tissue [51]. The DNA damage caused by folate deficiency can lead to abnormal DNA repair and methylation, which can cause the cells to enter a neoplasia-like state [33]. Thus, we suggest that obesity might cause hypomethylation in the adipocytes as an effect of folate deficiency, which disrupts DNA repair mechanisms and leads to selective hypomethylation of the adipocytes in obese tissue. Methyl donor supplementation in our HFD mice could therefore potentially change the methylation pattern we observe.
The overlap between the two models turned out to be very limited, which is perhaps not surprising given that ob/ob mice gain weight due to increased intake of a chow diet, whereas the diet-induced obese mice are challenged with a high fat diet. Furthermore, the degree of obesity/metabolic syndrome may have progressed to different stages. Finally, the lack of leptin in ob/ob mice not only decreases satiety, it also has systemic effects on metabolism, i.e. decreasing energy expenditure by 30% [28]. Therefore we decided to focus on the diet-induced model, as it is a more relevant model in a clinical setting.
Only a small proportion of the DMRs identified were localised to promoter regions, while exons in particular were frequent targets. Methylation of CpG islands (CGIs) in promoter regions usually decreases gene expression through steric interference with transcription factor binding, but in a few cases hypomethylation has been correlated with decreased transcription [37]. Intragenic CGIs may indirectly affect gene expression through regulatory noncoding RNAs, alternative splicing regulated by methylation status, or regulation of transcriptional elongation [20]. The effect of methylation on gene expression is not a fully understood mechanism, and DNA methylation can lead to both positive and negative regulation of gene expression.
Upregulated targets
Acoxl, Gmfg, Icam1, Plekhg2, Cpt1a, Ptpn18, Dusp7, Pkig, Myo1b, Myo1g, Fbxl7,
Nrp2, Crim1, Ncf4, Cotl1, Lgals3, Myh9, Cdh13, Tnfaip8l1, Slco3a1
Mpzl1, Tspan2, Myo1b, Nrp2, Crim1, Rcsd1
Dusp7, Pkig, Nrp2, Ncf4, Acoxl, Cotl1, Gmfg, Myh9, Tnfaip8l1, Slco3a1
Rcsd1, Ptpn18, Notch3, Dusp7, Myo1b, Lrrc8c, Fbxl7, Nrp2, Crim1, Lgals3, Cdh13,
Slco3a1, Tspan2, Mpp6, Cdh13, Dusp7, Nrp2, Crim1, Slco3a1, Ptpn18, Rcsd1
Mmp11, Mpzl1, Myo1b, Lgals3, Gmfg, Icam1, Plekhg2, Cpt1a
Mmp11, Dusp7, Myo1g, Nrp2, Lgals3, Gmfg, Prcp, Tnfaip8l1, Cpt1a, Notch3
Mpzl1, Acoxl, Prcp, Cpt1a, Ptpn18, Mmp11, Myo1g, Myh9, Cdh13, Pde7a, Cd300lg,
Acoxl, Rcsd1, Prcp, Plekhg2, Cpt1a, Mmp11, Pkig, Myo1g, Fbxl7, Cotl1, Tnfaip8l1,
Gmfg, Icam1, Notch3, Crim1, Cdh13, Mpp6
Prcp, Lrrc8c, Icam1, Lgals3, Notch3, Cd300lg
Dusp7, Myo1b, Crim1, Rcsd1, Prcp, Plekhg2, Slco3a1, Notch3
Mpzl1, Adamts9, Acoxl, Icam1, Plekhg2, Cpt1a, Mmp11, Pkig, Myo1b, Nrp2, Crim1,
Lgals3, Myh9, Tnfaip8l1, Pde7a
Mmp11, Mpzl1, Gmfg, Myh9, Icam1, Plekhg2, Cpt1a, Notch3
Gmfg, Prcp, Icam1, Cpt1a, Ptpn18, Notch3, Mmp11, Pkig, Lrrc8c, Ncf4, Cotl1,
Lgals3, Myh9
Lipe, Aacs
Table 2: Transcriptional control of the differentially expressed genes with methylation region in the inguinal diet model with the effect range with the adjusted p-value < 0.01 for the group and the target genes
Epididymal and inguinal adipose tissue vary in their cellular composition, origin and gene expression patterns [5]. Accordingly, the effect of obesity differs between these tissues. Epididymal adipose tissue is associated with insulin resistance, diabetes, hypertension, atherosclerosis and hepatic steatosis [26]. On the other hand, inguinal adipose tissue secretes more adiponectin and fewer inflammatory cytokines and responds better to insulin [26]. In this study we discovered methylation differences between these tissues. DMRs also differ between the models, revealing that these tissues respond differently to genetic and diet-induced obesity. The epididymal tissue showed more differences in obese mice than the inguinal tissue, reflecting that the epididymal tissue is more variable than the inguinal tissue. Epididymal fat is known to respond to short-term high fat feeding, and the percentage weight gain is also higher in epididymal fat than in inguinal tissue, which requires longer exposure to a high fat diet for weight gain [46]. Differential gene expression analysis showed more genes changing expression in epididymal tissue than in inguinal tissue. In the inguinal adipocytes, the pathways suggest that the cells are storing fatty acids and growing in size while being invaded by macrophages, whereas the epididymal tissue reacts more to insulin activity and shows more neurovascular tissue development under obesity. In the inguinal tissue, the immune system overshadows other functions of the cells, as shown earlier [68].
Common methylation and expression changes in epididymal and inguinal tissue
A higher number of DMRs and differentially expressed genes is observed in the epididymal tissue than in the inguinal tissue. S100a8 mRNA was highly expressed in white adipose tissue of mice and in macrophages, with increased expression in mature adipose tissue from obese mice [31]. Also, high circulating levels of S100a8 are observed in obese male individuals [57]. S100a8, along with S100a9, is an endogenous ligand of TLR4. TLR4 is known to play an important role in systemic glucose and lipid metabolism as well as in obesity-induced adipose tissue inflammation. Lgals3 up-regulation indicates that both epididymal and inguinal tissues in HFD are inflamed [54]. Lipe is one of the major enzymes for fat cell lipolysis, in which triglycerides are converted to free fatty acids (FFA). During obesity, higher concentrations of FFA are present in blood, so the cells do not need lipolysis to obtain more FFAs and lipolysis is down-regulated. Aacs is a ketone-utilising enzyme that provides acetyl substrate for lipogenesis, and knock-out of the gene in mice leads to suppression of adipocyte markers like Pparγ and C/ebp-α, which play an important role in adipocyte differentiation [29]. As HFD provides a surplus of fatty acids to the cells, there is no need for the cells to convert other compounds to fatty acids, and lipogenesis is reduced in the adipocytes under obesity. Prcp inactivates the α-MSH hormone and acts as an appetite stimulant [58]. Prcp up-regulation is associated with obesity, diabetes mellitus and cardiovascular abnormalities. Obesity does not affect just the adipocytes but also the surrounding cells, such as stromal vascular tissue and nerves. The growth of adipose tissue is highly linked to angiogenesis [16], and Nrp2 is selectively required for the development of small lymphatic vessels and capillaries [76]. It has been suggested that neovascularization might play a critical role in adipose tissue growth [55]. Myosin, heavy chain 9 (Myh9), a non-muscle myosin gene known to be associated with diabetic nephropathy [15], is also up-regulated.
Inguinal specific methylation and expression changes in diet-induced obesity
Overexpression of Cpt1 significantly reduces the content of intracellular non-esterified fatty acids (NEFAs) when adipocytes are challenged with fatty acids. These changes were caused by an increase in fatty acid uptake and a decrease in fatty acid release [25]. Methylation differences at Cpt1 are associated with triglyceride levels, BMI and WHR [18]. On the other hand, Pkig deletion or knockdown simultaneously increases osteogenesis and decreases adipogenesis [13]. Thus, fatty acid accumulation is happening in inguinal tissue, but the inguinal-specific genes point towards down-regulation of adipogenesis.
Epididymal specific methylation and expression changes in diet-induced obesity
The genes specifically regulated in the epididymal adipocytes from HFD mice are related to adipocyte differentiation and adipose tissue development. Hmga1 forms a complex with Retinoblastoma protein (Rb), which positively regulates adipocyte differentiation, and is also a downstream nuclear target of insulin signalling [22]. C3ar1, which is up-regulated in obese co-twins as well as in the epididymal HFD samples of our experiments, is among the three new genes with causal relationships to obesity identified in a rodent study integrating gene expression and DNA variation [52]. Also, C3aR, a Gi-coupled G protein-coupled receptor, plays a significant role in macrophages and adipose tissue and controls energy homeostasis and insulin resistance under HFD exposure [47]. Bach1, a leucine zipper transcription factor, is hypomethylated and up-regulated in the epididymal adipocytes. Downstream targets of Bach1 are involved in oxidative stress response and cell cycle, which matches the hypoxia and increase in cell number that adipose tissue undergoes during obesity [72]. Aldh1a3, also known as RALDH3, was earlier reported as not expressed in subcutaneous or visceral adipose tissue [59], but we find it up-regulated in HFD in epididymal tissue. Retinaldehyde dehydrogenase 1 (Raldh1/Aldh1a1) is another member of the same Rald-catabolizing enzyme family as Aldh1a3. Raldh1 knockouts are resistant to diet-induced obesity and insulin resistance and also showed
Upregulated targets
Downregulated targets
S100a8, Gng2, Tpst1, Bach1, Fgd6, Aldh1a3, Notch3,
Man1c1, Tnfrsf1b, Myh9, Rab31, Timp3
Plxdc2, Tm6sf1, Zfp521, Rcsd1, Hmha1, Aldh1a3,
Ptpn18, Notch3, Elk3, H2-Ab1, Dusp7, Rnf128, Cd44,
Fbxl7, Nrp2, Zswim6, Man1c1, Lgals3, Tnfrsf1b, Lyn,
Rab31, Itga9, Tspan2, Timp3
Mmp11, Elmo1, Zfp710, Fn1, Gng2, Tnfrsf1b, Hmha1,
Myh9, Bach1, Prcp, Cysltr1, Itga9, Rps26, Cyb5r4,
Tpm3, Bach1, Gng2, Tnfrsf1b
Plxdc2, Gng2, Tm6sf1, Gmfg, Hmha1, Fgd6, Icam1,
Cyfip1, Rps26, Cyb5r4, Ptpn18, Elk3, Elmo1, Dusp7,
Fbxl7, Nrp2, Ncf4, Zswim6, Ncoa7, Man1c1, Lgals3,
Nckap1l, Myh9, Lyn, Rab31, Tnfaip8l1, Tpm3
Elmo1, Cd44, Nrp2, Zfp710, Gng2, Zswim6, Rcsd1,
Hmha1, Bach1, Abi3, Prcp, Fgd6, Notch3
Bach1, Fgf7, Clic1, Prcp, Icam1, Rab31, Cyb5r4, Lgals3,
Notch3, Cd300lg
Elmo1, Emr1, Gpr65, Lgals3, Gmfg, Fgf7, Clic1, Prcp,
C3ar1, Pla1a, Tnfaip8l1, Tspan2, H2-Aa
Elmo1, Plxdc2, Fbxl7, Zfp710, Gng2, Ncoa7, Hmga1,
Man1c1, Tnfrsf1b, Bach1, Itga9, Timp3, Cyb5r4,
Mcm10, Elk3, 4930503l19rik
Plxdc2, Dusp7, Ncoa7, Hmga1, Rcsd1, Clic1, Prcp,
Notch3, Elk3, Tpm3
Plxdc2, Cd44, Ncf4, Mrc1, Ncoa7, Nckap1l, Gmfg,
Hmha1, Lyn, Prcp, Pla1a, Notch3
Cd44, Elmo1, Elk3, Ptprc
Mmp11, Elmo1, Mpzl1, Fn1, Lgals3, Gmfg, Hmha1,
Clic1, Abi3, Icam1, Rps26
Plxdc2, Fbxl7, Fn1, Gng2, Ncoa7, Man1c1, Myh9, Fgd6,
Atp6v0e2, A530016l24rik, Fgf10, Hspb8,
Slc1a5, Pcx, Mgst1, Lipe, Fbxo21, Tns1,
Sh3pxd2a, Sod3, Nrip1, Cyp21a1, Isoc2a,
Pde1a, Aacs, Galnt2
Gnai1, Tns1, Cib2, Sod3, Sgce, Otud3,
Nrip1, Fgf10, Ntrk2, Galnt2, Plcd1
Nrip1, Gnai1, Fgf10, Cib2, Slc1a5, Adipor2, Pde1a, Hspb8, Slc1a5, Adipor2,
Fbxo21, Gnai1, Sh3pxd2a, Zdhhc5,
Atp6v0e2, Nrip1, Lipe, Chchd6, Adipor2,
Aacs, Galnt2
Grina, Nrip1, Pcx, Isoc2a, Aacs, Galnt2
Lrrc58, D830050j10rik, Adipor2, Lipe
Tns1, Hspb8, Galnt2, Plcd1
C4a, Gnai1, Slc1a5, Ntrk2, Sod3, Aacs
Grina, Gnai1, Sh3pxd2a, Sod3, Hspb8,
Otud3, Grina, Chchd6, Zdhhc5
Sgce, Nrip1
D830050j10rik, Aacs, Hspb8, Plcd1
Lrrc58, Atp6v0e2, Nrip1, Gnai1, Fgf10,
Sh3pxd2a, Slc1a5, Ntrk2, Aacs
Table 3: Transcriptional control of the differentially expressed genes with methylation region in the epididymal diet model
with the effect range with the adjusted p-value < 0.01 for the group and the target genes
increased energy dissipation. It has also been suggested that Ralds transcriptionally regulate the metabolic responses to a high-fat diet [77]. Mcm10 belongs to a class of proteins involved in the initiation of eukaryotic genome replication and plays a role in preventing DNA damage during replication. Mcm10 up-regulation in HFD indicates that the underlying DNA repair mechanisms need to be active in obese tissues, which might be either the cause or the effect of hypomethylation.
Regulatory mechanisms
The transcriptional controls in the two tissues show five common TFs that share target genes. Klf4 is known to regulate adipogenesis: together with Early Growth Response 2 (Krox20), it cooperatively trans-activates CCAAT/enhancer binding protein beta (C/EBPβ), which in turn activates C/EBPα and PPARγ [6]. Sumoylation of Klf4 also stimulates adipocyte differentiation [63], making it an important regulator of adipose tissue growth during obesity. The two TFs Fli1 and Scl are regulators of hematopoiesis. In epididymal adipocytes, we found two other regulators of hematopoiesis, Gata1 and Runx1, with target sites in differentially expressed and methylated genes [53]. A substantial increase is seen in lymphopoietic and hematopoietic processes in HFD mice, which indicates dysregulation of the immune system by obesity. The catalytic subunits of Brg1/Smarca4 interact with PPARγ and are required for the induction of adipogenic transcription programs [56]. Brg1-containing BAF chromatin remodelling complexes have been shown to be essential for embryonic development and for the reprogramming of somatic cells [65]. Brg1 has been found to regulate the pluripotency factors [60] and to alter the expression of genes influencing cell proliferation and metastasis in cancer cells by demethylation [4].
In conclusion, we show that mature adipocytes from inguinal and epididymal fat acquire tissue-specific changes in DNA methylation in diet-induced obesity. Obesity, whether caused genetically or by challenging the mice with HFD, triggers hypomethylation in both inguinal and epididymal tissues. The changes in methylation as well as in gene expression are more pronounced in epididymal tissue than in inguinal tissue. The hypomethylation could be the result of an obesity-induced deficiency of methyl donors such as folate or methionine. These deficiencies lead to DNA damage during replication with faulty DNA repair. We also found numerous inflammation-related genes to be up-regulated in the two tissues, which is due to invasion of these tissues by macrophages. The inguinal tissue is undergoing hypertrophy by accumulation of triglycerides inside the adipocytes, whereas the epididymal tissue shows extensive regulation of adipogenesis along with impaired insulin signalling. This study suggests that epididymal tissue reacts more vigorously when challenged with a high fat diet and that there are mechanistic differences in the obesity of the two tissues.
[1] “”.
[2] In Dietary Reference Intakes for Thiamin, Riboflavin,
Niacin, Vitamin B6, Folate, Vitamin B12, Pantothenic
Acid, Biotin, and Choline, The National Academies Collection: Reports funded by National Institutes of Health,
Washington (DC), 1998, Institute of Medicine (US) Standing Committee on the Scientific Evaluation of Dietary
Reference Intakes and its Panel on Folate, Other B Vitamins, and Choline Book.
[3] A. H. Bakker, F. M. Van Dielen, J. W. Greve, J. A. Adam,
and W. A. Buurman, “Preadipocyte number in omental and subcutaneous adipose tissue of obese individuals”,
Obes Res, Vol. 12, No. 3, pp. 488–98, 2004.
[4] F. Banine, C. Bartlett, R. Gunawardena, C. Muchardt,
M. Yaniv, E. S. Knudsen, B. E. Weissman, and L. S. Sherman, “SWI/SNF chromatin-remodeling factors induce
changes in DNA methylation to promote transcriptional
activation”, Cancer Res, Vol. 65, No. 9, pp. 3542–7, 2005.
[5] N. Billon and C. Dani, “Developmental origins of the
adipocyte lineage: new insights from genetics and genomics studies”, Stem Cell Rev, Vol. 8, No. 1, pp. 55–66,
[6] K. Birsoy, Z. Chen, and J. Friedman, “Transcriptional regulation of adipogenesis by KLF4”, Cell Metab, Vol. 7,
No. 4, pp. 339–47, 2008.
[7] B. Caballero, “The global epidemic of obesity: an
overview”, Epidemiol Rev, Vol. 29, pp. 1–5, 2007.
[8] R. Caesar, M. Manieri, T. Kelder, M. Boekschoten,
C. Evelo, M. Muller, T. Kooistra, S. Cinti, R. Kleemann, and C. A. Drevon, “A combined transcriptomics
and lipidomics analysis of subcutaneous, epididymal and
mesenteric adipose tissue reveals marked functional differences”, PLoS One, Vol. 5, No. 7, p. e11525, 2010.
[9] J. Carlin, R. George, and T. M. Reyes, “Methyl donor supplementation blocks the adverse effects of maternal high
fat diet on offspring physiology”, PLoS One, Vol. 8,
No. 5, p. e63549, 2013.
[10] E. Casanueva, A. Drijanski, A. C. Fernandez-Gaxiola,
C. Meza, and F. Pfeffer, “Folate deficiency is associated
with obesity and anemia in Mexican urban women”, Nutrition Research, Vol. 20, No. 10, pp. 1389–1394, 2000.
[Figure: enrichment scores of gene sets in the diet model. Inguinal tissue: CELL CYCLE (0.29), MEIOSIS (1.58). Epididymal tissue: GLYCOLYSIS (−1.51), PI3K CASCADE (−1.66).]
Figure 7: Boxplot showing the mean methylation fold changes in the three classes of gene expression (up-regulated, down-regulated and genes not changing in expression) in two tissues
Part IV
Genotype to Phenotype
Chapter 7
Discovering phenotypes
Differences between individuals stem from variations in their genomes, which
govern their phenotypic differences. Understanding phenotypes therefore
requires knowledge of the genotype of an individual.
The completion of the Human Genome Project has provided vast information
about human genetic diversity. Some of the most prominent international
initiatives to catalogue human variation are the International HapMap Project
[226] and the 1000 Genomes Project [227], which have led to an understanding
of population-specific differences in humans by sampling the major
populations around the world. These common variations are
stored in databases like dbSNP [104], which stores information about SNPs
and indels found in multiple organisms along with their allele frequencies
and population-specific information. Similarly, knowledge of medically
important human variation is critical and is therefore collected in resources like the
Human Gene Mutation Database (HGMD) [106] and ClinVar [107]. Some of
these databases are disease specific, e.g. the Catalogue Of Somatic Mutations In
Cancer (COSMIC) [228] and the Obesity Gene Atlas in Mammals [220]. The Online
Mendelian Inheritance in Man (OMIM) [229] database catalogues diseases
with all known genetic information. OMIM can be queried using either a
gene or a phenotype. Another resource documenting phenotypic or disease
information for genes is MalaCards [230], derived from GeneCards [231].
GWAS efforts have resulted in the discovery of a large number of SNPs
associated with a variety of phenotypes. According to the GWAS catalogue [232],
by 2011, 1617 SNPs had been published as GWAS ‘hits’ (p-value ≤ 5 × 10⁻⁸)
for 249 traits. SNPedia [233] is a wiki-based resource documenting the effects
of variations on phenotypes, gathered from publications. The two sections of
this chapter describe the efforts in two projects to utilise this knowledge of
variation-trait associations. In the first project, variation-phenotype
associations are used to compare phenotypes between populations. The second,
a personal genomics project, suggests probable phenotypes for a sequenced
ancient genome. Known variations are annotated using prior
knowledge, whereas novel mutations are assessed by their predicted
effect on the resulting protein.
7.1 Danish Pan-genome
Genetic variation in populations is shaped by natural selection acting on
genes and alleles, and variants that contribute to survival persist over
generations. Analytical tools in genetics therefore need to take into consideration
population history, population substructure and admixture of populations.
If ignored, these factors may confound genetic results and lead to
suboptimal findings. In the de novo assembly of the Asian and African genomes from
NCBI, 5 Mb of novel sequences were identified which are not present in the
reference human genome [234]. Most of these novel sequences are individual
and population specific, and comparative analysis with other species
indicated that they may be functional. Thus, even healthy individuals differ
from the reference genome. Accordingly, when applying NGS methods
in population-specific studies, it is necessary to use population-specific
background data. This need guided the
development of population-specific genomes, called pan-genomes. Current
genetic variation catalogues like 1000 Genomes are based on populations
divided into 4 continents and then subdivided into regions. For analysing
population-specific data, a population-specific reference
genome would be an advantage. Owing to the benefits of pan-genomes, a
project was designed to assemble a reference genome for the Danish population.
The study discussed here is a part of the Danish pan-genome pilot project. The
manuscript of the Danish pan-genome pilot project is under preparation.
Data and Analysis
The pilot study for the Danish pan-genome is based on sequencing data from ten
randomly selected trios (mother-father-child) from the Copenhagen Family
Bank. They were sequenced on an Illumina HiSeq 2000 at an average sequencing
depth of 40X. Sequencing data were mapped to the reference genome with
BWA-MEM (version 0.7.5a) [235]. SAMtools [24] and Picard were used to
prune the alignment files and to mark duplicate reads. GATK was used to
call variants from the mapped data [26].
The SNVs and indels that occur in any of the parents were annotated for
their effect on proteins using the Variant Effect Predictor (VEP) tool from Ensembl
[105]. The analysis focused on loss-of-function (LOF)
variations. SNPs causing termination of the protein (stop gain), a
change of amino acid in the protein (missense) or affecting a splice site were
considered LOF variations. Indels were considered LOF if they were
frameshift, splice acceptor or splice donor variants. A variation mapping to
multiple transcripts may have different consequences for the proteins, such as
truncation, substitution or ablation. Therefore, the most severe consequence
observed across the set of transcripts was assigned to the variation. SNPs affecting
transcripts without an established consensus coding DNA sequence (CCDS)
were filtered from further analysis.
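The rule of assigning the most severe consequence across transcripts can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the severity ranking is an assumed, shortened ordering, the consequence names follow Sequence Ontology style terms, and the helper names are hypothetical.

```python
# Illustrative severity ranking, most severe first. The order is an
# assumption made for this sketch, not the ranking used in the thesis.
SEVERITY = [
    "transcript_ablation",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "synonymous_variant",
]
RANK = {term: index for index, term in enumerate(SEVERITY)}

# Consequences treated as LOF, following the definitions in the text.
LOF_SNV = {"stop_gained", "missense_variant",
           "splice_acceptor_variant", "splice_donor_variant"}
LOF_INDEL = {"frameshift_variant",
             "splice_acceptor_variant", "splice_donor_variant"}


def most_severe(consequences):
    """Return the most severe known consequence among per-transcript calls."""
    known = [c for c in consequences if c in RANK]
    if not known:
        return None
    return min(known, key=RANK.__getitem__)


def is_lof(consequences, is_indel):
    """Classify a variant as LOF from its most severe consequence."""
    worst = most_severe(consequences)
    return worst in (LOF_INDEL if is_indel else LOF_SNV)


# A SNP that is missense in one transcript and stop-gained in another
# is reported as stop-gained.
print(most_severe(["missense_variant", "stop_gained"]))
```

Taking the minimum rank over all transcripts means that a single severely affected transcript is enough to flag the variant, which matches the conservative assignment described above.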
The frequencies of the variations were calculated from the 16 parents
with Danish ancestry going back two generations. To calculate the allele
frequency (AF) in the Danish population, the alleles were modelled as binary
variables, being either reference or variant, and the AF was estimated from a
binomial distribution. To account for the uncertainty caused by the small sample
size, we calculated a 95% confidence interval for the AF [236].
The AF boundaries from the Danish cohort were compared with the AF for
the European (EUR) population from 1000 Genomes.
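As an illustration of the AF estimate and its confidence interval, the sketch below uses the Wilson score interval for a binomial proportion. The text does not state which interval method was used, so the Wilson interval, the function names and the example counts are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(variant_count, total_alleles, confidence=0.95):
    """Wilson score interval for a binomial allele frequency."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = variant_count / total_alleles
    denom = 1 + z * z / total_alleles
    centre = (p + z * z / (2 * total_alleles)) / denom
    half = z * sqrt(p * (1 - p) / total_alleles
                    + z * z / (4 * total_alleles ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def outside_interval(reference_af, interval):
    """True if a reference AF (e.g. EUR) lies outside the cohort interval."""
    low, high = interval
    return reference_af < low or reference_af > high


# 16 parents carry 32 alleles at an autosomal site.
# The count of 12 variant alleles is hypothetical.
danish_ci = wilson_interval(variant_count=12, total_alleles=32)
print(danish_ci, outside_interval(0.05, danish_ci))
```

With only 32 alleles the interval is wide, so only variants whose EUR frequency differs substantially from the Danish estimate would fall outside it.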
Results and Discussion
In the pilot pan-genome dataset, 8.5 million SNVs and 1.8 million indels were
detected. A large part of the genomic variation present in Danes seems to be
shared with the EUR population. Even though LOF variations are
deleterious at the protein level, some could confer evolutionary advantages, either
through the absence of a function or through the ability to develop novel ones. The
well-known stop-gained SNP from the European population, rs497116, occurring
in the Caspase 12 (CASP12) gene, was also observed with 100% frequency
in the pan-genome dataset. TT is the common genotype in Utah residents with
Northern and Western European ancestry (CEU), and all Danish participants
were homozygous for it. The derived “T” allele encodes an inactive CASP12, which leads to
increased resistance to various infections and consequently underwent
positive selection. Eurasians are practically fixed for the inactive variant,
whereas in Sub-Saharan Africa the active variant is still common (24%).
Studies have found this to be a pre-Neolithic event [237]. Another such example is
a pair of nonsense and missense mutations in coiled-coil alpha-helical rod
protein 1 (CCHCR1). Together, these variations determine the risk of
psoriasis. Psoriasis has a prevalence of 0.37% in the Danish population.
Variants with functional impact and an AF that differs significantly
from that of the parent population, i.e. EUR, could be termed Danish-specific
variations. These are of special interest for understanding some
of the genomic differences that set the Danish population apart from other
related populations, and could help in determining important phenotypic
traits such as disease risk or drug metabolism. Among the variations with a
difference in frequency, we found a novel stop-gain mutation in the gene
ubiquitin specific peptidase 17-like family member 11 (USP17L11). This
mutation truncates the protein, leaving only 25% of it, and is present
in 9 individuals from the study. The mutation removes the
proton acceptor and a predicted nuclear localisation signal. The function of
the protein has not been determined experimentally; by homology, it is inferred
to regulate cellular processes. A set of known SNPs with different AFs
in the Danish cohort and the EUR population was also detected in this
data. A stop-gained mutation in H2B histone family member M
(H2BFM), which is more frequent in the Danish dataset than in EUR, truncates
72% of the protein. Due to a missense mutation in Cyclin-Dependent Kinase
11A (CDK11A), an arginine is replaced by tryptophan at position 93
of the protein, disrupting a highly conserved residue and likely
affecting the structure and nearby phosphorylation sites.
A high number of SNVs in non-coding regions showed frequency differences
between the Danish cohort and the EUR population. Some
of them were annotated to lie within TF binding sites or miRNA coding
regions. The HaploReg tool [239] was used to annotate the non-coding SNPs.
They were only considered functional if there was experimental evidence of
TF binding and they also overlapped a DNase hypersensitivity site.
Indels annotated as LOF were filtered to remove those in transcripts without a
known CCDS, resulting in 1.1M indels. A filter retaining indels that knock out
>95% of the gene yielded 335 LOF indels: 304 frameshift, 11 splice acceptor,
19 splice donor and 1 stop-gain. The most
frequent class of genes disrupted by indels is olfactory receptors, followed by
zinc finger proteins and the HLA region. Olfactory receptors are significantly
enriched for extremely large proteins whereas HLA undergoes numerous
rearrangements. For these reasons, these classes tend to accumulate
variations. The number of variations in different functional classes
was compared to the LOF data for CEU from 1000 Genomes [240].
An increase in the percentage of variations in all functional classes as compared
to the 1000 Genomes data was observed. This annotation data needs more
thorough investigation and validation.
This pilot project will eventually lead to a bigger study with 50 trios sequenced at 80X to establish a Danish population-specific reference genome.
This will provide a reference genome for low-coverage resequencing studies of large cohorts in both evolutionary and medical research.
The data from the pilot study will be verified in the larger study using
higher depth and multiple pipelines to reduce false positives.
Personal genomics
Sequencing prices have declined over the last few years and data analysis
tools are catching up with sequencing data generation, leading to the
sequencing of numerous genomes at higher depth and coverage. In 2007,
the genomes of James Watson and Craig Venter were published, adding a new
research milestone to genetics, called personal genomics [241, 242]. There
is tremendous excitement about these studies, and companies like
23andMe, deCODEme and Navigenics started to offer personal genome information
services. Danish science writer Lone Frank describes her experiences with
personal genome information in her book “My Beautiful Genome: Exposing Our
Genetic Future, One Quirk at a Time”. She guides readers
through various aspects of personal genomics while exploring her own genes,
ancestry and behaviour. Through her own self-discoveries,
she explains the future medical benefits of the genomic information
predicted from variations, but also points to the shortcomings of
this data and the uncertainty surrounding the interpretation of evidence
from the genome. Concerns have been raised regarding the clinical utility of
these tests. The results provide a relative risk of having or not having a trait
compared with the population. Also, other factors like environment and lifestyle are
not incorporated in these tests, adding high uncertainty to the results.
7.2 Ancient Genome
Individual genomes are used not only for estimating disease risk but also for other purposes.
There have been successful studies that use personal genomes from fossils
to track the migration of human populations around the world in the past
[243]. Also, by comparing the variations in these ancient genomes to known
trait-associated variations, the physical and anthropological characters of
the ancient individuals can be predicted. This was applied for the phenotypic
characterization of the Saqqaq genome, from an individual of the extinct
Palaeo-Eskimo Saqqaq culture sequenced from a lock of hair preserved in permafrost
[243], and the genome of an Aboriginal Australian sequenced from a lock of
hair found in a museum [244]. A more recent study concerns the phenotypic
characterization of a 7000-year-old Mesolithic man found in Spain [245].
In this chapter, I describe a study in which the ancient DNA of a male
infant (Anzick-1), recovered from the Anzick burial site in western Montana,
was sequenced and analysed for phenotypic traits.
Data and Analysis
For the phenotypic analysis of the Anzick-1 genome, we annotated the variations
obtained from whole genome sequencing for
their functional effects on the resultant proteins using the Ensembl database
[105]. Genes harbouring LOF variations were annotated for associations
with diseases using GeneCards [230], and we also included traits observed
in Native Americans as documented in OMIM [229]. In a high-throughput
approach, SNP-phenotype associations were selected from the National
Human Genome Research Institute (NHGRI) GWAS catalog [246] (p-value
< 1 × 10⁻⁷), 23andMe and SNPedia [233] for phenotypes
that can be classified broadly into appearance and anthropometric traits,
cognitive function, nutritional preferences, metabolism,
personality, biochemical traits and diseases. Type 2 diabetes (T2D) associated
SNPs were extracted from a recent review [247]. Genetic risk scores
(GRS) were calculated for multi-SNP phenotypes as the count of risk alleles
normalised by the highest possible risk allele count. To minimise the risk of
miscalls caused by ancient DNA damage, all heterozygous (C>T) and (G>A)
variants were filtered out
(Figure 7.1). For phenotypes with a single known associated SNP, the risk
was estimated by comparing the Anzick-1 genotype to the risk allele (Figure
7.2). The details of the processing of sequencing data and variant calling can
be found in the article by Rasmussen et al. [248].
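The GRS calculation and the damage filter described above can be sketched as follows. This is a minimal illustration under stated assumptions: genotypes are represented as pairs of alleles, "C>T" and "G>A" are read as reference>alternate, the normaliser is two alleles per retained SNP, and the SNP records are hypothetical rather than taken from the study.

```python
def is_damage_prone(ref, alt, genotype):
    """Heterozygous C>T or G>A call, the pattern filtered out in the text."""
    heterozygous = genotype[0] != genotype[1]
    return heterozygous and (ref, alt) in {("C", "T"), ("G", "A")}


def genetic_risk_score(snps):
    """Risk-allele count divided by the highest possible count (2 per SNP)."""
    kept = [s for s in snps
            if not is_damage_prone(s["ref"], s["alt"], s["genotype"])]
    if not kept:
        return None
    risk_alleles = sum(s["genotype"].count(s["risk"]) for s in kept)
    return risk_alleles / (2 * len(kept))


# Hypothetical SNPs for illustration only.
example_snps = [
    {"ref": "A", "alt": "G", "risk": "G", "genotype": ("G", "G")},  # 2 risk alleles
    {"ref": "C", "alt": "T", "risk": "T", "genotype": ("C", "T")},  # filtered out
    {"ref": "T", "alt": "C", "risk": "T", "genotype": ("T", "C")},  # 1 risk allele
]
print(genetic_risk_score(example_snps))
```

Because the score is normalised by the number of SNPs that survive the filter, scores for phenotypes with different numbers of associated SNPs stay on the same 0 to 1 scale.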
The Anzick-1 genotypes for different phenotypes were compared to four
1000 Genomes super-populations, namely Admixed American (AMR), East
Asian (ASN), European (EUR) and African (AFR) (Figure 7.1). The variants
associated with interesting phenotypes were mapped to the diploid Anzick-1 genome, which was divided into Native American, Asian, European and
African ancestry and visualized using idiographica web tool [249] (Figure
Results and Discussion
As expected, physical traits such as dark hair and eyes, medium dark skin
colour and average height were found to be similar to those of modern-day Native
Americans, the descendants of the population represented by this ancient genome
(Figure 7.1). The GRS suggest
increased risk for certain modern lifestyle diseases including T2D, coronary
heart disease, stroke, celiac disease and obesity, as indicated by body mass
index. Some of these diseases are present at high prevalence in contemporary
Native American populations [250, 251, 252], which is also reflected in the ancestry
painting (Figure 7.3). The Anzick-1 genome has a higher number of T2D risk
alleles than three modern populations (ASN, AMR, EUR);
however, the number was similar to that of modern-day Africans. The decreased burden of risk
alleles for T2D in non-African populations has been suggested to represent
an adaptation to agriculture [253]. This hypothesis is consistent with the similarly
high number of risk alleles in the Anzick-1 individual, who was likely a hunter-gatherer. The
thrifty phenotype hypothesis proposes that a high risk of chronic conditions such
as coronary heart disease, T2D and stroke is associated with limited
nutrition during pregnancy and infant growth [254]. Such a “thrifty” phenotype
was likely an advantage in populations where food supplies were
scarce and sporadic. The APOE e3/e4 genotype, along with a missense mutation in the Apolipoprotein E (APOE) gene, indicates a high risk of Alzheimer’s
disease for Anzick-1. The e4 APOE allele additionally supports the “thrifty”
gene hypothesis in the Anzick-1 genome [255]. The genetic risk of celiac
disease for the Anzick-1 individual was noticeably high compared to all
modern-day populations. This suggests that the individual
might not have tolerated gluten as well as modern populations do,
possibly because modern populations have adapted to a gluten-rich diet since
the advent of agriculture.
Figure 7.1.
Density plots showing the distribution of GRS for
ten interesting phenotypes across the 1000 Genomes Project super-populations
ASN (orange), AFR (red), AMR (green) and EUR (blue),
with the Anzick-1 genome score denoted as a dashed vertical black line.
The numbers in parentheses are the number of SNP sites used for
calculating the GRS (LR = low risk and HR = high risk).
Panels: hair colour (5), height (115), skin colour (9), eye colour (6),
body mass index (21), type 2 diabetes (36), coronary heart disease (18),
stroke (4), celiac disease (14) and C-reactive protein levels (17).
The genotype of the Anzick-1 individual for the variant associated with
cleft lip suggested an increased risk of cleft lip, similar to Native Americans
who have the highest worldwide frequency of this disease [256]. Absence of
the genotype that is responsible for working copy of the Actinin Alpha 3
(ACTN3), suggests that this individual was more likely to have endurance
type muscles rather than sprinting. This variation in ACTN3 is suggested to
be positively selected variant in recent populations [257].
Figure 7.2. Heatmap that compares the Anzick-1 genotypes, on the
scale from 0 (for absence of the effect allele) to 1 (for homozygous), for single
phenotype-associated SNPs to the average frequency of the effect allele
in the 1000 Genomes super-populations ASN, AFR, AMR and EUR.
Rows: Muscle performance, Earwax type, Shoveled teeth, Cleft lip,
ApoE E4 − Alzheimer's disease (two rows), Vitamin D receptor activity,
Empathetic behavior, Intracranial volume, Anemia and breast cancer,
GABA−transaminase deficiency, Pain sensitivity (two rows).
Additionally, the Anzick-1 genome had a variant in the oxytocin receptor
(OXTR), which has been associated with optimism, social behaviour and
empathy. The genotype of the variant in the vitamin D receptor was associated with increased activity of the protein, and two independent variants
suggested increased pain sensitivity. A missense mutation in BRCA1
Interacting Protein C-Terminal Helicase 1 (BRIP1) is an indicator of increased
risk of anaemia and breast cancer. The Anzick-1 individual has a missense
mutation in the gene coding for 4-Aminobutyrate Aminotransferase (ABAT),
leading to an increased risk of GABA-transaminase deficiency, which causes
mental abnormalities (Figure 7.2). Also, the Anzick-1 GRS suggested lower
baseline levels of the inflammation marker C-reactive protein (CRP) in the
blood, which may indicate a decreased risk of the inflammation associated
with different metabolic diseases. It should be noted that the GRS as well
as the single-SNP comparisons were derived from variants that have largely
been identified in population-based cohort studies of individuals of European
ancestry, which may lead to ascertainment bias.
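The two kinds of score compared in this section can be illustrated with a short sketch. It computes a per-SNP genotype score on the 0 to 1 scale used in Figure 7.2 and an unweighted GRS as the mean of these scores. This is a minimal sketch under stated assumptions and not the procedure used in the study: the unweighted mean is only one simple way of defining a GRS (alleles could instead be weighted by effect size), and the SNP identifiers, genotypes and allele frequencies below are hypothetical.

```python
from typing import Dict, Tuple


def genotype_score(genotype: Tuple[str, str], effect_allele: str) -> float:
    """Fraction of the two alleles that match the effect allele: 0, 0.5 or 1."""
    return sum(allele == effect_allele for allele in genotype) / 2.0


def unweighted_grs(genotypes: Dict[str, Tuple[str, str]],
                   effect_alleles: Dict[str, str]) -> float:
    """Mean genotype score over the SNPs typed in the individual.

    Assumption: every SNP contributes equally (no effect-size weights).
    """
    scores = [genotype_score(genotypes[snp], allele)
              for snp, allele in effect_alleles.items() if snp in genotypes]
    if not scores:
        raise ValueError("no overlapping SNPs between genome and SNP panel")
    return sum(scores) / len(scores)


# Hypothetical SNP panel, individual genotypes and population frequencies.
effect_alleles = {"snp1": "A", "snp2": "T", "snp3": "G"}
individual = {"snp1": ("A", "A"), "snp2": ("C", "T"), "snp3": ("C", "C")}
population_freq = {"snp1": 0.30, "snp2": 0.55, "snp3": 0.10}

individual_grs = unweighted_grs(individual, effect_alleles)
# On the same 0-1 scale, the population average is the mean allele frequency.
population_mean = sum(population_freq.values()) / len(population_freq)
```

With these definitions the individual score and the population average lie on the same scale, so the individual can be placed directly on a population distribution such as those in Figure 7.1.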
During the peer review process, this section of the article received criticism from one of the three reviewers, while the other two did not comment.
Following the suggestions of the third reviewer, the section was removed
from the final article. The reviewer’s comments, and the clarifications and rationale offered in response, are discussed below.
Figure 7.3. Ancestry painting showing the regions in the Anzick-1
genome that are predicted to be of Asian (yellow), European
(blue), African (red) or Native American (green) ancestry. The locations of SNPs used for the phenotype analysis are marked by triangles
(multi-locus traits) and circles (single-locus traits) and coloured by their
phenotype associations. Legend: Eye colour, Hair colour, Skin colour,
Coronary heart disease, Celiac disease, CRP level, Shovel shaped teeth,
Wet earwax.
The third reviewer recommended using a p-value cut-off of 5×10^-8 for
trait-associated SNPs from GWAS catalog studies, rather than the
10×10^-7 used in the study. Lowering the cut-off in this way would reduce the
number of SNPs for some phenotypes and eliminate other phenotypes
entirely. Moreover, because there are indications of associations in
the tail of SNPs ranked by p-value, these less significant SNPs were included
in the analysis. For the high risk of metabolic diseases, the reviewer questioned whether SNPs discovered in modern, sedentary European
populations have the same implications for an ancient hunter-gatherer individual. As
an explanation for these phenotypes, it was discussed that these SNPs have been
associated with the thrifty phenotype, which was an advantage for an ancient
population facing scarcity of nutrition and food during prenatal life that
extended into adult life. Also, there is no public GWAS on Native Americans
(to the best of our knowledge), and, as in this study, GRS have previously
been constructed from variants identified in a separate population [258]. The reviewer also commented that the selection of the phenotypes was biased. The phenotypes assessed
in the analysis were broadly based on appearance and anthropometric traits,
cognitive function, nutritional preferences and metabolism, behaviour and
personality. The diseases included are known to occur at high prevalence in
the modern Native American population. The rationale behind using these
phenotypes was to connect the heritability of these traits from the ancient
to the modern population.
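The consequence of the stricter p-value cut-off discussed above can be made concrete with a small sketch that counts the SNPs retained per phenotype under each threshold. The two thresholds are taken from the text; the association records (phenotype, SNP identifier, p-value) are hypothetical and only serve to show how a phenotype can lose all of its SNPs.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Association = Tuple[str, str, float]  # (phenotype, snp_id, p_value)


def snps_per_phenotype(associations: Iterable[Association],
                       p_cutoff: float) -> Dict[str, int]:
    """Count, per phenotype, the SNPs whose p-value is below the cut-off."""
    counts: Counter = Counter()
    for phenotype, _snp, p_value in associations:
        if p_value < p_cutoff:
            counts[phenotype] += 1
    return dict(counts)


STUDY_CUTOFF = 10e-7     # cut-off used in the study, as stated in the text
REVIEWER_CUTOFF = 5e-8   # cut-off recommended by the reviewer

# Hypothetical association records for illustration only.
associations = [
    ("Height", "snpA", 3e-9),
    ("Height", "snpB", 4e-7),
    ("Stroke", "snpC", 2e-7),
    ("Stroke", "snpD", 8e-7),
]

lenient = snps_per_phenotype(associations, STUDY_CUTOFF)
strict = snps_per_phenotype(associations, REVIEWER_CUTOFF)
# Phenotypes that can no longer be scored under the stricter cut-off.
lost_phenotypes = sorted(set(lenient) - set(strict))
```

In this toy input the stricter cut-off halves the SNPs available for one phenotype and removes the other phenotype altogether, which is the trade-off described in the response to the reviewer.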
The reviewer questioned the comparison with four major populations instead
of North and South Native Americans. We agree with the reviewer that
it would be better to compare the individual against allele frequencies in
North and South Native American populations rather than admixed Americans,
but since genomic data for these populations are not publicly available, this
comparison was not possible. The comparison was made against the major
world populations because the main aim of the study was to find the migration
wave to ancient America from other parts of the world. Phenotypic similarity would show the common gene pool shared by the ancient individuals
with the modern sub-populations, and thus their ancestors.
The average frequency of the effect allele was used to find the suggestive
phenotypes for the Anzick-1 individual. As the reviewer suggested, using
the penetrance of the allele as well as the phenotype would be
important for finding the effective phenotypes for the genome in question
[259]. The limitation is that penetrance information is not available for
a large proportion of the phenotypes. It would be an advantage to develop a
method that can use penetrance along with allele frequencies to deduce the
phenotype from genetic information. Also, the use of cross-ethnic SNPs
would help to overcome the issues of population bias [253, 193]. Though the
section was removed from the article, the use of genomic variation and its
association with phenotypes makes it an important part of this thesis. This
section articulates the development of a methodology for genotype to phenotype
association, and the reviewer's input would help in designing a robust method.
The methods of associating variations with phenotypes are still under development and are progressing steadily. The difficulties in genotype to phenotype association studies arise at multiple levels, such as inadequate description of
phenotypes, too little data on genotypes, and the underlying complexity of
the networks that regulate cellular functions. Population bias arises when
the background data are based on studies from specific populations. The effect
of genetics on disease susceptibility varies across populations. It has
been found that the genetic risk for T2D and pancreatic cancer decreased as
humans migrated out of Africa towards East Asia [260]. The effects of common SNPs contributing to complex traits are modest but consistent across
ancestry groups, and these SNPs would only be discovered in large
trans-ethnic cohorts [261]. Systems biology based genotype to phenotype methods
would be advantageous because they would not consider only a single variation or
gene. Instead, they would account for the inheritance of natural variation and
the biological networks [262]. Combining personal genomics with other high
throughput data such as gene expression and proteomics, along with clinical and
pathological test results, would help in revealing the unexpected molecular
Part V
Summary and perspectives
This thesis presents and discusses the state-of-the-art methods implemented
in analysing and interpreting high throughput data as well as integrating
various data sources to uncover the underlying biological mechanisms. The
phenotypes of interest discussed in this thesis work are multifactorial; therefore it is essential to utilise multiple data sets, data types and resources in
order to investigate these complex phenotypes.
Chapter 2 (Paper I) discusses a GWAS conducted with regional
imputation and multiple cross-ethnicity cohort replications, which
re-established five known childhood asthma genes and
discovered a new susceptibility gene, CDHR3, for the exacerbation phenotype
in asthma. A knowledge-based functional analysis of the normal versus the mutated
protein showed altered expression on the cell surface of airway epithelium,
suggesting a role in infections during asthma exacerbations.
Chapter 3 describes an ongoing project in which we have
established a candidate gene panel for a childhood asthma study. We are
currently awaiting the sequencing data from the pilot study. Chapter 4 explains
a prediction tool that combines selected genetic features with clinical
risk features to predict asthma outcome when children are 7 years old. The
method first reduces the search space of genetic features by grouping SNPs into biological pathways and then selecting
the top phenotype-associated pathways. This is accomplished with a machine
learning based method. The selected pathways are then used to identify the
most predictive combinations of SNPs and clinical risk features for childhood
asthma. These studies are aimed at uncovering the biological mechanisms
behind the pathophysiology of childhood asthma, which might help in better
prognosis, management and treatment of the disease.
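The search-space reduction summarised for Chapter 4 (group SNPs into pathways, then keep the top phenotype-associated pathways) can be sketched as follows. This is only an illustration of the grouping-and-ranking idea: the pathway score used here, the mean of −log10(p) over member SNPs, is a simple stand-in and not the machine learning based selection used in the thesis, and all SNP, gene and pathway names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List


def pathway_scores(snp_pvalues: Dict[str, float],
                   snp_to_gene: Dict[str, str],
                   gene_to_pathways: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each pathway by the mean -log10(p) of the SNPs mapped to it.

    Stand-in scoring for illustration; SNPs without a gene or pathway
    annotation are ignored.
    """
    per_pathway = defaultdict(list)
    for snp, p_value in snp_pvalues.items():
        gene = snp_to_gene.get(snp)
        for pathway in gene_to_pathways.get(gene, ()):
            per_pathway[pathway].append(-math.log10(p_value))
    return {pw: sum(vals) / len(vals) for pw, vals in per_pathway.items()}


def top_pathways(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring pathways."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical inputs.
snp_pvalues = {"snp1": 1e-4, "snp2": 1e-2, "snp3": 1e-1, "snp4": 1e-3}
snp_to_gene = {"snp1": "G1", "snp2": "G1", "snp3": "G2", "snp4": "G3"}
gene_to_pathways = {"G1": ["P_immune"], "G2": ["P_metab"],
                    "G3": ["P_immune", "P_metab"]}

scores = pathway_scores(snp_pvalues, snp_to_gene, gene_to_pathways)
selected = top_pathways(scores, k=1)
```

Only the SNPs belonging to the selected pathways would then be passed on, together with the clinical risk features, to the predictive model.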
Adipose tissue plays a central role in lipid and glucose metabolism, as it
acts as an endocrine organ by secreting multiple hormones and cytokines.
However, imbalances in adipose tissue metabolism lead to obesity and
related traits such as T2D and cardiovascular diseases. Chapter 5 of this thesis
examines the underlying mechanism behind the conversion of BAT into WAT
via an intermediate transition state. It was found that two TFs
govern the conversion of BAT to the transition state, while five TFs control
the conversion from the transition state to WAT. An understanding of the mechanisms
involved in these adipose tissue conversions can lead to the development of
therapeutic measures aimed at controlling obesity. Just as there are different
adipose tissue types, there are also different kinds of depots in the body.
The study in Chapter 6 found that genetic and diet-induced obesity have
different effects on adipose tissues, and that these effects vary amongst
adipose depots. Furthermore, it is noteworthy that obesity
induces hypomethylation in adipose tissues.
Chapter 7.1 focuses on the Danish pan-genome study and addresses the
variations observed between the Danish cohort and the European population.
In Chapter 7.2, GRS calculated from a twelve-thousand-year-old ancient
genome are compared against current populations in order to provide an
assessment of its phenotypes as well as its ancestry.
This thesis work consists of six different projects with diverse goals, which
were accomplished by employing the fundamental principles of data integration, annotation and enrichment analysis. I would like to conclude this thesis
by discussing some aspects surrounding the future perspective of systems
biology-based analysis of variations.
As discussed in this thesis, variations dictate the observed phenotypic differences among individuals, but they can also lead to many disorders.
SNPs in particular are among the most discussed and explored variations
when it comes to functional annotation. However, indels and large
structural variations also need to be annotated with the same specificity and uniformity, which will lead to a complete set of annotated genetic variations.
When it comes to identifying causal variations, the coding region
of the genome is usually targeted. A subset of these coding SNPs does not lead to a
change in the amino acid sequence of the protein. This is due to the
degeneracy of the genetic code: multiple
codons code for the same amino acid, so a change in nucleotide is not
necessarily reflected as an amino acid change in the protein. These variations are called
synonymous SNPs. They are usually considered to be non-functional, as
they do not change the final protein sequence. However, they have been demonstrated to alter translation kinetics and affect protein folding, which in
turn affects protein structure and function. A recent study by Stergachis
et al. found that >14% of codons in human exons simultaneously specify
both amino acids and regulatory information in the form of transcription
factor recognition sites; such codons are called duons. These duons are highly conserved, and at least 17% of human coding variants (including synonymous,
non-synonymous, and disease-associated variants) lie within duons [263].
On the other hand, ENCODE has also shown multiple regulatory effects in
the non-coding regions of the genome, making them as important as the
coding variations. However, using the different available datasets for
non-coding variations is not straightforward. Recently developed
tools such as genome-wide annotation of variants (GWAVA) [264] integrate various genomic and epigenomic annotations for non-coding variations, thereby
predicting their functional impact.
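The codon degeneracy behind synonymous SNPs, described above, can be illustrated with a short sketch that classifies a single-codon change using the standard genetic code. The function name and the example codons are illustrative choices; the sketch only covers the amino acid level and says nothing about the regulatory (duon) or translation-kinetics effects mentioned in the text.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Return 'synonymous' if both codons encode the same amino acid
    (or both are stop codons), otherwise 'non-synonymous'."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"


# Degeneracy: number of codons that encode leucine (L).
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")

third_position_change = classify_codon_change("GAA", "GAG")   # Glu -> Glu
second_position_change = classify_codon_change("GAA", "GTA")  # Glu -> Val
```

A change in the third codon position often leaves the amino acid unchanged, whereas the second-position change in the example replaces glutamate with valine.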
In general, when we discuss variation it is considered to be genetic, but
as discussed in different projects in this thesis, variation
can extend beyond the genome of an organism. There are other factors to consider, such as cell type, state of the cell and cell surroundings, which contribute
to the effect of these variations. Therefore, annotation strategies should
be designed to take these features and
their interplay into account. In the disease state, disruption of biological processes is caused
by multiple variants, each making a modest contribution. Pathway and
network based methods agglomerate these variants into clusters and find
the cumulative effect of these low-risk factors. This reflects the molecular landscape underlying the observed phenotype. Eventually, as high
throughput technologies continue to improve, integrative interactions would
be used to characterise and classify individuals. As the field evolves,
advancements in visualisation tools and techniques are also necessary,
alongside the interpretation algorithms. Several tools with non-overlapping functionalities have been developed for the visualisation of “omics”
data, namely Gitools [265], Cytoscape [266], Circos [267], NAViGaTOR [268],
etc. Existing and future tools need to be robust enough to handle
enormous amounts of data.
Barring the few variations that lead to a single point abnormality in the proteome and
metabolome, variations found in complex disorders do not have a
one-to-one relation with proteins or metabolites. These variations can
also segregate into different genes, which are parts of isolated or interacting
pathways. As many genes act cooperatively, a variation in one
of them may lead to a network imbalance. This network balance
can be modelled in disease studies if the knowledge regarding these
coordinated effects and pathways is complete. Active research in the field
is required to contribute to a better understanding of different biological
In the new era of translational science, there is an explosion of high throughput data, which presents difficulties in data interpretation and requires
new paradigms for data analysis and knowledge extraction. Certain
challenges need to be addressed in translational science, such as
the lack of marker-disease association information and of detailed phenotypic
descriptions. Intelligent data mining of clinical databases is required to
find molecular markers for diseases and to reclassify diseases according
to these markers. Clinical use of high-throughput genomic measurements
for improved diagnosis, prognosis, disease profiling and target identification is also required in practice. In the future, there will be immense
data resources derived from multiple “omics” analyses. Generating valuable
information from currently unexploited data resources would be beneficial
for understanding common and rare diseases. Since drug responses
can be genotype dependent, precise medication matched to the
underlying altered genetics can be seen as a future perspective of translational
medicine. This would make clinical trials more cost-effective by reducing the
number of patients and the time required.
Part VI
Chapter 8
Paper V - Role of TIMP-1 in
chemotherapy resistant breast
The study aims at elucidating the role of TIMP-1 in a chemotherapy resistant
breast cancer cell line, using the principles of proteomics discussed in Chapter
1.6. Resistance to chemotherapy is a major cause of death in cancer, yet
the mechanisms behind it remain largely unknown. The TIMP family is known
to inhibit the proteolytic activity of matrix metalloproteinases (MMPs), and
earlier studies suggest that TIMP-1 confers resistance to chemotherapy. The
resistance caused by TIMP-1 is towards multiple drugs, including topoisomerase 1 (TOP1) and 2 (TOP2) inhibitors and taxanes. In this project,
the global proteome and phosphoproteome of MCF-7 breast cancer cell
lines expressing high and low levels of TIMP-1 were compared to find the
molecular mechanism behind the resistance in the presence of high TIMP-1 levels.
My contribution to the project
My contribution to this study was the annotation of the up-regulated and
hyper-phosphorylated proteins, the generation of the PPI network, and the functional analysis,
including the interpretation of the enrichment data and the resulting network.
An interaction network of the up-regulated and hyper-phosphorylated proteins was generated from the STRING database using a confidence cutoff of 0.7.
This moderately high confidence score was employed to
obtain a compact network with fewer false positives and less purely predicted data. These
proteins were analysed for pathway and functional enrichment using IPA
and ExPlain. As discussed in the introduction (Chapter 1.8), the datasets
behind these tools are incomplete and non-overlapping, so the use of multiple tools ensured better coverage of annotations. Since the study was
focused on chemotherapy resistance in breast cancer, the known and
predicted targets of the cancer drugs added another layer of information to
this functional analysis. The targets of the chemotherapeutic drugs used in the
experiments (epirubicin, irinotecan, etoposide and cisplatin) were queried
from the ChemProt database and DrugBank.
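The confidence filtering step described above can be sketched in a few lines. The sketch assumes an edge list in which each interaction carries a combined score on the 0 to 1000 scale used in STRING download files, so that a confidence of 0.7 corresponds to a score of 700; the protein names and scores are hypothetical, and the actual network in the study was generated through STRING and visualised in Cytoscape rather than with this code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, int]  # (protein_a, protein_b, combined_score 0-1000)


def filter_edges(edges: Iterable[Edge], min_confidence: float = 0.7) -> List[Edge]:
    """Keep interactions whose combined score meets the confidence cutoff."""
    threshold = round(min_confidence * 1000)
    return [edge for edge in edges if edge[2] >= threshold]


def adjacency(edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from an edge list."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b, _score in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return dict(neighbours)


# Hypothetical interactions between placeholder proteins.
edges = [
    ("protA", "protB", 912),
    ("protB", "protC", 700),
    ("protA", "protD", 450),
]

confident_edges = filter_edges(edges, min_confidence=0.7)
network = adjacency(confident_edges)
```

Raising the cutoff removes low-scoring interactions (and any protein left without neighbours), which is how a more compact network is obtained.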
All the information collected about the functional classes was layered on the
highly connected PPI network from STRING using Cytoscape. Since genes
can be part of multiple functional classes, it was necessary to visualise this
multi-functional data on the network. The colour-coding of these proteins for
all their functional classes was done using the MultiColoredNodes plugin for
Cytoscape. This analysis helped in clustering the related genes into classes
such as cancer, cell cycle, DNA binding, drug target and drug transporters.
The phosphoproteome data helped in finding which transcription factors are
controlled by phosphorylation, the enrichment data identified their function, and
the interaction network showed their targets. This clustering of
functional classes and the PPI network generation helped in hypothesis generation and interpretation of results.
The paper is included in the appendix as breast cancer is not the major
theme of this thesis. The functional and network analysis conducted as
part of the study is a comprehensive illustration of integrative analysis.
This analysis draws on multiple data types and the underlying
biological relations between them, with support from the knowledge available
in the field. Because of this methodological importance, the study is
relevant for inclusion in the thesis.
TIMP‑1 Increases Expression and Phosphorylation of Proteins
Associated with Drug Resistance in Breast Cancer Cells
Omid Hekmat,§,# Stephanie Munk,§,# Louise Fogh,†,‡,# Rachita Yadav,‡,∥ Chiara Francavilla,§
Heiko Horn,§ Sidse Ørnbjerg Würtz,†,‡ Anne-Sofie Schrohl,†,‡ Britt Damsgaard,†,‡ Maria Unni Rømer,†,‡
Kirstine C. Belling,†,‡ Niels Frank Jensen,†,‡ Irina Gromova,⊥ Dorte B. Bekker-Jensen,§ José M. Moreira,†,‡
Lars J. Jensen,§ Ramneek Gupta,∥,‡ Ulrik Lademann,†,‡ Nils Brünner,†,‡,# Jesper V. Olsen,*,§,#
and Jan Stenvang*,†,‡,#
†Institute of Veterinary Disease Biology, Faculty of Health and Medical Sciences and ‡Sino-Danish Breast Cancer Research Centre,
University of Copenhagen, Dyrlægevej 88, 1., 1870 Frederiksberg C, Denmark
§Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen,
Blegdamsvej 3b, Bldg. 6.1, 2200, Copenhagen, Denmark
∥Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Building 208, 2800,
Kongens Lyngby, Denmark
⊥Cancer Proteomics, Genome Integrity Unit, Danish Cancer Society Research Center, DK-2100 Copenhagen, Denmark
ABSTRACT: Tissue inhibitor of metalloproteinase 1 (TIMP-1) is a protein
with a potential biological role in drug resistance. To elucidate the unknown
molecular mechanisms underlying the association between high TIMP-1 levels
and increased chemotherapy resistance, we employed SILAC-based quantitative
mass spectrometry to analyze global proteome and phosphoproteome
differences of MCF-7 breast cancer cells expressing high or low levels of
TIMP-1. In TIMP-1 high expressing cells, 312 proteins and 452 phosphorylation
sites were up-regulated. Among these were the cancer drug targets topoisomerase 1, 2A, and 2B, which may explain the resistance phenotype to topoisomerase
inhibitors that was observed in cells with high TIMP-1 levels. Pathway analysis
showed an enrichment of proteins from functional categories such as apoptosis,
cell cycle, DNA repair, transcription factors, drug targets and proteins associated
with drug resistance or sensitivity, and drug transportation. The NetworKIN
algorithm predicted the protein kinases CK2a, CDK1, PLK1, and ATM as likely
candidates involved in the hyperphosphorylation of the topoisomerases. Upregulation of protein and/or phosphorylation levels of topoisomerases in TIMP-1 high expressing cells may be part of the
mechanisms by which TIMP-1 confers resistance to treatment with the widely used topoisomerase inhibitors in breast and
colorectal cancer.
KEYWORDS: tissue inhibitor of metalloproteinase 1, SILAC, quantitative mass spectrometry, phosphoproteomics, topoisomerase,
breast cancer, resistance to chemotherapy, two-dimensional PAGE
Received: May 15, 2013
Published: August 5, 2013
J. Proteome Res. 2013, 12, 4136−4151. © 2013 American Chemical Society
Resistance to systemic chemotherapy is considered the main
cause for the annual death of thousands of breast cancer
patients worldwide (refs 1, 2). Although many different mechanisms for
drug resistance have been suggested it is still neither clinically
possible to predict nor to reverse drug resistance. Tissue
inhibitors of metalloproteinases (TIMPs) are a family with four
members known to regulate the proteolytic activity of matrix
metalloproteinases (MMPs) (refs 3, 4). However, these protease
inhibitors have other and non-MMP dependent biological
functions, including regulation of cell proliferation, angiogenesis, and apoptosis (refs 4, 5). A number of studies suggest that the
regulation of apoptosis by some of the TIMPs may have an
impact on cellular sensitivity/resistance to apoptotic stimuli,
including some chemotherapeutic drugs being used in cancer
treatment (refs 6−11). For example, lack of TIMP-1 protein either
alone (ref 12) or in combination with topoisomerase 2A (TOP2A)
gene aberrations (ref 13) was associated with an increased benefit
from adjuvant treatment with a TOP2 inhibitor (epirubicin
containing combination chemotherapy). Of specific interest
was that this association was not observed in patients treated
with a combination chemotherapy regimen not including a
TOP2 inhibitor (ref 13). Similarly, low versus high TIMP-1 plasma
levels showed an association with an increased objective
response rate, increased progression-free survival and increased
survival of metastatic colorectal cancer patients treated with
combination chemotherapy including the topoisomerase 1
(TOP1) inhibitor irinotecan (ref 14). A similar association was not
seen in metastatic colorectal cancer patients treated with
combination chemotherapy without a TOP1 inhibitor (ref 15). In
addition, many publications support the link between TIMP-1
and tumor cell survival demonstrating a highly statistically
significant association between high tumor or plasma levels of
TIMP-1 and poor cancer patient outcome (refs 16−18). Recent
preclinical studies have supported the above-mentioned
findings exemplified by the fact that human breast cancer
cells, which are genetically modified to overexpress TIMP-1,
showed a massive increase in expression of genes involved in
signal transduction, apoptosis, adhesion and proliferation (ref 19) and
the TIMP-1 overexpressing cells had decreased sensitivity to
the TOP2 inhibitor epirubicin and the taxane paclitaxel (refs 10, 11).
The TIMP-1-mediated decrease in sensitivity to epirubicin and
paclitaxel was associated with enhanced degradation of cyclin
B1 (ref 10) and activation of the PI3K/Akt/NF-κB pathway (ref 11). Other
possible mechanisms of action of TIMP-1-mediated drug
resistance came from studies demonstrating an antiapoptotic
activity of TIMP-1 being mediated by activation of the Akt cell
survival pathways, focal adhesion kinase (FAK) and the
extracellular signal-regulated kinase (ERK) pathway (refs 6−8). In
addition, TIMP-1 can bind to the tetraspanin cell surface
protein CD63 (refs 20, 21) and in a human breast epithelial cell line this
interaction induced antiapoptotic effects by activation of the
Akt survival pathway (ref 9).
Collectively, these studies suggest that TIMP-1 confers
resistance to chemotherapy, including treatment with TOP1
and 2 inhibitors and taxanes, supporting the idea of measuring
the level of TIMP-1 as a predictive biomarker for topoisomerase inhibitor response in patients (refs 12−14, 22, 23). Moreover, if the
exact biological functions of TIMP-1 in relation to chemotherapy resistance are identified, it might be possible to
interfere with the mechanisms leading to chemotherapy
resistance and possibly reverse the resistance mechanisms.
In order to elucidate the mechanisms underlying the
association between high TIMP-1 levels and increased
chemotherapy resistance, a quantitative global investigation of
high TIMP-1 expressing breast cancer cells is required. Recent
breakthroughs in the proteomics technology of high-resolution
mass spectrometry (MS) instrumentation allow identification
of thousands of proteins in various proteomes, quantification of
thousands of post-translational modifications (PTMs) such as
phosphorylations and determination of protein−protein
interactions (ref 24). In particular, quantitative proteomics, which
combines stable isotope labeling by amino acids in cell culture
(SILAC) with enrichment strategies of modified peptides and
high-performance MS, represents a powerful approach to
monitor intracellular events in a global fashion.
Our laboratories have generated single cell clones from the
human breast cancer cell line MCF-7 expressing high or low
levels of TIMP-1. We selected two clones with low TIMP-1
protein expression and two clones with high TIMP-1 protein
expression. These cell clones were employed in a SILAC-based (ref 25) quantitative MS approach to investigate the proteome
and phospho-proteome changes between cells expressing high
or low levels of TIMP-1 in two biological replicates.
Cell Cultures and SILAC Labeling
The parental MCF-7S1 breast cancer cell line (kindly provided
by Professor Marja Jäättelä, The Danish Cancer Society,
Copenhagen, Denmark) (ref 26) was stably transfected with pcDNA(hyg)-TIMP-1 by FuGENE transfection reagent (Roche, Denmark) and subsequently single cell cloned by limited dilution.
Eleven single cell clones were screened for TIMP-1 expression
levels and two high and two low expressing TIMP-1 single cell
clones were chosen for further analyses. The cells were
propagated in complete media: RPMI 1640 (Gibco, Invitrogen,
Denmark) with 10% FCS (Gibco, Invitrogen, Denmark) and
100 μg/mL hygromycin (Calbiochem, VWR, Denmark). For
quantitative MS, cells were labeled in SILAC RPMI 1640 (PAA
Laboratories GmbH, Germany) (ref 27) supplemented with 10%
dialyzed FCS (Sigma, Denmark) and 200 μM glutamine
(Gibco, Invitrogen, Denmark) for 12 days to ensure complete
incorporation of amino acids (Figure 1). After the 12 days of
incorporation of amino acids, cells from each condition were
seeded at the same cell density in T300 flasks and media was
changed two days before cell harvest. The two TIMP-1 low
single cell clones were labeled with natural variants (light label)
of the amino acids, one of the TIMP-1 high single cell clones with
medium variants of the amino acids (L-[13C6]Arg (+6) and L-[2H4]Lys (+4)), and the second TIMP-1 high expressing single
cell clone was labeled with heavy variants of the amino acids (L-[13C6,15N4]Arg (+10) and L-[13C6,15N2]Lys (+8)) (Cambridge
Isotope Laboratories, Andover, MA). Cells were propagated
using 0.1% trypsin/EDTA (Gibco, Invitrogen, Denmark).
TIMP-1 wild type (TWT-III) and TIMP-1 knockout (TKO-III) murine fibrosarcoma cell lines were previously established
in our laboratory as described in ref 28. These cells were grown
in M199 media (Gibco), supplemented with 10% FCS. All cells
were grown at 37 °C in humidified air containing 5% CO2.
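The triple SILAC labeling scheme described above implies a predictable mass offset between the light, medium and heavy forms of each peptide, determined by how many arginine and lysine residues it contains. The sketch below computes these offsets from the nominal shifts stated in the text (+6/+4 for medium Arg/Lys and +10/+8 for heavy Arg/Lys). It is an illustration only: the peptide sequence is hypothetical, the shifts are nominal integers rather than exact monoisotopic masses, and quantification software works with the exact values.

```python
from typing import Dict

# Nominal mass shifts (Da) per labeled residue, as stated in the text.
LABEL_SHIFTS: Dict[str, Dict[str, int]] = {
    "light": {"R": 0, "K": 0},
    "medium": {"R": 6, "K": 4},
    "heavy": {"R": 10, "K": 8},
}


def nominal_mass_shift(peptide: str, channel: str) -> int:
    """Total nominal mass shift of a peptide in the given SILAC channel."""
    shifts = LABEL_SHIFTS[channel]
    return sum(shifts.get(residue, 0) for residue in peptide.upper())


def mz_spacing(peptide: str, channel: str, charge: int) -> float:
    """Spacing on the m/z axis between the light and the labeled peptide."""
    return nominal_mass_shift(peptide, channel) / charge


# Hypothetical peptide with one internal lysine and a C-terminal arginine.
peptide = "AEDLKSGR"
medium_shift = nominal_mass_shift(peptide, "medium")   # 4 (K) + 6 (R)
heavy_shift = nominal_mass_shift(peptide, "heavy")     # 8 (K) + 10 (R)
heavy_spacing_2plus = mz_spacing(peptide, "heavy", charge=2)
```

Because trypsin and Lys-C cleave after arginine and lysine, most peptides carry at least one labeled residue, so the three channels appear as separated peak groups in the same spectrum.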
Cell Lysis and In-Solution Digestion
Cells from light/medium/heavy SILAC conditions were lysed
separately at 4 °C in ice cold modified RIPA buffer [50 mM
Tris, pH 7.5, 150 mM NaCl, 1% NP-40, 0.1% sodium
deoxycholate, 1 mM EDTA, 5 mM β-glycerolphosphate, 5
mM NaF, 1 mM sodium orthovanadate, 1 complete inhibitor
cocktail tablet per 50 mL (Roche, Basel, Switzerland)]. Proteins
were precipitated overnight at −20 °C in 4-fold excess of ice
cold acetone. The acetone-precipitated proteins were resolubilized in denaturation buffer (10 mM HEPES, pH 8.0, 6 M urea,
2 M thiourea) and the lysates from light/medium/heavy SILAC
conditions were mixed 1:1:1 based on protein concentrations
(Figure 1). The soluble proteins were reduced for 60 min at
room temperature (RT) with 1 mM DTT and alkylated for 60
min at RT with 5.5 mM chloroacetamide (CAA). Endoproteinase Lys-C (Wako, Osaka, Japan) was added (1:100 m/m) and
the samples were incubated for 3 h at RT. The samples were
then diluted 4-fold with deionized water, and digested with
trypsin (modified sequencing grade, Promega, Madison, WI)
(1:100 m/m) overnight at RT. Trypsin and Lys-C activities
were quenched by acidification of the samples (2% v/v of TFA,
pH ∼ 2). For each of the samples, the peptide mixture was
desalted and concentrated on a C18-SepPak cartridge (Waters,
Milford, MA) and eluted with 1 × 2 mL of 40% acetonitrile
(ACN) in 0.1% TFA followed by 1 × 2 mL 60% ACN in 0.1%
TFA. A sample of each of the eluates (total tryptic proteome)
was desalted and concentrated on a C18 STAGE-tip (ref 31) and
30% ACN) for 30 min, followed by isocratic (100%) buffer B
for 6 min at a flow rate of 1.0 mL/min. Fractions of 2 mL were
collected of which some were pooled.
A sample of each fraction or fraction pool was desalted and
concentrated on a C18 STAGE-tip and eluted with 2 × 10 μL
40% ACN in 0.5% acetic acid before LC-MS/MS (for
proteome analysis).
Phospho-peptides were enriched using Titansphere chromatography as described (ref 29). Briefly, titanium dioxide beads (10 μm
Titansphere, GL Sciences, Japan) were precoated with 2,5-dihydroxybenzoic acid (2,5-DHB) by incubating the beads in a
solution of 20 mg/mL 2,5-DHB in 80% ACN, 1% TFA for 20
min at RT. Approximately 1 mg of coated beads was added to
each SCX fraction or fraction pool and incubated under
rotation for 30 min at RT. Early SCX fractions, mostly enriched
in phospho-peptides, were incubated with coated TiO2 beads
twice consecutively for better coverage. The beads were washed
once with 100 μL SCX buffer B and once with 100 μL 40%
ACN in 0.5% TFA and transferred in 50 μL 80% ACN in 0.5%
acetic acid on top of a C8 STAGE-tip. The bound phospho-peptides were eluted directly into a 96-well plate by 2 × 10 μL
5% NH4OH followed by 2 × 10 μL 10% NH4OH/25% ACN,
pH > 11. The eluate was immediately concentrated in a speed-vac at 60 °C to a final volume of about 5−10 μL and acidified
using 20 μL 5% ACN in 1% TFA. Each sample was then
desalted and concentrated on a C18 STAGE-tip and eluted with
2 × 10 μL 40% ACN in 0.5% acetic acid before LC-MS/MS.
LC-MS/MS of Peptides
All LC-MS/MS experiments were performed on an EASY-nLC
system (Proxeon Biosystems, Odense, Denmark) interfaced
with a hybrid LTQ-Orbitrap Velos (Thermo Electron, Bremen,
Germany)31 through a nanoelectrospray ion source. All
peptides were autosampled and separated on a 15 cm column
(75 μm internal diameter) packed in-house with 3 μm C18
beads (Reprosil-AQ Pur, Dr. Maisch, Germany), where the tip
of the column formed the electrospray (in-house pulled by a
Sutter P-2000). For liquid chromatography, a linear gradient of
ACN in 0.5% acetic acid (either: 8−24% ACN for 90 min, then
24−48% ACN for 15 min, then 60% ACN for 1 min; or: 8−
24% ACN in 150 min, then 24−48% ACN in 30 min, then 60%
ACN for 1 min) was used at a constant flow rate of 250 nL/
min. The effluent from the HPLC was directly electrosprayed
into the mass spectrometer using 2.1 kV spray voltage through
a liquid junction connection and a heated capillary temperature
of 275 °C. A lock-mass ion (m/z 445.120024) was used for
internal calibration in all experiments as described earlier.32 MS
was performed in a data dependent acquisition mode where up
to the 10 most intense peaks were chosen for fragmentation
after acquiring each full scan using Higher energy Collisional
Dissociation (HCD)33 for all MS/MS events. Dynamic
exclusion was used to avoid picking peaks more than once.
The settings were a mass window of 10 ppm, a max list size of
500, and a time window of 90 s. Full scans were acquired in the
m/z range of 300−2000 with an R = 30,000 at m/z 400 and a
target value of 1e6 ions with a maximum injection time of 500
ms. For fragment scans the settings were an isolation window of
4 Da, a minimum signal intensity of 5000, R = 7500 at m/z 400,
and a target value of 5e4 ions with a maximum injection time of
250 ms.
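The data-dependent acquisition logic described above (top 10 precursors per full scan, with dynamic exclusion) can be illustrated with a minimal sketch. The 10 ppm window, 90 s duration, list size of 500, and intensity threshold of 5000 are taken from the text; the class and function names are hypothetical, and the real instrument firmware may handle overflow and expiry differently.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicExclusion:
    ppm_window: float = 10.0   # mass window from the text
    duration_s: float = 90.0   # time window from the text
    max_size: int = 500        # max list size from the text
    entries: list = field(default_factory=list)  # (m/z, expiry time) pairs

    def is_excluded(self, mz, now):
        # Drop expired entries, then test the ppm window against the rest.
        self.entries = [(m, t) for m, t in self.entries if t > now]
        return any(abs(mz - m) / m * 1e6 <= self.ppm_window
                   for m, _ in self.entries)

    def add(self, mz, now):
        self.entries.append((mz, now + self.duration_s))
        # Assumption: when the list overflows, the oldest entries are dropped.
        self.entries = self.entries[-self.max_size:]


def select_precursors(peaks, now, exclusion, top_n=10, min_intensity=5000.0):
    """peaks: list of (m/z, intensity) pairs from one full scan."""
    chosen = []
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if len(chosen) == top_n:
            break
        if intensity < min_intensity or exclusion.is_excluded(mz, now):
            continue
        exclusion.add(mz, now)
        chosen.append(mz)
    return chosen
```

In this sketch a peak within 10 ppm of an already fragmented precursor is skipped until 90 s have passed, after which it becomes eligible again.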
Figure 1. Experimental workflow of the SILAC-based quantitative
proteomics and phospho-proteomics for the analyses of the biological
role of TIMP-1 in breast cancer. Two TIMP-1 low expressing and two
TIMP-1 high expressing populations derived from MCF-7 human
breast cancer cells were labeled by triple SILAC. Lysates were mixed
0.5:0.5:1:1 as shown. Proteins were digested by endoproteinase Lys-C
and trypsin and tryptic peptides were fractionated by SCX. Phosphopeptides were enriched using TiO2 beads precoated with DHB.
Samples were analyzed by high resolution nanoLC-MS/MS. The
proteome data were determined directly from the SCX fractions.
eluted with 2 × 10 μL 40% ACN in 0.5% acetic acid before LC-MS/MS.
SCX Fractionation, Phospho-Peptide Enrichment, and
Proteome Preparation
Peptide fractionation by SCX chromatography29,30 was
performed in a 1 mL Resource S column (GE Healthcare,
Sweden) on an ÄKTA FPLC system (GE Healthcare, Sweden).
The peptide mixture, eluted off C18-SepPak, was loaded directly
onto a 10 mL injection loop and separated by a linear gradient
from 100% SCX buffer A (5 mM KH2PO4, pH 2.7, 30% ACN)
to 30% SCX buffer B (5 mM KH2PO4, pH 2.7, 350 mM KCl,
Analysis of Total Peptide and Enriched Phospho-Peptide
Data Sets by MASCOT and MaxQuant
All raw Orbitrap full-scan MS and MS/MS data were analyzed
together using the software MaxQuant34 version
Proteins were identified by searching the HCD-MS/MS peak
lists against a total of 174 122 protein entries encompassing a
concatenated forward and reversed version of the International
Protein Index (IPI) database for humans (v. 3.68) supplemented with commonly observed contaminants such as porcine
trypsin and bovine serum proteins using the MASCOT search
engine version 2.3.02. Tandem mass spectra were initially
matched with a mass tolerance of 7 ppm on precursor masses
and 0.02 Da for fragment ions, set to recognize tryptic cleavage
sites and allowed for up to three missed cleavage sites. Cysteine
carbamidomethylation (Cys +57.021464 Da) was searched as a
fixed modification. N-Acetylation of protein (N-term
+42.010565 Da), N-pyro-glutamine (Gln −17.026549 Da),
oxidized methionine (+15.994915 Da), and for phosphopeptides: phosphorylation of serine, threonine, and tyrosine
(Ser/Thr/Tyr +79.966331 Da) were searched as variable
modifications. Labeled lysine and arginine were specified as
fixed or variable modification, depending on prior knowledge
about the parent ion (MaxQuant SILAC triplet identification).
Peptide identifications were filtered based on their Mascot
score, SILAC state, number of arginine and lysine residues and
peptide length (minimum peptide length was specified to be six
amino acids) to achieve a maximum false discovery rate of one
percent. Protein groups were assembled and quantified based
on the Occam’s razor principle. Finally, to pinpoint the actual
phosphorylated amino acid residue(s) within all identified
phospho-peptide sequences, MaxQuant calculated the localization probabilities of all putative serine, threonine, and
tyrosine phosphorylation sites using the PTM score algorithm
as described.35
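The target-decoy idea behind the one percent false discovery rate can be sketched as follows. This is a simplified illustration only: it ranks identifications by a single score, whereas the filtering described above also uses SILAC state, the number of arginine and lysine residues, and peptide length. The function name and input layout are assumptions.

```python
def score_cutoff_for_fdr(psms, max_fdr=0.01):
    """psms: list of (score, is_decoy) pairs from a forward+reversed search.

    Returns the lowest score cutoff at which the estimated FDR
    (decoy hits / target hits at or above the cutoff) is <= max_fdr,
    or None if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Record every cutoff that still satisfies the FDR limit.
        if targets and decoys / targets <= max_fdr:
            best = score
    return best
```

Hits to the reversed sequences estimate how many forward hits above the same cutoff are expected to be false.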
Statistical Determination of SILAC Ratio Cutoffs for
Expression and Phosphorylation
Medians of the log2-transformed normalized SILAC ratios were
calculated using the SILAC ratio sets from the biological
replicates thus reflecting the high TIMP-1/low TIMP-1 ratios
of expression and phosphorylation for all identified proteins
and phospho-sites, respectively. The statistical P-values were
calculated for detection of significant outlier ratios (Significance
A values). Three levels of significance were chosen, P-value
<0.01, 0.01 ≤ P-value < 0.05, P-value ≥ 0.05, and the median
log2-transformed normalized SILAC ratios were plotted as a
function of the log10-transformed summed peptide intensities
for proteins and as a function of the log10‑transformed phosphopeptide intensities for phospho-sites. The median ratio cutoffs
were then chosen so as to exclude the median ratios with P-values >0.05 as shown in Figure 2.
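The text does not give the formula behind the Significance A values. The sketch below assumes the definition commonly described for MaxQuant: a robust, asymmetric z-score built from the 15.87th, 50th, and 84.13th percentiles of the log ratios, converted to a one-sided P-value with the complementary error function. Function names are hypothetical, and the sketch assumes the three percentiles are distinct.

```python
import math


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..100) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def significance_a(log2_ratios):
    """One-sided outlier P-value for each log2-transformed ratio."""
    vals = sorted(log2_ratios)
    r_lo = percentile(vals, 15.87)
    r_mid = percentile(vals, 50.0)
    r_hi = percentile(vals, 84.13)
    pvals = []
    for r in log2_ratios:
        if r >= r_mid:
            z = (r - r_mid) / (r_hi - r_mid)
        else:
            z = (r_mid - r) / (r_mid - r_lo)
        pvals.append(0.5 * math.erfc(z / math.sqrt(2)))
    return pvals


def significance_level(p):
    """Bin a P-value into the three levels used in the text."""
    if p < 0.01:
        return "P < 0.01"
    if p < 0.05:
        return "0.01 <= P < 0.05"
    return "P >= 0.05"
```

A ratio at the median gets P = 0.5, and P shrinks as the ratio moves away from the bulk of the distribution in either direction.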
STRING Network and Ingenuity Pathway Analysis
Proteins with median normalized SILAC ratios ≥ 2.3 at the
expression level and/or median normalized SILAC ratios ≥ 3.0
at phosphorylation level (460 entries) were used to build a
protein−protein interaction network from the STRING
database system ( at a reliability score
of at least 0.7.37 The same set of proteins was analyzed for
enrichment of pathways and functional classes using the tools
Ingenuity Pathway Analysis (IPA, and
Explain (Biobase, as
well as in-house phenotypically related gene collections. The
460 UniProt entries, mapping to 453 encoding genes (six
Figure 2. Statistical determination of the median normalized SILAC
ratio cutoffs for proteins up-regulated at expression and/or
phosphorylation. (A) Median log2-transformed normalized SILAC
ratios for proteins plotted as a function of the log10-transformed
summed peptide intensities and categorized based on significance A
values for the regulation. (B) Median log2-transformed normalized
SILAC ratios for class I phospho-sites plotted as a function of the
log10-transformed phospho-peptide intensities and categorized based
on significance A values for the regulation.
etoposide cytotoxicity in the MCF-7 cells and cytotoxicity of
epirubicin and SN-38 in the murine TWT-III and TKO-III
cells. All MTT and LDH assays were performed at least three
times and each time in triplicates.
protein entries were obsolete in UniProt, two were not mapped
to genes, and one demerged into two genes) were further
analyzed for different functional class enrichments and drug
interactions using gene ontology. Targets of the used
chemotherapies epirubicin, irinotecan, etoposide, and cisplatin
were queried from ChemProt database38 and DrugBank.39 This
gene annotation data was layered on the protein−protein
interaction network from the STRING database using Cytoscape40 and its MultiColoredNodes plugin.41
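As a rough illustration of how the network input could be assembled, the sketch below selects proteins that pass either ratio cutoff and keeps only interactions scored at 0.7 or higher between selected proteins. The cutoffs (2.3, 3.0, 0.7) come from the text; the function names and data layout are hypothetical, and the actual analysis used the STRING database and Cytoscape rather than code like this.

```python
def select_regulated(protein_ratios, site_ratios,
                     protein_cutoff=2.3, site_cutoff=3.0):
    """protein_ratios: {protein: median ratio}
    site_ratios: {(protein, site): median ratio}"""
    selected = {p for p, r in protein_ratios.items() if r >= protein_cutoff}
    # A protein also qualifies if any of its phospho-sites passes the cutoff.
    selected |= {p for (p, _), r in site_ratios.items() if r >= site_cutoff}
    return selected


def filter_network(edges, selected, min_score=0.7):
    """edges: iterable of (protein_a, protein_b, score in 0..1)."""
    kept = [(a, b, s) for a, b, s in edges
            if s >= min_score and a in selected and b in selected]
    nodes = {n for a, b, _ in kept for n in (a, b)}
    return kept, nodes
```

Selected proteins without any high-confidence interaction partner in the list drop out of the node set.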
Western Blotting and TIMP-1 ELISA
Lysates from each of the four selected cell clones were
harvested individually by scraping off the cells in ice cold PBS,
spun down at 300g at 4 °C and lysed by incubation in
ProteoJET Mammalian Cell Lysis Reagent (Fermentas,
Germany) containing protease inhibitors (Aprotinin, Leupeptin, Pepstatin A and Pefa Block, 1 μg/mL) (Calbiochem, VWR,
Denmark) and phosphatase inhibitors (sodium fluoride and
sodium orthovanadate, 1 mM) (Calbiochem, VWR, Denmark)
for 10 min at RT. The cells were then spun at 18 000g for 15
min at 4 °C, and each of the four supernatants was transferred
to a new tube. The total amount of protein was determined by
the BCA Protein Assay kit (Pierce, VWR, Denmark) according
to manufacturer’s instructions.
The NuPAGE system (Invitrogen A/S, Denmark) was used
for SDS-PAGE gel separation of proteins according to
manufacturer’s instructions. In brief, lysates were mixed with
NuPAGE LDS loading buffer and NuPAGE sample reducing
agent. Samples were then incubated at 70 °C for 10 min.
Samples were loaded onto NuPAGE Novex 4−12% Bis-Tris
gels with 50 μg/lane and were run in NuPAGE MOPS buffer
with NuPAGE antioxidant according to the manufacturer’s
instructions. Gels were blotted on polyvinylidene difluoride
membranes with 2× NuPAGE transfer buffer with 20% ethanol.
Blots were blocked in washing buffer (PBS + 0.1% Tween 20)
containing either 5% nonfat dry milk (for TOP1, TOP2B) or
2% ECL prime blocking reagent (for TIMP-1, β-actin, TOP2A)
for 1 h and incubated overnight with the appropriate primary
antibody diluted in blocking reagent: In-house mouse
monoclonal anti-TIMP-1 antibody VT-7, 0.1 μg/mL,45 rabbit
monoclonal anti-TOP1 1:10 000 (Epitomics, Abcam, Burlingame, CA), rabbit monoclonal anti-TOP2A 1:1000 (Cell
Signaling, VWR, Denmark), sheep polyclonal anti-TOP2B
1:500 (R&D Systems, Trichem, Denmark), and mouse
monoclonal anti-β-actin 1:1 500 000 (Sigma-Aldrich, Denmark). Blots were washed four times for a period of 30 min
in washing buffer and incubated with secondary horseradish
peroxidase-conjugated antibody (Dako A/S, Denmark) diluted
in blocking reagent. The blots were washed four times for a
period of 30 min and developed using the Amersham ECL plus
Western Blotting Detection Kit (for TOP2B) or Amersham
ECL Advance Western Blotting Detection Kit (for TIMP-1, β-actin, TOP2A) (GE Health/Amersham Bioscience, VWR,
Denmark) according to the manufacturer’s instructions. Blots
were visualized with a CCD camera (BioSpectrum Imaging
System, UVP BioImaging, Upland, CA).
During experiments, the differences in cellular expression of
TIMP-1 among the selected clones were routinely assayed with
an in-house sandwich ELISA assay employing a sheep
polyclonal anti-TIMP-1 antibody in the catching step and the
MAC15 anti-TIMP-1 monoclonal antibody in the detection
step, as described in ref 46. The levels of murine TIMP-1 in the
wild-type and knockout cells were measured by a commercial
quantikine mouse TIMP-1 ELISA kit (R&D Systems)
according to the manufacturer’s recommendations.
Phosphorylation Sites Sequence Bias Analysis
Sequence bias around the up-regulated phosphorylation sites
was visualized using the IceLogo software42 which compared
class I up-regulated phosphorylation sites (median normalized
ratios ≥ 3.0) with reference class I phosphorylation sites (0.8 ≤
median normalized ratios ≤ 1.2), all from the same data set.
The outcome of the IceLogo analysis was compared to known
kinase substrate motifs (www.phosida.com35) in order to obtain
over-represented, unbiased, and under-represented known
kinase substrate motifs for the up-regulated phospho-sites.
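The comparison underlying such a sequence logo can be sketched as a per-position difference in amino acid frequency between the up-regulated and the reference sequence windows. This is only an illustration of the idea: IceLogo additionally tests each difference for statistical significance, which is omitted here, and the function names are hypothetical.

```python
from collections import Counter


def position_frequencies(windows):
    """windows: equal-length sequence windows centred on the phospho-site."""
    length = len(windows[0])
    freqs = []
    for i in range(length):
        counts = Counter(w[i] for w in windows)
        freqs.append({aa: c / len(windows) for aa, c in counts.items()})
    return freqs


def frequency_difference(regulated, reference):
    """Per-position frequency of each residue in regulated minus reference."""
    reg = position_frequencies(regulated)
    ref = position_frequencies(reference)
    diff = []
    for r, f in zip(reg, ref):
        residues = set(r) | set(f)
        diff.append({aa: r.get(aa, 0.0) - f.get(aa, 0.0) for aa in residues})
    return diff
```

Positive values mark residues enriched around up-regulated sites and negative values mark depleted residues, for example a depletion of proline at the +1 position.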
NetPhorest and NetworKIN Kinase Prediction Analysis
The NetworKIN algorithm43 combines kinase consensus
motifs, extracted from the NetPhorest atlas,44 with contextual
information of the kinases and their substrates in protein
association networks extracted from the STRING database. It
was applied on all phosphorylation sites obtained by MS
analysis. Since NetworKIN incorporates data from NetPhorest,
the results include not only the specific kinase but also the
name of the NetPhorest group.
Growth Assay and Sensitivity to Chemotherapy
For the growth assay, 40 000 cells/well of each cell line were
plated in six 6-well plates. Each day, one plate was harvested:
media were removed from all wells and cells were washed with
PBS before the addition of 1 mL trypsin. After incubation for
60 s, 1 mL of media was added and cells were resuspended.
Three individual samples from each cell suspension were
counted using a hemocytometer, and the doubling time for
each cell line was calculated based on three independent
experiments. Growth medium was renewed on the fourth day
after plating the cells.
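Assuming exponential growth, the doubling time follows from the slope of a straight-line fit to the logarithm of the cell counts. The sketch below is one way to do that calculation with an ordinary least-squares fit; the function name is hypothetical.

```python
import math


def doubling_time(hours, counts):
    """Fit ln(count) = a + k * t by least squares and return ln(2) / k."""
    n = len(hours)
    logs = [math.log(c) for c in counts]
    mean_t = sum(hours) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(hours, logs))
    k = sxy / sxx  # growth rate per hour
    return math.log(2) / k
```

For counts that double exactly every 24 h starting from 40 000 cells, the function returns 24.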
Viability of the four included TIMP-1 cell clones was tested
upon treatment with the TOP2 inhibitor epirubicin (Meda AS,
Denmark), the TOP1 inhibitor SN-38 (the active metabolite of
irinotecan) (Sigma-Aldrich, Denmark), the TOP2B inhibitor 2-(4-((7-chloro-2-quinoxalinyl)oxy)phenoxy)propionic acid
(XK469) (Sigma-Aldrich, Denmark), the TOP2 inhibitor etoposide
(Meda AS, Denmark), and cis-diamminedichloroplatinum
(Cisplatin) (Hospira, Denmark). Cells were seeded in 96-well
plates with 8000 cells/well and allowed to plate overnight. Cells
were then treated with the appropriate drug for 48 h, and cell
viability was determined by addition of MTT (Sigma-Aldrich,
Denmark) dissolved in PBS. MTT was added to the cells in
complete media at a final concentration of 0.5 mg/mL. Cells
were incubated at 37 °C for 3 h and generated formazan
crystals were dissolved with 20% SDS in 0.02 M HCl overnight
and measured at 570 and 690 nm. Based on the dose response
curves generated for the low and high TIMP-1 cell clones in
response to each of the chemotherapeutics, the inhibitory
concentration resulting in 50% viability (IC50) for each of the
chemotherapeutics was estimated. As previously described,28 a
lactate dehydrogenase (LDH) release assay (Cytotoxicity
Detection Kit; Roche A/S, Denmark) was applied to evaluate
Two-Dimensional Gel Electrophoresis and Immunoblotting
Cellular lysates were subjected to IEF (pI 4−7) two-dimensional PAGE (2D PAGE) as previously described.47
(median normalized ratio ≥ 3.0), whereas 542 proteins
(median normalized ratio ≤ 0.4) and 443 phosphorylation
sites (median normalized ratio ≤0.3) were down-regulated
(Table 1). Spearman’s correlation coefficients of 0.9 between
Between 20 and 30 μL of sample was applied to the first
dimension, and IEF gels were run for each sample. Proteins
were visualized using a silver staining procedure. Western blots of lysates were prepared as previously
described. Briefly, proteins were resolved by 2D-gel electrophoresis, blotted onto Hybond-C nitrocellulose membranes
(Amersham Biosciences), and reacted with a TOP1 specific
rabbit antibody (1:2000 TOP1 antibody, Epitomics) followed
by detection of immune complexes with a horseradish
peroxidase-labeled polymer (1:200) (Envision+ detection kit;
DAKO). Blocking of antibody cross-reactivity was done using a
protein-free blocking buffer (Thermo Fisher Scientific,
Waltham, MA). Membranes were reversibly stained with
Ponceau S solution (Sigma-Aldrich) to match the location of
proteins in the membrane with the Western blot signal and to
ensure proper focusing of protein spots. To identify the
phosphorylation state of TOP1, one aliquot of each of low
TIMP-1 A or high TIMP-1 B cell lysates was treated for 30 min
at 37 °C with lambda protein phosphatase (Lambda PP), a
Mn2+-dependent protein phosphatase with activity toward
phosphorylated serine, threonine, and tyrosine residues
according to the manufacturer’s instructions. One aliquot of
each cell clone was mock-treated prior to resolving by 2D gel
Table 1. Summary of Quantitative Proteomics and Phospho-Proteomics Data
identified protein groups
identified phospho-sitesc
ratio ≥ 3.0b
ratio ≤ 0.3b
Nonredundant total number identified in both experimental
replicates 1 and 2. bRatios are medians of the normalized High
TIMP-1/low TIMP-1 SILAC ratios from both experimental replicates
1 and 2; ratio cutoffs were determined from the statistical analyses
based on significance A values (Figure 2). cMASCOT score ≥ 10;
PTM score ≥ 25; localization probability ≥ 0.8.
normalized SILAC ratios for proteins identified in both
experiments and coefficients of 0.6−0.8 between normalized
SILAC ratios for phosphorylation sites in both experiments
(Supporting Information Figures 2A and B) were in line with
previous phosphoproteomics experiments.48
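Spearman's coefficient is the Pearson correlation of the ranks, so replicate agreement of this kind can be checked with a few lines of code. The sketch below averages ranks for tied values; the function names are hypothetical and the input values in any usage are illustrative only.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because only ranks enter the calculation, the coefficient is the same whether it is computed on raw or log2-transformed SILAC ratios.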
Similar to what has been observed in most SILAC
experiments,49,50 the majority of proteins (>75%) and
phosphoproteins (>60%) were found to have SILAC ratios
between 0.5 and 2.0 and to exemplify the general validity of the
data set, housekeeping proteins such as Heat shock 70 kDa
protein 4 (hsp74) (Figure 3B, left) and β-tubulin (data not
shown) were found to be expressed in equal amounts in both
TIMP-1 low and high expressing clones. Differential expression
of TIMP-1 among the cells expressing low and high levels was
verified in the proteome data set (Figure 3B, right), in
concordance with Western blot and ELISA analyses (Figure
Statistical Analysis of Cell Viability Data and Cell Growth
All calculations were performed using SAS software (version
9.2, SAS Software, Inc., Cary, NC). For statistical analyses, the
relationship between TIMP-1 concentration and cell survival
was analyzed with a mixed model, a generalization
of the general linear model that contains both fixed and
random effects. Within the model, drug doses and TIMP-1
levels were set as fixed effects, whereas cell line, plate
placement, and experiment number were set as random effects.
Mean values of the doubling times for the low TIMP-1A/B and
the high TIMP-1A/B cells were analyzed by Student’s t test.
The level of significance was set at P < 0.05.
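For the doubling-time comparison, the two-sample Student's t statistic with pooled variance can be written out directly. The sketch below covers only this t test, not the mixed model, which was fitted in SAS. It returns the statistic and the degrees of freedom, and leaves the P-value lookup to a statistics package; the function name is hypothetical.

```python
import math


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance: (t, df)."""
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean_a - mean_b) / se, df
```

With hypothetical doubling times of 25.0, 24.5, and 25.5 h against 22.0, 21.5, and 22.5 h, the statistic is about 7.35 on 4 degrees of freedom.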
ratio ≥ 2.3b
ratio ≤ 0.4b
Proteomic and Phosphoproteomic Analysis of TIMP-1
Expressing Cells
Pathway and Functional Category Enrichment in TIMP-1
High Expressing Cells
TIMP-1 transfected single cell clones obtained from the human
breast cancer cell line MCF-7S1 were used as the cellular
model, and the clones were SILAC labeled (Figure 1). Based
on TIMP-1 protein expression levels as determined by ELISA,
we selected two high and two low expressing clones from
our panel of 11 single cell clones.
From both replicates combined, we found 41 417 unique
peptides originating from 6709 protein groups and 5421 unique
class I phospho-sites mapped to 1640 protein groups
(Supporting Information Tables 1−4). The overlap of the
identified protein groups between the two biological replicates
was 68% (Supporting Information Figure 1A). Serine (Ser),
threonine (Thr), and tyrosine (Tyr) phosphorylation sites
comprised 92.2%, 7.4%, and 0.4% of the total phosphorylation
sites, respectively (Supporting Information Table 5), with
similar percentages for the up-regulated Ser/Thr/Tyr sites.
Moreover, one or two phosphorylation sites were detected in
most phosphorylated peptides (Supporting Information Figures
1B and C). Comparative analysis of the proteomic data from
the TIMP-1 clones revealed that the TIMP-1 high expressing
cells overexpressed 312 proteins (median normalized ratio ≥
2.3) and 452 class I phosphorylation sites were up-regulated
Proteins found to be up-regulated (median normalized ratio ≥
2.3) and/or hyper-phosphorylated (median normalized ratio ≥
3.0) in the TIMP-1 high expressing cells were selected for
further analysis. The cutoff values were selected based on the
significance of regulation at both expression and phosphorylation levels (Figure 2). Using these cutoff values, a combined
list of 460 highly up-regulated proteins was generated. An
interaction network of 146 nodes was obtained for these
proteins at a high confidence level (0.7) in the STRING
database (Figure 4). In Table 2A, we list the up-regulated and/
or hyperphosphorylated proteins with known biological relation
to TIMP-1. Interestingly, TIMP-1 was directly connected to the
CD44 antigen (up-regulated 2.7 fold, Table 2A), which has been
shown to bind TIMP-1,51 and to clusterin (CLU) (up-regulated
4 fold, Table 2A), which has been associated with drug
resistance to both TOP1 and TOP2 inhibitors.52,53 These 460
proteins were used for pathway and functional enrichment
analysis (Supporting Information Table 7). IPA mapped the 460
proteins to 453 entries in the Ingenuity database. The JAK/STAT
signaling pathway and cell cycle G2/M DNA damage
checkpoint regulation pathway were among the significantly
Figure 3. Validation of the model system. (A) Western blot analysis of TIMP-1 with β-actin as a loading control. The quantitative ELISA
measurements of TIMP-1 are shown below the blot. (B) Representative peptide from HSP74 showing 1:1:1 ratios independent of TIMP-1 (left) and
representative unmodified peptide from TIMP-1 (right). All peptides are in SILAC triplets. Different colors correspond to the colors in Figure 1,
representing the SILAC L/M/H labels.
members JunB (up-regulated 3.0 fold in the TIMP-1 high
expressing cell lines) and Fos-related antigen 2 (FRA2 or
FosL2) (up-regulated 2.6 fold in the TIMP-1 high expressing
cell lines) are also in the network (Figure 4). Noteworthy
among other proteins of particular interest in relation to TIMP-1 that did not show up in the protein interaction network but
were nevertheless found to be up-regulated in TIMP-1 high
expressing cells (Table 2A) was the membrane protein CD63,
previously shown to bind to TIMP-1.9,21,51 CD63 was up-regulated approximately 2-fold (below the statistical cutoff of 2.3).
perturbed in the data set. The JAK/STAT pathway is one of the
main signaling pathways in eukaryotic cells and is involved in
the control of cell proliferation, differentiation, survival, and
apoptosis.54 The G2/M damage checkpoint is often deficient in
cancer, resulting in survival of cells with DNA damage and
mutations leading to resistance and sustained proliferation.55
The same gene set was further analyzed for biological function,
molecular processes enrichment and drug interactions. Proteins
overexpressed in TIMP-1 high expressing cells participate in
several functional categories including: apoptosis (e.g., CLU,
FosL2, mTOR, TIMP-1, CD44, TOP1, TOP2B, ABCC1), cell
cycle (e.g., mTOR, TOP1, TOP2B), DNA repair (e.g., TOP1,
CLU), drug resistance or sensitivity (e.g., NDRG1, TOP2A,
CD59, CLU), drug targets (e.g., TOP1, TOP2A, TOP2B), and
drug transport (e.g., ABCC1, -3, -6). For a complete list of
functional groups and the proteins discovered in each group,
see Supporting Information Table 7. The most relevant
functional classes were layered on the protein−protein
interaction network from STRING with color-coding representing different functional classes (Figure 4). This analysis
aimed to search for novel links between TIMP-1 and cancer
related pathways, thereby identifying potential new functional
roles of TIMP-1 in cancer. In addition to TIMP-1 and its direct
interactors CD44 and CLU in the functional network (Figure
4), noteworthy among other proteins in the network are TOP1,
TOP2A, and TOP2B, all of which are involved in maintaining
DNA topology during DNA replication, transcription, or
repairing DNA double strand breaks. Interestingly, we also
identified the mammalian target of rapamycin (mTOR), which
has been implicated in the resistance to TOP2 inhibitors.56
Activator protein-1 (AP-1) transcription factor complex
High TIMP-1 Protein Level Is Associated with Increased
Levels and Phosphorylation of Topoisomerases
DNA topoisomerases were found in the enriched functional
classes (Figure 4). More specifically, TOP2A displayed
increased expression (8-fold) in TIMP-1 high expressing cells
(Table 2A) and the proteomics data was validated by Western
blotting (Figure 5B). Proteomics data also showed that TOP1
was 1.8 fold higher expressed (Table 2A) in the TIMP-1 high
expressing cells; however, this slight up-regulation is lower
than the statistically determined cutoff value of 2.3 and higher
expression of TOP1 is not detectable in the Western blot
(Figure 5B). There was no differential expression of TOP2B
between TIMP-1 low and high expressing cells (Table 2A), also
validated by the Western blotting (Figure 5B).
Many phosphorylation sites on the topoisomerases were
found to be up-regulated in the TIMP-1 high expressing cells
(Table 2B). TOP1 had three phosphorylation sites (Ser 2, 10
and 112), where phosphorylation was up-regulated in TIMP-1
high expressing cells (Figure 5A, left and Table 2B). The
phosphorylation at Ser 2 was only detected in the first replicate
with a fold-change of about 3. Phosphorylation at Ser 10 and
Figure 4. Functional class distribution of protein−protein interaction network of the identified proteins up-regulated at expression and/or
phosphorylation levels in TIMP-1 high expressing cells. The nodes in the STRING network are sectored by different colors for functional
annotations. Circular nodes originate from the proteome data set (median normalized ratio ≥ 2.3), triangular nodes originate from the phosphoproteome data set (median normalized ratio ≥ 3.0), whereas the octagon represents the proteins detected in both data sets. TOP1, TOP2A, TOP2B,
TIMP-1, CD44, and CLU are highlighted.
immunoblots + phosphatase) showed substantially fewer
modified forms as compared with the untreated samples,
indicating that the majority of the more acidic forms are due to
phosphorylations, which supports our MS-based analysis.
The most heavily phosphorylated topoisomerase enzyme was
TOP2B, in which several Ser sites were phosphorylated in the
TIMP-1 high expressing cells: Ser 1336, 1340, 1342, 1344,
1400, 1413, 1461, 1466, 1522, 1524, and 1526 (Figure 5A, right
and Table 2B). The phosphorylations on Ser 1461 and 1466
were only detected in the first experiment, but were found to
have an 11-fold increase. All other phosphorylation sites were
found to be 2−5-fold up-regulated in TIMP-1 high expressing
cells. Since TOP2B is similarly expressed between TIMP-1 low
and high expressing cells (Figure 5B and Table 2A), the SILAC
ratios for the phosphorylations indicate true up-regulation of
several phosphorylations at a post-translational level.
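The reasoning that a phospho-site ratio exceeding the protein ratio points to regulation at the post-translational level can be made concrete by dividing one by the other. The sketch below uses the approximate fold changes quoted in the text for TOP1 and TOP2A. The TOP2B protein ratio of 1.0 is an assumption standing in for "similarly expressed", and the function name is hypothetical.

```python
def abundance_corrected_ratio(site_ratio, protein_ratio):
    """Phospho-site SILAC ratio divided by the protein SILAC ratio."""
    return site_ratio / protein_ratio


# (phospho-site ratio, protein ratio), high TIMP-1 / low TIMP-1
examples = {
    "TOP1 Ser112": (3.0, 1.8),
    "TOP2A Ser1328": (13.0, 8.0),
    "TOP2B Ser1461": (11.0, 1.0),  # protein ratio of 1.0 is an assumption
}
corrected = {name: abundance_corrected_ratio(site, protein)
             for name, (site, protein) in examples.items()}
```

A corrected value above 1 means the site is more phosphorylated than the change in protein abundance alone would explain.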
112 were detected in both replicates with up-regulations of
about two- and three-fold, respectively, in the TIMP-1 high
expressing cells. TOP2A had one identified phosphorylation
site at Ser 1328 which was about 13-fold more phosphorylated
in the TIMP-1 high expressing cells although this was only
detected in replicate one. The fact that SILAC ratios for the
phosphorylations of TOP1 and TOP2A are generally higher
than the SILAC ratios for their expressions (Table 2), between
TIMP-1 low and high expressing cells, indicates some up-regulation of phosphorylation at a post-translational level. To
confirm these differences in the phosphorylation states of
TOP1 between low and high TIMP-1 expressing cells, we
exploited the fact that a phosphorylated protein will almost
always have a more acidic pI than its corresponding
unphosphorylated form. IEF followed by immunoblotting
allows detection of more acidic forms of a protein and thus of
its PTM state. Indeed, 2D gel-based comparative analysis of
TOP1 in low TIMP-1 and high TIMP-1 cells showed that
TOP1 exists in at least four modified forms in low
TIMP-1 expressing cells, and that TIMP-1 overexpression
is associated with TOP1 gaining additional modifications, with a clear shift
toward multiple modification states (Figure 5C, TOP1
immunoblots). Since PTMs other than phosphorylation can
cause changes in the pI, we treated cell lysates with lambda
phosphatase prior to gel analysis to show that the multiple
forms identified were mainly due to phosphorylation events.
The TOP1 patterns obtained in this manner (Figure 5C, TOP1
Kinase Motif Analysis of Upregulated Phosphorylation
Sites in TIMP-1 High Expressing Cells
In order to visualize the kinase motifs over-represented in the
up-regulated phosphorylation sites compared to the unregulated
phosphorylation sites, the sequence windows aligned around all
class I up-regulated phosphorylation sites with median
normalized ratios equal to or higher than the statistical cutoff
of 3.0 were compared to those of unregulated class I sites and
demonstrated a bias against arginine in several minus and plus
subsites (especially −1 to −4) and against proline in +1 subsite
(Figure 6A). This indicates an under-representation of the
Table 2. Regulated Proteins and Phospho-Proteins with Known Biological Relation to TIMP-1a
(A) Effect of TIMP-1 expression levels on those of topoisomerases and others: Median protein ratios from quantitative proteomics
Column headers: protein name; UniProt name; high TIMP-1/low TIMP-1b
Protein names: metalloproteinase inhibitor 1; DNA topoisomerase 1; DNA topoisomerase 2-alpha; DNA topoisomerase 2-beta; CD44 antigen; CD63 antigen; transcription factor jun-B; Fos-related antigen 2
Ratios (unaligned): 10 ± 1; 1.8 ± 0.2; 2.7 ± 0.6; 2.1 ± 0.2; 3.0 ± 0.5; 2.6 ± 0.8
(B) Effect of TIMP-1 expression levels on the phosphorylation levels of topoisomerases: Marker phospho-peptides identified by quantitative phospho-proteomics
Column headers: protein name; high TIMP-1/low TIMP-1e; NetPhorest group; NetworKIN score
Entries (unaligned): sites 1340, 1342, 1344; sites 1524, 1526; ratio 2.8 ± 0.4; PLK_group (all three sites); CDK2_CDK3_CDK1_CDK5 group; PLK1 (all three sites)
a(A) Median protein SILAC ratios from quantitative proteomics. (B) Marker phospho-peptides from topoisomerases identified by quantitative phospho-proteomics. Median phospho-peptide SILAC ratios are reported. Potential protein kinases responsible for the up-regulated phosphorylation sites (NetworKIN) in topoisomerases are reported along with their respective scores.
bAll protein ratios (total peptide counts ≥ 2) are medians of the normalized SILAC ratios from experimental replicates 1 and 2.
cOnly identified in experimental replicate 1.
dA site localization probability cutoff of 0.80 was used.
eAll ratios are medians of the normalized SILAC ratios from experimental replicates 1 and 2.
fOnly identified in experimental replicate 1.
Figure 5. Effect of TIMP-1 overexpression on the expression and phosphorylation levels of topoisomerases. (A) A representative phosphorylated
peptide from pTOP1 (left) and a phosphorylated peptide from pTOP2B (right). All peptides are in SILAC triplets. Different colors correspond to
the colors in Figure 1, representing the SILAC L/M/H labels. (B) Western blot analysis of TOP1, TOP2A, and TOP2B with β-actin as a loading
control. The expression for TIMP-1 is also shown. (C) Two-dimensional immunoblot analysis of TOP1 expression and PTMs patterns in low
TIMP-1 A (upper panel) and high TIMP-1 B (lower panel) cell line clones. The IEF gels were either silver stained (left-hand panels) or
immunoblotted for TOP1 (right-hand panels). Arrowheads indicate multiple forms of TOP1. Treatment of lysates with lambda protein phosphatase
prior to gel analysis is shown (right-hand panel, TOP1 immunoblot + phosphatase).
Figure 6. Effect of TIMP-1 overexpression on the global phosphorylation patterns and cell growth. (A) Phospho-peptides were aligned around the
class I phosphorylation sites, thereby comparing up-regulated phospho-peptides (median normalized ratios ≥ 3.0) with reference phospho-peptides
(0.8 ≤ median normalized ratios ≤ 1.2). Observed sequence bias was visualized with the IceLogo software tool. Over-represented, unbiased, and
under-represented known kinase substrate motifs are also shown. (B) Growth curves for the cell lines low TIMP-1 A, low TIMP-1 B, high TIMP-1
A, and high TIMP-1 B. Cells were counted in triplicate at 24 h intervals, and the best-fitting exponential curves were overlaid on the data.
Three independent experiments were performed and a representative experiment is shown. Error bars represent SE. Doubling times were calculated
from the curves, and the differences between the low TIMP-1 A/B and the high TIMP-1 A/B cells were statistically significant (P = 0.0003).
basophilic kinases such as PKA and PKC, as well as the
proline-directed cyclin-dependent kinases and MAP kinases,57
which are of particular interest since TIMP-1 high expressing
cells showed a significantly longer doubling time (25 h)
than the TIMP-1 low expressing cells (22 h) (P =
0.0003) (Figure 6B). A preference was seen for glutamic acid
in the minus subsites (−4 to −1) and for serine in the distal
minus subsites (−6 to −3) (Figure 6A) in TIMP-1 high cells,
which indicates an over-representation of kinases such as PLK,
PLK1, and CK1. There is no bias for or against kinases such as
CK2 and ATM/ATR (Figure 6A).
To combine the sequence bias information with the
protein association network information, a NetworKIN analysis
was used to identify candidate kinases involved in the hyperphosphorylation of topoisomerases (Table 2B). Five high-scoring kinases were identified: ATM, CDK1, CK2 alpha, PKC
delta, and PLK1. PLK1 was the only kinase that was
expressed at a slightly higher level in the TIMP-1 high
expressing cells (1.4-fold induction, Supporting Information
Table 1).
TIMP-1 High Expressing Cells Are More Resistant toward Topoisomerase Inhibitors but Not toward Cisplatin
To test whether the increased protein levels and/or
phosphorylation status of the topoisomerase enzymes in
TIMP-1 high expressing cells were associated with a changed
sensitivity to targeted inhibition of the topoisomerases, we
performed cell viability assays of the high and low TIMP-1
expressing clones treated with different topoisomerase
inhibitors. Each cell line was exposed to increasing concentrations of specific topoisomerase inhibitors to analyze the cell
viability response (Figure 7A−C). The relationship between
cellular TIMP-1 protein levels and sensitivity to the TOP1
inhibitor SN-38 was statistically highly significant as determined
by mixed model analysis (P < 0.0001), with TIMP-1 high
expressing cells being significantly less sensitive to SN-38
compared to TIMP-1 low expressing cells (Figure 7A).
Figure 7. Cell viability of cells treated with chemotherapy for 48 h. (A) Cells treated with the TOP1 inhibitor SN-38 (0, 0.256, 1.28, 6.4, 32, 160, 800
nM). (B) Cells treated with the TOP2 inhibitor epirubicin (0, 0.061, 2.4, 9.8, 39, 156, 625 nM). (C) Cells treated with the TOP2B inhibitor XK 469
(0, 25, 50, 75, 100, 200, 300, 500 μM). (D) Cells treated with cisplatin (0, 3.13, 6.25, 12.50, 25, 50, 100 μM). Data are presented as percent of
untreated cells, and the concentrations of drugs are indicated in the figures.
TIMP-1 high expressing cells were also significantly (P =
0.035) less sensitive to epirubicin (general TOP2 inhibitor)
induced reduction of cell viability than TIMP-1 low
expressing cells (Figure 7B). This was confirmed by exposure of
the cells to etoposide, another TOP2 inhibitor. In full
agreement with the epirubicin data, TIMP-1 high expressing
cells were less sensitive (P < 0.0001) to etoposide-induced cell
death than TIMP-1 low expressing cells (Supporting
Information Figure 3A). As an important extension, we used
murine fibrosarcoma cell lines to generalize our findings. These
data demonstrated that SN-38 and epirubicin caused
significantly more cell death in mouse fibroblast cells
established from a TIMP-1 knockout mouse
than in wild-type mouse fibroblasts28 (Supporting
Information Figure 3B). To test whether high TIMP-1 also influenced
sensitivity to a specific TOP2B inhibitor, we exposed the cells
to the TOP2B inhibitor XK 469. The TIMP-1
high expressing cells were significantly less sensitive to this
TOP2B inhibitor (P = 0.023) than TIMP-1 low
expressing cells (Figure 7C).
The observed differences in sensitivity to chemotherapeutic
drugs could be associated with TIMP-1 mediated differences in
cellular growth. Therefore, we compared the growth of the four
MCF-7 sublines and found that the TIMP-1 overexpressing
cells had a slightly but significantly longer doubling time (25 h)
than the TIMP-1 low expressing cells (22 h) (P =
0.0003) (Figure 6B). To exclude the possibility that overexpression of TIMP-1 in MCF-7S1 cells led to a more general
chemoresistant phenotype, perhaps related to the observed
differences in growth rate, we tested the cell viability response
to the chemotherapeutic drug cisplatin, which has a different
mode of action. This drug does not target any of the
topoisomerases but cross-links DNA, thereby preventing
normal cell cycle regulation, which eventually triggers
apoptosis.58 No significant differences in cell viability (P =
0.13) between TIMP-1 low and high expressing cells were
observed following cisplatin treatment (Figure 7D). The IC50
values are shown in Table 3.
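The doubling times quoted in this section (25 h and 22 h) were calculated from exponential curves fitted to the growth data (Figure 6B); the fitting procedure itself is not described. A minimal sketch of one common approach, assuming an ordinary least-squares fit to log-transformed cell counts, is given here. The function name `doubling_time` and all cell counts are illustrative assumptions, not the authors' data or code.

```python
import math

def doubling_time(hours, counts):
    """Fit ln(count) = ln(N0) + k * t by least squares and return ln(2) / k.

    Assumes exponential growth over the whole time course (an assumption;
    real growth curves may plateau and need a restricted fitting window).
    """
    logs = [math.log(c) for c in counts]
    n = len(hours)
    mean_t = sum(hours) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(hours, logs))
    k = sxy / sxx  # growth rate per hour
    return math.log(2) / k

# Hypothetical counts at 24 h intervals, generated so that one culture
# doubles every 22 h and the other every 25 h.
hours = [0, 24, 48, 72, 96]
low_timp1 = [1e4 * 2 ** (t / 22) for t in hours]
high_timp1 = [1e4 * 2 ** (t / 25) for t in hours]

print(round(doubling_time(hours, low_timp1), 1))
print(round(doubling_time(hours, high_timp1), 1))
```

With real triplicate counts, the same fit would be run per replicate or per experiment before any statistical comparison of the doubling times.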
Table 3. IC50 Values of TOP Inhibitors and Cisplatin for the Four Low and High TIMP-1 Cell Clones^a
IC50 values for the four cell clones
SN-38        230 nM    150 nM
epirubicin   150 nM    130 nM
XK 469       460 μM    450 μM
cisplatin    52 μM     44 μM
^a The half maximal inhibitory concentration (IC50) is read from the dose−response curves for each cell line exposed to either SN-38, epirubicin, XK 469, or cisplatin.
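Table 3 states that the IC50 values were read from the dose−response curves. One way to do this numerically is sketched here, assuming linear interpolation on a log10 dose scale between the two doses that bracket 50% viability. The helper name `ic50_by_interpolation` and the viability values are hypothetical; only the SN-38 dose series is taken from the Figure 7A caption, and the zero dose is left out because it has no logarithm.

```python
import math

def ic50_by_interpolation(doses, viability_pct):
    """Return the dose at which viability first crosses 50% of untreated.

    doses: ascending, strictly positive concentrations.
    viability_pct: viability at each dose, as percent of untreated cells.
    Interpolates linearly in log10(dose); returns None if 50% is never crossed.
    """
    points = list(zip(doses, viability_pct))
    for (d_lo, v_lo), (d_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 50.0 >= v_hi and v_lo != v_hi:
            frac = (v_lo - 50.0) / (v_lo - v_hi)
            log_ic50 = math.log10(d_lo) + frac * (math.log10(d_hi) - math.log10(d_lo))
            return 10 ** log_ic50
    return None

# SN-38 doses in nM from the Figure 7A caption (zero dose omitted).
sn38_doses = [0.256, 1.28, 6.4, 32, 160, 800]
# Hypothetical viability readings, percent of untreated cells.
example_viability = [98, 95, 88, 75, 60, 40]
resistant_viability = [99, 97, 95, 90, 80, 70]

ic50_example = ic50_by_interpolation(sn38_doses, example_viability)
ic50_resistant = ic50_by_interpolation(sn38_doses, resistant_viability)
print(ic50_example, ic50_resistant)
```

A curve that never falls to 50% within the tested range yields `None` rather than an extrapolated value, which keeps the estimate inside the measured doses.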
In this study, we employed SILAC-based quantitative MS to
analyze global proteome and phosphoproteome differences between
MCF-7 breast cancer cells genetically manipulated to express
high or low levels of TIMP-1. We prioritized the investigation of
proteins potentially associated biologically with our
preclinical findings and our clinical observations that high levels
of TIMP-1 in cancer cells significantly associate with resistance
to treatment with topoisomerase inhibitors.12,22 We confirmed
the previous findings that high cellular expression of TIMP-1 is
associated with increased resistance to topoisomerase inhibitors, and we also observed that murine TIMP-1 wild-type
fibrosarcoma cells are more resistant to topoisomerase
inhibitors than their gene deficient counterparts. Moreover,
our data showed that cells expressing high levels of TIMP-1
have increased expression and/or phosphorylation of the
topoisomerases, which may explain the resistance phenotype
observed in cells with high TIMP-1 levels.
The proteomic and phosphoproteomic data revealed
regulation of hundreds of proteins and hundreds of
phosphorylation sites in cells with high TIMP-1 levels
compared to those with low levels. We mapped the most up-regulated proteins and phospho-proteins to functional classes
using IPA and found enrichment for processes that TIMP-1 is
believed to be involved in, for example, apoptosis, cell cycle,
transcription factors, DNA repair, drug transport, and drug
resistance/sensitivity.7−11,28,59 It is particularly interesting that
all the identified topoisomerases were either hyper-phosphorylated or overexpressed since this may explain why previous
studies found high TIMP-1 levels in tumor or plasma to be
associated with decreased benefit from topoisomerase inhibitor
containing chemotherapy,12−15 as both topoisomerase 1 and 2
activity is positively dependent on phosphorylation.60−62 It is
therefore intriguing to speculate that increased expression levels
and phosphorylation status of topoisomerases may cause the
chemotherapy resistance phenotype. To investigate the functional relevance of the increased protein expression and/or
phosphorylation of topoisomerases found in the two TIMP-1
high-expressing cell lines, we exposed all cell lines to TOP1 and
TOP2 inhibitors and found significantly decreased sensitivity to
both inhibitors in TIMP-1 high expressing cells.
Although there is abundant evidence that high TIMP-1 levels
are associated with topoisomerase inhibitor resistance, the
underlying mode of action is to date not clear. TIMP-1 may
bind to the cell surface and be transported into the nucleus as
shown in a previous study in MCF-7 human breast cancer
cells.63 TIMP-1 has also been shown to bind to the cell
surface proteins CD63 and CD44. The binding of TIMP-1 to
these proteins initiates intracellular signal transduction that
leads to an antiapoptotic phenotype.9,12−15,20,21,51 We found
both CD63 and CD44 to be up-regulated at the expression
level, which suggests a positive feedback mechanism. This
points to a possible route for anticancer therapeutic
interventions, as disruption of the TIMP-1 complex with
plasma membrane proteins could potentially reduce the
antiapoptotic signaling from the complex.
TIMP-1 has been suggested to initiate many different
intracellular signaling pathways, which could explain the
chemotherapy resistance phenotype seen in TIMP-1 high
expressing cells and tumors. To determine which kinases, and
hence pathways, may be hyperactivated in TIMP-1 high
expressing cells, we analyzed the kinase motifs for all the up-regulated phosphorylation sites against unchanged phosphorylation sites (reference) from the same data set and found a
bias against proline-directed kinases. This is interesting because
the proline-directed kinases, Akt and ERK, play a role in TIMP-1 overexpressing cells, and have been related to resistance to
breast cancer treatment.7−9,11,19,59,64,65 In addition, although
contradictory to earlier studies, we found TIMP-1 high
expressing cells to grow slightly but significantly slower than
TIMP-1 low expressing cells. This could be explained by the
underrepresentation of the proline-directed kinases that
promote proliferation.66−68 Consistent with this, we have
recently reported an inverse relationship between TIMP-1
protein levels and the proliferation marker Ki67 in clinical
breast cancer samples.69
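The motif comparison described above sets up-regulated phosphorylation sites (median normalized ratio ≥ 3.0) against reference sites (ratios between 0.8 and 1.2), aligned around the phosphorylated residue. The sketch below illustrates the underlying idea as a plain per-position frequency difference. It is not the IceLogo algorithm, which also assesses statistical significance; the function names and the seven-residue windows are invented for illustration, and only the ratio thresholds come from the Figure 6 caption.

```python
from collections import Counter

def select_by_ratio(sites, low, high):
    """Keep sequence windows whose ratio lies in [low, high]."""
    return [window for window, ratio in sites if low <= ratio <= high]

def position_frequencies(windows, flank):
    """Residue frequencies per position, indexed -flank..+flank around the site."""
    if not windows:
        raise ValueError("need at least one sequence window")
    freqs = {}
    for i in range(2 * flank + 1):
        counts = Counter(w[i] for w in windows)
        total = sum(counts.values())
        freqs[i - flank] = {aa: c / total for aa, c in counts.items()}
    return freqs

def frequency_difference(regulated, reference, flank):
    """Frequency in regulated windows minus frequency in reference windows."""
    reg = position_frequencies(regulated, flank)
    ref = position_frequencies(reference, flank)
    diff = {}
    for pos in reg:
        residues = set(reg[pos]) | set(ref[pos])
        diff[pos] = {aa: reg[pos].get(aa, 0.0) - ref[pos].get(aa, 0.0)
                     for aa in residues}
    return diff

# Hypothetical windows (phosphosite at the centre) with made-up SILAC ratios.
sites = [
    ("EEESPAK", 3.5), ("DEESLKR", 4.2), ("AEESGGK", 3.0),
    ("RKASPEK", 1.0), ("GLKSPTR", 0.9), ("PRRSPGA", 1.1),
    ("AAASAAA", 2.0),  # falls in neither set
]
up_regulated = select_by_ratio(sites, 3.0, float("inf"))
reference = select_by_ratio(sites, 0.8, 1.2)
diff = frequency_difference(up_regulated, reference, flank=3)

print(diff[-1].get("E"), diff[1].get("P"))
```

In this toy data, glutamic acid is enriched at position −1 and proline is depleted at position +1 in the up-regulated set, the same kind of pattern the text reports as a bias against proline-directed kinases.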
The motif analysis showed that the recognition motif for
polo-like kinases was overrepresented. Moreover, polo-like
kinase 1 (PLK1) phosphorylates TOP2A70 and NetworKIN
predicted PLK1 also to be responsible for the phosphorylation
of six up-regulated phosphorylation sites in TOP2B. PLK1 also
phosphorylates numerous other cell-cycle proteins, including
PKMYT1 and CCNB1, both of which we found to be up-regulated in cells with high TIMP-1 levels. These phosphorylations lead to inhibition of PKMYT171 and promote nuclear
import of CCNB1,72 promoting progression through M-phase.
This is consistent with the slower growth of TIMP-1 high
expressing cells and the observed increase in expression and
phosphorylation of the topoisomerases.
Our data set revealed increased expression of hundreds of
proteins in the TIMP-1 high expressing cells compared to
TIMP-1 low expressing, and it is possible that the mere
regulation of protein expression plays a role. As such, we
observed an up-regulation of several transcription factors in
TIMP-1 high expressing cells, which may explain the massive
amounts of proteins being up-regulated in these cells. Two
transcription factors belonging to the activator protein-1 (AP-1) complex family, namely, JunB and FosL2, were found to be
up-regulated in TIMP-1 high expressing cells and are present in
the STRING TIMP-1 interaction network. A previous 293 AP-1 reporter cell line study showed that exposure to recombinant
TIMP-1 resulted in elevated levels of AP-1 activity, suggesting
that TIMP-1 can activate this transcription factor complex
either directly or indirectly.19 While the PI3K/Akt/NF-κB
signaling pathway has also been proposed as a candidate in
another TIMP-1 high related TOP2 inhibitor resistant model,11
we did not observe a differential expression of NF-κB in TIMP-1 low and high expressing cells. This does not exclude that the
protein could possess different activity in different cell lines.
This study is the first global, unbiased, and quantitative
proteomic investigation of low and high TIMP-1 expressing
breast cancer cells, and it shows for the first time that
overexpression of TIMP-1 results in up-regulation and hyperphosphorylation of a number of proteins that are either directly
or indirectly associated with drug resistance mechanisms. Of
particular interest is the observed association between high
TIMP-1 protein expression and resistance to topoisomerase
inhibitors, which is likely due to the observed up-regulation
and/or hyper-phosphorylation of the three major DNA
topoisomerases, TOP1, TOP2A, and TOP2B. However, the
exact relationship between topoisomerase phosphorylation and
sensitivity to topoisomerase inhibitors remains to be elucidated.
In particular, it should be tested whether phosphorylated
topoisomerases are likely candidates as biomarkers for
topoisomerase inhibitor resistance in TIMP-1 high expressing
tumors. Importantly, our data from the experimental model
system recapitulate fundamental aspects of increased resistance
to topoisomerase inhibitors observed in vivo for TIMP-1
overexpressing cells.
Supporting Information
Additional experimental details as described in the text. This
material is available free of charge via the Internet at http://
Accession Codes
All the MS raw data associated with this manuscript can be
found at The
name of the zipfile containing all the raw files is “TIMP1_in_relation_to_drug_resistance_in_breast_cancer_cells”.
The password is pTOP_2b.
Corresponding Author
*(J.V.O.) E-mail: [email protected] Telephone: +45 35
32 50 22. Fax: +45 35 32 50 01. (J.S.) E-mail: [email protected] Telephone: +45 35 33 37 53. Fax: +45 35 33 27 55.
Author Contributions
O.H., S.M., L.F., N.B., J.V.O., and J.S.: Shared authorship.
The authors declare no competing financial interest.
The authors would like to thank Mr. Anatoliy Dmytriyev for
uploading the raw data, and Dr. Christian D. Kelstrup and Dr.
Sebastian A. Wagner for helpful discussions. Ms. Vibeke Jensen
is acknowledged for technical assistance on cell culture and
TIMP-1 analysis. We thank the Danish Natural Research
Foundation, The Danish Strategic Research Council (TIPCAT), The Medical Research Council, The Danish Cancer
Society, The Danish Center for Translational Breast Cancer
Research, and A Race Against Breast Cancer for financial
support. Work at the Center for Protein Research is supported
by a generous donation from the Novo Nordisk Foundation.
Part of this work has been funded by PRIME-XS, a seventh
Framework Programme of the European Union (Contract No.
262067-PRIME-XS). C.F. is supported by Marie Curie and
EMBO postdoctoral fellowships.
TIMP, tissue inhibitor of metalloproteinase; SILAC, stable
isotope labeling by amino acids in cell culture; TOP,
topoisomerase; FAK, focal adhesion kinase; ERK, extracellular
signal-regulated kinase; PTM, post-translational modifications;
TWT-III, TIMP-1 wild type murine fibrosarcoma cell lines;
TKO-III, TIMP-1 knock out murine fibrosarcoma cell lines;
CAA, chloroacetamide; 2,5-DHB, 2,5-dihydroxybenzoic acid;
HCD, higher energy collisional dissociation; IPI, International
Protein Index; LDH, lactate dehydrogenase; 2D PAGE, two-dimensional PAGE; lambda PP, lambda protein phosphatase;
hsp74, Heat shock 70 kDa protein 4; CLU, clusterin; mTOR,
mammalian target of rapamycin; AP-1, activator protein-1;
PLK1, polo-like kinase 1
(1) Early Breast Cancer Trialists’ Collaborative Group (EBCTCG).
Effects of chemotherapy and hormonal therapy for early breast cancer
on recurrence and 15-year survival: an overview of the randomised
trials. Lancet 2005, 365 (9472), 1687−1717.
(2) Peto, R.; Davies, C.; Godwin, J.; Gray, R.; Pan, H. C.; Clarke, M.;
Cutter, D.; Darby, S.; McGale, P.; Taylor, C.; Wang, Y. C.; Bergh, J.;
Di, L. A.; Albain, K.; Swain, S.; Piccart, M.; Pritchard, K. Comparisons
between different polychemotherapy regimens for early breast cancer:
meta-analyses of long-term outcome among 100,000 women in 123
randomised trials. Lancet 2012, 379 (9814), 432−444.
(3) Egeblad, M.; Werb, Z. New functions for the matrix
metalloproteinases in cancer progression. Nat. Rev. Cancer 2002, 2
(3), 161−174.
(4) Stetler-Stevenson, W. G. Tissue inhibitors of metalloproteinases
in cell signaling: metalloproteinase-independent biological activities.
Sci. Signaling 2008, 1 (27), re6.
(5) Wurtz, S. O.; Schrohl, A. S.; Sorensen, N. M.; Lademann, U.;
Christensen, I. J.; Mouridsen, H.; Brunner, N. Tissue inhibitor of
metalloproteinases-1 in breast cancer. Endocr.-Relat. Cancer 2005, 12
(2), 215−227.
(6) Airola, K.; Karonen, T.; Vaalamo, M.; Lehti, K.; Lohi, J.;
Kariniemi, A. L.; Keski-Oja, J.; Saarialho-Kere, U. K. Expression of
collagenases-1 and −3 and their inhibitors TIMP-1 and −3 correlates
with the level of invasion in malignant melanomas. Br. J. Cancer 1999,
80 (5−6), 733−743.
(7) Liu, X. W.; Bernardo, M. M.; Fridman, R.; Kim, H. R. Tissue
inhibitor of metalloproteinase-1 protects human breast epithelial cells
against intrinsic apoptotic cell death via the focal adhesion kinase/
phosphatidylinositol 3-kinase and MAPK signaling pathway. J. Biol.
Chem. 2003, 278 (41), 40364−40372.
(8) Liu, X. W.; Taube, M. E.; Jung, K. K.; Dong, Z.; Lee, Y. J.; Roshy,
S.; Sloane, B. F.; Fridman, R.; Kim, H. R. Tissue inhibitor of
metalloproteinase-1 protects human breast epithelial cells from
extrinsic cell death: a potential oncogenic activity of tissue inhibitor
of metalloproteinase-1. Cancer Res. 2005, 65 (3), 898−906.
(9) Jung, K. K.; Liu, X. W.; Chirco, R.; Fridman, R.; Kim, H. R.
Identification of CD63 as a tissue inhibitor of metalloproteinase-1
interacting cell surface protein. EMBO J. 2006, 25 (17), 3934−3942.
(10) Wang, T.; Lv, J. H.; Zhang, X. F.; Li, C. J.; Han, X.; Sun, Y. J.
Tissue inhibitor of metalloproteinase-1 protects MCF-7 breast cancer
cells from paclitaxel-induced apoptosis by decreasing the stability of
cyclin B1. Int. J. Cancer 2010, 126 (2), 362−370.
(11) Fu, Z. Y.; Lv, J. H.; Ma, C. Y.; Yang, D. P.; Wang, T. Tissue
inhibitor of metalloproteinase-1 decreased chemosensitivity of MDA-435 breast cancer cells to chemotherapeutic drugs through the PI3K/
AKT/NF-κB pathway. Biomed. Pharmacother. 2011, 65
(3), 163−167.
(12) Willemoe, G. L.; Hertel, P. B.; Bartels, A.; Jensen, M. B.; Balslev,
E.; Rasmussen, B. B.; Mouridsen, H.; Ejlertsen, B.; Brunner, N. Lack of
TIMP-1 tumour cell immunoreactivity predicts effect of adjuvant
anthracycline-based chemotherapy in patients (n = 647) with primary
breast cancer. A Danish Breast Cancer Cooperative Group Study. Eur.
J. Cancer 2009, 45 (14), 2528−2536.
(13) Ejlertsen, B.; Jensen, M. B.; Nielsen, K. V.; Balslev, E.;
Rasmussen, B. B.; Willemoe, G. L.; Hertel, P. B.; Knoop, A. S.;
Mouridsen, H. T.; Brunner, N. HER2, TOP2A, and TIMP-1 and
responsiveness to adjuvant anthracycline-containing chemotherapy in
high-risk breast cancer patients. J. Clin. Oncol. 2010, 28 (6), 984−990.
(14) Sorensen, N. M.; Bystrom, P.; Christensen, I. J.; Berglund, A.;
Nielsen, H. J.; Brunner, N.; Glimelius, B. TIMP-1 is significantly
associated with objective response and survival in metastatic colorectal
cancer patients receiving combination of irinotecan, 5-fluorouracil, and
folinic acid. Clin. Cancer Res. 2007, 13 (14), 4117−4122.
(15) Frederiksen, C.; Qvortrup, C.; Christensen, I. J.; Glimelius, B.;
Berglund, A.; Jensen, B. V.; Nielsen, S. E.; Keldsen, N.; Nielsen, H. J.;
Brunner, N.; Pfeiffer, P. Plasma TIMP-1 levels and treatment outcome
in patients treated with XELOX for metastatic colorectal cancer. Ann.
Oncol. 2011, 22 (2), 369−375.
(16) Schrohl, A. S.; Christensen, I. J.; Pedersen, A. N.; Jensen, V.;
Mouridsen, H.; Murphy, G.; Foekens, J. A.; Brunner, N.; Holten-Andersen, M. N. Tumor tissue concentrations of the proteinase
inhibitors tissue inhibitor of metalloproteinases-1 (TIMP-1) and
plasminogen activator inhibitor type 1 (PAI-1) are complementary in
determining prognosis in primary breast cancer. Mol. Cell. Proteomics
2003, 2 (3), 164−172.
(17) Wurtz, S. O.; Christensen, I. J.; Schrohl, A. S.; Mouridsen, H.;
Lademann, U.; Jensen, V.; Brunner, N. Measurement of the
uncomplexed fraction of tissue inhibitor of metalloproteinases-1 in
the prognostic evaluation of primary breast cancer patients. Mol. Cell.
Proteomics 2005, 4 (4), 483−491.
(18) Birgisson, H.; Nielsen, H. J.; Christensen, I. J.; Glimelius, B.;
Brunner, N. Preoperative plasma TIMP-1 is an independent
prognostic indicator in patients with primary colorectal cancer: a
prospective validation study. Eur. J. Cancer 2010, 46 (18), 3323−3331.
(19) Bigelow, R. L.; Williams, B. J.; Carroll, J. L.; Daves, L. K.;
Cardelli, J. A. TIMP-1 overexpression promotes tumorigenesis of
MDA-MB-231 breast cancer cells and alters expression of a subset of
cancer promoting genes in vivo distinct from those observed in vitro.
Breast Cancer Res. Treat. 2009, 117 (1), 31−44.
(20) Stilley, J. A.; Sharpe-Timms, K. L. TIMP1 contributes to ovarian
anomalies in both an MMP-dependent and -independent manner in a
rat model. Biol. Reprod. 2012, 86 (2), 47.
(21) Egea, V.; Zahler, S.; Rieth, N.; Neth, P.; Popp, T.; Kehe, K.;
Jochum, M.; Ries, C. Tissue inhibitor of metalloproteinase-1 (TIMP-1) regulates mesenchymal stem cells through let-7f microRNA and
Wnt/beta-catenin signaling. Proc. Natl. Acad. Sci. U.S.A. 2012, 109 (6),
(22) Schrohl, A. S.; Meijer-van Gelder, M. E.; Holten-Andersen, M.
N.; Christensen, I. J.; Look, M. P.; Mouridsen, H. T.; Brunner, N.;
Foekens, J. A. Primary tumor levels of tissue inhibitor of metalloproteinases-1 are predictive of resistance to chemotherapy in patients
with metastatic breast cancer. Clin. Cancer Res. 2006, 12 (23), 7054−
(23) Sorensen, N. M.; Schrohl, A. S.; Jensen, V.; Christensen, I. J.;
Nielsen, H. J.; Brunner, N. Comparative studies of tissue inhibitor of
metalloproteinases-1 in plasma, serum and tumour tissue extracts from
patients with primary colorectal cancer. Scand. J. Gastroenterol. 2008,
43 (2), 186−191.
(24) Cox, J.; Mann, M. Quantitative, high-resolution proteomics for
data-driven systems biology. Annu. Rev. Biochem. 2011, 80, 273−299.
(25) Ong, S. E.; Blagoev, B.; Kratchmarova, I.; Kristensen, D. B.;
Steen, H.; Pandey, A.; Mann, M. Stable isotope labeling by amino acids
in cell culture, SILAC, as a simple and accurate approach to expression
proteomics. Mol. Cell. Proteomics 2002, 1 (5), 376−386.
(26) Jaattela, M.; Benedict, M.; Tewari, M.; Shayman, J. A.; Dixit, V.
M. Bcl-x and Bcl-2 inhibit TNF and Fas-induced apoptosis and
activation of phospholipase A2 in breast carcinoma cells. Oncogene
1995, 10 (12), 2297−2305.
(27) Cox, J.; Matic, I.; Hilger, M.; Nagaraj, N.; Selbach, M.; Olsen, J.
V.; Mann, M. A practical guide to the MaxQuant computational
platform for SILAC-based quantitative proteomics. Nat. Protoc. 2009,
4 (5), 698−705.
(28) Davidsen, M. L.; Wurtz, S. O.; Romer, M. U.; Sorensen, N. M.;
Johansen, S. K.; Christensen, I. J.; Larsen, J. K.; Offenberg, H.;
Brunner, N.; Lademann, U. TIMP-1 gene deficiency increases tumour
cell sensitivity to chemotherapy-induced apoptosis. Br. J. Cancer 2006,
95 (8), 1114−1120.
(29) Macek, B.; Mann, M.; Olsen, J. V. Global and site-specific
quantitative phosphoproteomics: principles and applications. Annu.
Rev. Pharmacol. Toxicol. 2009, 49, 199−221.
(30) Olsen, J. V.; Macek, B. High accuracy mass spectrometry in
large-scale analysis of protein phosphorylation. Methods Mol. Biol.
2009, 492, 131−142.
(31) Olsen, J. V.; Schwartz, J. C.; Griep-Raming, J.; Nielsen, M. L.;
Damoc, E.; Denisov, E.; Lange, O.; Remes, P.; Taylor, D.; Splendore,
M.; Wouters, E. R.; Senko, M.; Makarov, A.; Mann, M.; Horning, S. A
dual pressure linear ion trap Orbitrap instrument with very high
sequencing speed. Mol. Cell. Proteomics 2009, 8 (12), 2759−2769.
(32) Olsen, J. V.; de Godoy, L. M.; Li, G.; Macek, B.; Mortensen, P.;
Pesch, R.; Makarov, A.; Lange, O.; Horning, S.; Mann, M. Parts per
million mass accuracy on an Orbitrap mass spectrometer via lock mass
injection into a C-trap. Mol. Cell. Proteomics 2005, 4 (12), 2010−2021.
(33) Olsen, J. V.; Macek, B.; Lange, O.; Makarov, A.; Horning, S.;
Mann, M. Higher-energy C-trap dissociation for peptide modification
analysis. Nat. Methods 2007, 4 (9), 709−712.
(34) Cox, J.; Mann, M. MaxQuant enables high peptide identification
rates, individualized p.p.b.-range mass accuracies and proteome-wide
protein quantification. Nat. Biotechnol. 2008, 26 (12), 1367−1372.
(35) Olsen, J. V.; Blagoev, B.; Gnad, F.; Macek, B.; Kumar, C.;
Mortensen, P.; Mann, M. Global, in vivo, and site-specific
phosphorylation dynamics in signaling networks. Cell 2006, 127 (3),
(36) Szklarczyk, D.; Franceschini, A.; Kuhn, M.; Simonovic, M.;
Roth, A.; Minguez, P.; Doerks, T.; Stark, M.; Muller, J.; Bork, P.;
Jensen, L. J.; von, M. C. The STRING database in 2011: functional
interaction networks of proteins, globally integrated and scored.
Nucleic Acids Res. 2011, 39 (Database issue), D561−D568.
(37) von, M. C.; Jensen, L. J.; Snel, B.; Hooper, S. D.; Krupp, M.;
Foglierini, M.; Jouffre, N.; Huynen, M. A.; Bork, P. STRING: known
and predicted protein-protein associations, integrated and transferred
across organisms. Nucleic Acids Res. 2005, 33 (Database issue), D433−
(38) Taboureau, O.; Nielsen, S. K.; Audouze, K.; Weinhold, N.;
Edsgard, D.; Roque, F. S.; Kouskoumvekaki, I.; Bora, A.; Curpan, R.;
Jensen, T. S.; Brunak, S.; Oprea, T. I. ChemProt: a disease chemical
biology database. Nucleic Acids Res. 2011, 39 (Database issue), D367−
(39) Knox, C.; Law, V.; Jewison, T.; Liu, P.; Ly, S.; Frolkis, A.; Pon,
A.; Banco, K.; Mak, C.; Neveu, V.; Djoumbou, Y.; Eisner, R.; Guo, A.
C.; Wishart, D. S. DrugBank 3.0: a comprehensive resource for ’omics’
research on drugs. Nucleic Acids Res. 2011, 39 (Database issue),
(40) Smoot, M. E.; Ono, K.; Ruscheinski, J.; Wang, P. L.; Ideker, T.
Cytoscape 2.8: new features for data integration and network
visualization. Bioinformatics 2011, 27 (3), 431−432.
(41) Warsow, G.; Greber, B.; Falk, S. S.; Harder, C.; Siatkowski, M.;
Schordan, S.; Som, A.; Endlich, N.; Scholer, H.; Repsilber, D.; Endlich,
K.; Fuellen, G. ExprEssence–revealing the essence of differential
experimental data in the context of an interaction/regulation network.
BMC Syst. Biol. 2010, 4, 164.
(42) Colaert, N.; Helsens, K.; Martens, L.; Vandekerckhove, J.;
Gevaert, K. Improved visualization of protein consensus sequences by
iceLogo. Nat. Methods 2009, 6 (11), 786−787.
(43) Linding, R.; Jensen, L. J.; Ostheimer, G. J.; van Vugt, M. A.;
Jorgensen, C.; Miron, I. M.; Diella, F.; Colwill, K.; Taylor, L.; Elder, K.;
Metalnikov, P.; Nguyen, V.; Pasculescu, A.; Jin, J.; Park, J. G.; Samson,
L. D.; Woodgett, J. R.; Russell, R. B.; Bork, P.; Yaffe, M. B.; Pawson, T.
Systematic discovery of in vivo phosphorylation networks. Cell 2007,
129 (7), 1415−1426.
(44) Miller, M. L.; Jensen, L. J.; Diella, F.; Jorgensen, C.; Tinti, M.;
Li, L.; Hsiung, M.; Parker, S. A.; Bordeaux, J.; Sicheritz-Ponten, T.;
Olhovsky, M.; Pasculescu, A.; Alexander, J.; Knapp, S.; Blom, N.; Bork,
P.; Li, S.; Cesareni, G.; Pawson, T.; Turk, B. E.; Yaffe, M. B.; Brunak,
S.; Linding, R. Linear motif atlas for phosphorylation-dependent
signaling. Sci. Signaling 2008, 1 (35), ra2.
(45) Moller, S. N.; Dowell, B. L.; Stewart, K. D.; Jensen, V.; Larsen,
L.; Lademann, U.; Murphy, G.; Nielsen, H. J.; Brunner, N.; Davis, G. J.
Establishment and characterization of 7 new monoclonal antibodies to
tissue inhibitor of metalloproteinases-1. Tumour Biol. 2005, 26 (2),
(46) Holten-Andersen, M. N.; Murphy, G.; Nielsen, H. J.; Pedersen,
A. N.; Christensen, I. J.; Hoyer-Hansen, G.; Brunner, N.; Stephens, R.
W. Quantitation of TIMP-1 in plasma of healthy blood donors and
patients with advanced cancer. Br. J. Cancer 1999, 80 (3−4), 495−503.
(47) Cabezon, T.; Gromova, I.; Gromov, P.; Serizawa, R.;
Timmermans, W., V; Kroman, N.; Celis, J. E.; Moreira, J. M.
Proteomic Profiling of Triple-negative Breast Carcinomas in
Combination With a Three-tier Orthogonal Technology Approach
Identifies Mage-A4 as Potential Therapeutic Target in Estrogen
Receptor Negative Breast Cancer. Mol. Cell. Proteomics 2013, 12 (2),
(48) Pines, A.; Kelstrup, C. D.; Vrouwe, M. G.; Puigvert, J. C.; Typas,
D.; Misovic, B.; de, G. A.; von, S. L.; van de Water, B.; Danen, E. H.;
Vrieling, H.; Mullenders, L. H.; Olsen, J. V. Global phosphoproteome
profiling reveals unanticipated networks responsive to cisplatin
treatment of embryonic stem cells. Mol. Cell. Biol. 2011, 31 (24),
(49) de Godoy, L. M.; Olsen, J. V.; Cox, J.; Nielsen, M. L.; Hubner,
N. C.; Frohlich, F.; Walther, T. C.; Mann, M. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid
yeast. Nature 2008, 455 (7217), 1251−1254.
(50) Selbach, M.; Schwanhausser, B.; Thierfelder, N.; Fang, Z.;
Khanin, R.; Rajewsky, N. Widespread changes in protein synthesis
induced by microRNAs. Nature 2008, 455 (7209), 58−63.
(51) Lambert, E.; Bridoux, L.; Devy, J.; Dasse, E.; Sowa, M. L.; Duca,
L.; Hornebeck, W.; Martiny, L.; Petitfrere-Charpentier, E. TIMP-1
binding to proMMP-9/CD44 complex localized at the cell surface
promotes erythroid cell survival. Int. J. Biochem. Cell Biol. 2009, 41 (5),
(52) Lourda, M.; Trougakos, I. P.; Gonos, E. S. Development of
resistance to chemotherapeutic drugs in human osteosarcoma cell lines
largely depends on up-regulation of Clusterin/Apolipoprotein J. Int. J.
Cancer 2007, 120 (3), 611−622.
(53) Mizutani, K.; Matsumoto, K.; Hasegawa, N.; Deguchi, T.;
Nozawa, Y. Expression of clusterin, XIAP and survivin, and their
changes by camptothecin (CPT) treatment in CPT-resistant PC-3 and
CPT-sensitive LNCaP cells. Exp. Oncol. 2006, 28 (3), 209−215.
(54) Wang, Y. H.; Huang, M. L. Organogenesis and tumorigenesis:
insight from the JAK/STAT pathway in the Drosophila eye. Dev. Dyn.
2010, 239 (10), 2522−2533.
(55) Bucher, N.; Britten, C. D. G2 checkpoint abrogation and
checkpoint kinase-1 targeting in the treatment of cancer. Br. J. Cancer
2008, 98 (3), 523−528.
(56) Gaur, S.; Chen, L.; Yang, L.; Wu, X.; Un, F.; Yen, Y. Inhibitors
of mTOR overcome drug resistance from topoisomerase II inhibitors
in solid tumors. Cancer Lett. 2011, 311 (1), 20−28.
(57) Ubersax, J. A.; Ferrell, J. E., Jr. Mechanisms of specificity in
protein phosphorylation. Nat. Rev. Mol. Cell Biol. 2007, 8 (7), 530−
(58) Alborzinia, H.; Can, S.; Holenya, P.; Scholl, C.; Lederer, E.;
Kitanovic, I.; Wolfl, S. Real-time monitoring of cisplatin-induced cell
death. PLoS One 2011, 6 (5), e19714.
(59) Li, G.; Fridman, R.; Kim, H. R. Tissue inhibitor of
metalloproteinase-1 inhibits apoptosis of human breast epithelial
cells. Cancer Res. 1999, 59 (24), 6267−6275.
(60) Ackerman, P.; Glover, C. V.; Osheroff, N. Phosphorylation of
DNA topoisomerase II by casein kinase II: modulation of eukaryotic
topoisomerase II activity in vitro. Proc. Natl. Acad. Sci. U.S.A. 1985, 82
(10), 3164−3168.
(61) Bandyopadhyay, K.; Gjerset, R. A. Protein kinase CK2 is a
central regulator of topoisomerase I hyperphosphorylation and
camptothecin sensitivity in cancer cell lines. Biochemistry 2011, 50
(5), 704−714.
(62) Hackbarth, J. S.; Galvez-Peralta, M.; Dai, N. T.; Loegering, D.
A.; Peterson, K. L.; Meng, X. W.; Karnitz, L. M.; Kaufmann, S. H.
Mitotic phosphorylation stimulates DNA relaxation activity of human
topoisomerase I. J. Biol. Chem. 2008, 283 (24), 16711−16722.
(63) Ritter, L. M.; Garfield, S. H.; Thorgeirsson, U. P. Tissue
inhibitor of metalloproteinases-1 (TIMP-1) binds to the cell surface
and translocates to the nucleus of human MCF-7 breast carcinoma
cells. Biochem. Biophys. Res. Commun. 1999, 257 (2), 494−499.
(64) Baselga, J. Targeting the phosphoinositide-3 (PI3) kinase
pathway in breast cancer. Oncologist 2011, 16 (Suppl 1), 12−19.
(65) McCubrey, J. A.; Steelman, L. S.; Chappell, W. H.; Abrams, S.
L.; Wong, E. W.; Chang, F.; Lehmann, B.; Terrian, D. M.; Milella, M.;
Tafuri, A.; Stivala, F.; Libra, M.; Basecke, J.; Evangelisti, C.; Martelli, A.
M.; Franklin, R. A. Roles of the Raf/MEK/ERK pathway in cell
growth, malignant transformation and drug resistance. Biochim.
Biophys. Acta 2007, 1773 (8), 1263−1284.
(66) Hayakawa, T.; Yamashita, K.; Tanzawa, K.; Uchijima, E.; Iwata,
K. Growth-promoting activity of tissue inhibitor of metalloproteinases-1 (TIMP-1) for a wide range of cells. A possible new growth factor in
serum. FEBS Lett. 1992, 298 (1), 29−32.
(67) Luparello, C.; Avanzato, G.; Carella, C.; Pucci-Minafra, I. Tissue
inhibitor of metalloprotease (TIMP)-1 and proliferative behaviour of
clonal breast cancer cells. Breast Cancer Res. Treat. 1999, 54 (3), 235−
(68) Peng, L.; Yanjiao, M.; Ai-guo, W.; Pengtao, G.; Jianhua, L.; Ju,
Y.; Hongsheng, O.; Xichen, Z. A fine balance between CCNL1 and
TIMP1 contributes to the development of breast cancer cells. Biochem.
Biophys. Res. Commun. 2011, 409 (2), 344−349.
(69) Bjerre, C.; Knoop, A.; Bjerre, K.; Larsen, M. S.; Henriksen, K. L.;
Lyng, M. B.; Ditzel, H. J.; Rasmussen, B. B.; Brunner, N.; Ejlertsen, B.;
Laenkholm, A. V. Association of tissue inhibitor of metalloproteinases-1 and Ki67 in estrogen receptor positive breast cancer. Acta Oncol.
2013, 52 (1), 82−90.
(70) Li, H.; Wang, Y.; Liu, X. Plk1-dependent phosphorylation
regulates functions of DNA topoisomerase IIalpha in cell cycle
progression. J. Biol. Chem. 2008, 283 (10), 6209−6221.
(71) Nakajima, H.; Toyoshima-Morimoto, F.; Taniguchi, E.; Nishida,
E. Identification of a consensus motif for Plk (Polo-like kinase)
phosphorylation reveals Myt1 as a Plk1 substrate. J. Biol. Chem. 2003,
278 (28), 25277−25280.
(72) Toyoshima-Morimoto, F.; Taniguchi, E.; Shinya, N.; Iwamatsu,
A.; Nishida, E. Polo-like kinase 1 phosphorylates cyclin B1 and targets
it to the nucleus during prophase. Nature 2001, 410 (6825), 215−220.
4151 | J. Proteome Res. 2013, 12, 4136−4151
Crick, F. H. C. The biological replication of macromolecules. Symp. Soc. Exp. Biol
XII, 138 (1958). 3
Crick, F. Central dogma of molecular biology. Nature 227, 561–3 (1970). URL 3, 4
Kuska, B. Beer, bethesda, and biology: how “genomics” came into being. J Natl
Cancer Inst 90, 93 (1998). URL 5
Watson, J. D. & Crick, F. H. Molecular structure of nucleic acids; a structure for
deoxyribose nucleic acid. Nature 171, 737–8 (1953). URL
http://www.ncbi.nlm.nih.gov/pubmed/13054692. 5
Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. Quantitative monitoring of
gene expression patterns with a complementary dna microarray. Science 270, 467–70
(1995). URL 5
Consortium, C. e. S. Genome sequence of the nematode c. elegans: a platform for
investigating biology. Science 282, 2012–8 (1998). URL
http://www.ncbi.nlm.nih.gov/pubmed/9851916. 6
Arabidopsis Genome, I. Analysis of the genome sequence of the flowering plant
arabidopsis thaliana. Nature 408, 796–815 (2000). URL
http://www.ncbi.nlm.nih.gov/pubmed/11130711. 6
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature
409, 860–921 (2001). URL 6
Sanger, F. & Coulson, A. R. A rapid method for determining sequences in dna
by primed synthesis with dna polymerase. J Mol Biol 94, 441–8 (1975). URL 6
Sanger, F., Nicklen, S. & Coulson, A. R. Dna sequencing with chain-terminating
inhibitors. Proc Natl Acad Sci U S A 74, 5463–7 (1977). URL http://www.ncbi.nlm. 6, 9
Ware, J. S., Roberts, A. M. & Cook, S. A. Next generation sequencing for clinical
diagnostics and personalised medicine: implications for the next generation cardiologist. Heart 98, 276–81 (2012). URL
Korlach, J. et al. Long, processive enzymatic dna synthesis using 100% dye-labeled
terminal phosphate-linked nucleotides. Nucleosides Nucleotides Nucleic Acids 27,
1072–83 (2008). URL 7
Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre
reactors. Nature 437, 376–80 (2005). URL
16056220. 8
Valouev, A. et al. A high-resolution, nucleosome position map of c. elegans reveals
a lack of universal sequence-dictated positioning. Genome Res 18, 1051–63 (2008).
Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome defined
by rna sequencing. Science 320, 1344–9 (2008). URL
pubmed/18451266. 9, 17
Down, T. A. et al. A bayesian deconvolution strategy for immunoprecipitation-based
dna methylome analysis. Nat Biotechnol 26, 779–85 (2008). URL http://www.ncbi. 10, 22
Smith, Z. D., Gu, H., Bock, C., Gnirke, A. & Meissner, A. High-throughput bisulfite
sequencing in mammalian genomes. Methods 48, 226–32 (2009). URL http://www. 10, 22
Ren, B. et al. Genome-wide location and function of dna binding proteins. Science
290, 2306–9 (2000). URL 10
Song, L. & Crawford, G. E. Dnase-seq: a high-resolution technique for mapping
active gene regulatory elements across the genome from mammalian cells. Cold
Spring Harb Protoc 2010, pdb prot5384 (2010). URL
pubmed/20150147. 10
Waki, H. et al. Global mapping of cell type-specific open chromatin by faire-seq
reveals the regulatory role of the nfi family in adipocyte differentiation. PLoS Genet
7, e1002311 (2011). URL 10
Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer
traces using phred. i. accuracy assessment. Genome Res 8, 175–85 (1998). URL 11
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol 10,
R25 (2009). URL 11
Li, H. & Durbin, R. Fast and accurate short read alignment with burrows-wheeler
transform. Bioinformatics 25, 1754–60 (2009). URL
pubmed/19451168. 11
Li, H. et al. The sequence alignment/map format and samtools. Bioinformatics 25,
2078–9 (2009). URL 11, 12, 118
Nielsen, R., Paul, J. S., Albrechtsen, A. & Song, Y. S. Genotype and snp calling
from next-generation sequencing data. Nat Rev Genet 12, 443–51 (2011). URL 12
McKenna, A. et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation dna sequencing data. Genome Res 20, 1297–303 (2010). URL 12, 118
Danecek, P. et al. The variant call format and vcftools. Bioinformatics 27, 2156–8
(2011). URL 13
van den Oord, E. J. Controlling false discoveries in genetic studies. Am J Med Genet
B Neuropsychiatr Genet 147B, 637–44 (2008). URL
pubmed/18092307. 14
Purcell, S. et al. Plink: a tool set for whole-genome association and population-based
linkage analyses. Am J Hum Genet 81, 559–75 (2007). URL http://www.ncbi.nlm. 14
van der Sluis, S., Posthuma, D. & Dolan, C. V. Tates: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet 9, e1003235
(2013). URL 14
Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint
method for genome-wide association studies by imputation of genotypes. Nat Genet
39, 906–13 (2007). URL 14
Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic
phase. Am J Hum Genet 78, 629–44 (2006). URL
pubmed/16532393. 14
Stephens, M. & Donnelly, P. A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73, 1162–9 (2003).
URL 14
Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS
Genet 5, e1000529 (2009). URL 14
Browning, B. L. & Browning, S. R. A unified approach to genotype imputation and
haplotype-phase inference for large data sets of trios and unrelated individuals. Am
J Hum Genet 84, 210–23 (2009). URL
Uh, H. W. et al. How to deal with the early gwas data when imputing and combining
different arrays is necessary. Eur J Hum Genet 20, 572–6 (2012). URL http://www. 14
Mullis, K. B. & Faloona, F. A. Specific synthesis of dna in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol 155, 335–50 (1987). URL http://www. 15
Okou, D. T. et al. Microarray-based genomic selection for high-throughput resequencing. Nat Methods 4, 907–9 (2007). URL
17934469. 15
Albert, T. J. et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods 4, 903–5 (2007). URL
17934467. 15
Hodges, E. et al. Genome-wide in situ exon capture for selective resequencing. Nat
Genet 39, 1522–7 (2007). URL 15
Bau, S. et al. Targeted next-generation sequencing by specific capture of multiple
genomic loci using low-volume microfluidic dna arrays. Anal Bioanal Chem 393,
171–5 (2009). URL 15
Meyer, M., Stenzel, U., Myles, S., Prufer, K. & Hofreiter, M. Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res 35, e97
(2007). URL 15
Jordan, B. Historical background and anticipated developments. Ann N Y Acad Sci
975, 24–32 (2002). URL 16
Stoughton, R. B. Applications of dna microarrays in biology. Annu Rev Biochem
74, 53–82 (2005). URL 16
Dufva, M. Fabrication of dna microarray. Methods Mol Biol 529, 63–79 (2009).
URL 16
Hardiman, G. Microarray platforms–comparisons and contrasts. Pharmacogenomics
5, 487–502 (2004). URL 16
Tanaka, A. et al. All-in-one tube method for quantitative gene expression analysis
in oligo-dt(30) immobilized pcr tube coated with mpc polymer. Anal Sci 25, 109–14
(2009). URL 16
Smyth, G. K. Limma: linear models for microarray data, 397–420 (Springer, 2005).
Boguski, M. S., Tolstoshev, C. M. & Bassett, D. E., Jr. Gene discovery in dbest.
Science 265, 1993–4 (1994). URL 17
Velculescu, V. E., Zhang, L., Vogelstein, B. & Kinzler, K. W. Serial analysis of gene
expression. Science 270, 484–7 (1995). URL
7570003. 18
Kodzius, R. et al. Cage: cap analysis of gene expression. Nat Methods 3, 211–22
(2006). URL 18
Brenner, S. et al. Gene expression analysis by massively parallel signature sequencing
(mpss) on microbead arrays. Nat Biotechnol 18, 630–4 (2000). URL http://www.ncbi. 18
Siddiqui, A. S. et al. A mouse atlas of gene expression: large-scale digital gene-expression profiles from precisely defined developing c57bl/6j mouse tissues and cells.
Proc Natl Acad Sci U S A 102, 18485–90 (2005). URL
pubmed/16352711. 18
Hegedus, Z. et al. Deep sequencing of the zebrafish transcriptome response to mycobacterium infection. Mol Immunol 46, 2918–30 (2009). URL http://www.ncbi.nlm. 18
't Hoen, P. A. et al. Deep sequencing-based expression analysis shows major advances
in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Res 36, e141 (2008). URL
Morrissy, A. S. et al. Next-generation tag sequencing for cancer gene expression
profiling. Genome Res 19, 1825–35 (2009). URL
19541910. 18
Kircher, M., Heyn, P. & Kelso, J. Addressing challenges in the production and
analysis of illumina sequencing data. BMC Genomics 12, 382 (2011). URL http:
// 19
Morrissy, S. et al. Digital gene expression by tag sequencing on the illumina genome
analyzer. Curr Protoc Hum Genet Chapter 11, Unit 11 11 1–36 (2010). URL 18
Anders, S. Htseq: Analysing high-throughput sequencing data with python. 18
Roberts, A., Pimentel, H., Trapnell, C. & Pachter, L. Identification of novel transcripts in annotated genomes using rna-seq. Bioinformatics 27, 2325–9 (2011). URL 18, 20
R Core Team. R: A language and environment for statistical computing (2013). URL 18
Anders, S. & Huber, W. Differential expression analysis for sequence count data.
Genome Biol 11, R106 (2010). URL
18, 20
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edger: a bioconductor package
for differential expression analysis of digital gene expression data. Bioinformatics
26, 139–40 (2010). URL 20
Malone, J. H. & Oliver, B. Microarrays, deep sequencing and the true measure of the
transcriptome. BMC Biol 9, 34 (2011). URL
21627854. 20
Matouk, C. C. & Marsden, P. A. Epigenetic regulation of vascular endothelial gene
expression. Circ Res 102, 873–87 (2008). URL
18436802. 21
Friso, S. et al. Global dna hypomethylation in peripheral blood mononuclear cells as
a biomarker of cancer risk. Cancer Epidemiol Biomarkers Prev 22, 348–55 (2013).
URL 21
Shi, H. et al. Expressed cpg island sequence tag microarray for dual screening of
dna hypermethylation and gene silencing in cancer cells. Cancer Res 62, 3214–20
(2002). URL 22
Wolff, G. L., Kodell, R. L., Moore, S. R. & Cooney, C. A. Maternal epigenetics and
methyl supplements affect agouti gene expression in avy/a mice. Faseb Journal 12,
949–57 (1998). URL 22
Egger, G., Liang, G., Aparicio, A. & Jones, P. A. Epigenetics in human disease and
prospects for epigenetic therapy. Nature 429, 457–63 (2004). URL http://www.ncbi. 22
Razin, A. & Riggs, A. D. Dna methylation and gene function. Science 210, 604–10
(1980). URL 22
Jaenisch, R. Dna methylation and imprinting: why bother? Trends Genet 13, 323–9
(1997). URL 22
Bestor, T. H. The dna methyltransferases of mammals. Hum Mol Genet 9, 2395–402
(2000). URL 22
Bibikova, M. et al. Genome-wide dna methylation profiling using infinium(r) assay.
Epigenomics 1, 177–200 (2009). URL
Brinkman, A. B. et al. Whole-genome dna methylation profiling using methylcap-seq. Methods 52, 232–6 (2010). URL
Stevens, M. et al. Estimating absolute methylation levels at single-cpg resolution
from methylation enrichment and restriction enzyme sequencing methods. Genome
Res 23, 1541–53 (2013). URL 22
Gu, H. et al. Genome-scale dna methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 7, 133–6 (2010). URL
http://www.ncbi.nlm.nih.gov/pubmed/20062050. 22
Bock, C. et al. Quantitative comparison of genome-wide dna methylation mapping
technologies. Nat Biotechnol 28, 1106–14 (2010). URL
pubmed/20852634. 22
Li, C. C. et al. A sustained dietary change increases epigenetic variation in isogenic
mice. PLoS Genet 7, e1001380 (2011). URL
21541011. 23
Milagro, F. I. et al. A dual epigenomic approach for the search of obesity biomarkers:
Dna methylation in relation to diet-induced weight loss. Faseb Journal 25, 1378–89
(2011). URL 23
Dabelea, D. & Crume, T. Maternal environment and the transgenerational cycle of
obesity and diabetes. Diabetes 60, 1849–55 (2011). URL
http://www.ncbi.nlm.nih.gov/pubmed/21709280. 23
Boks, M. P. et al. Current status and future prospects for epigenetic psychopharmacology. Epigenetics 7, 20–8 (2012). URL
22207355. 23
Ong, S. E., Foster, L. J. & Mann, M. Mass spectrometric-based approaches in
quantitative proteomics. Methods 29, 124–30 (2003). URL
http://www.ncbi.nlm.nih.gov/pubmed/12606218. 24
Anderson, L. & Seilhamer, J. A comparison of selected mrna and protein abundances
in human liver. Electrophoresis 18, 533–7 (1997). URL
pubmed/9150937. 24
Cohen, P. The role of protein phosphorylation in human health and disease. the
sir hans krebs medal lecture. Eur J Biochem 268, 5001–10 (2001). URL http:
// 24
Manning, G., Whyte, D. B., Martinez, R., Hunter, T. & Sudarsanam, S. The protein
kinase complement of the human genome. Science 298, 1912–34 (2002). URL http:
// 24
Cohen, P. T. Protein phosphatase 1–targeted in many directions. J Cell Sci 115,
241–56 (2002). URL 24
Ong, S. E. et al. Stable isotope labeling by amino acids in cell culture, silac, as
a simple and accurate approach to expression proteomics. Mol Cell Proteomics 1,
376–86 (2002). URL 24
Gruhler, A. et al. Quantitative phosphoproteomics applied to the yeast pheromone
signaling pathway. Mol Cell Proteomics 4, 310–27 (2005). URL http://www.ncbi. 24
Yang, J. & Honavar, V. Feature subset selection using a genetic algorithm, 117–136
(Springer, 1998). 25
McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous
activity. The bulletin of mathematical biophysics 115–133 (1943). 25
Murphy, K. P. Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series) (2012). 28
Sun, Z., Rao, X., Peng, L. & Xu, D. Prediction of protein supersecondary structures
based on the artificial neural network method. Protein Eng 10, 763–9 (1997). URL 28
Blom, N., Sicheritz-Ponten, T., Gupta, R., Gammeltoft, S. & Brunak, S. Prediction
of post-translational glycosylation and phosphorylation of proteins from the amino
acid sequence. Proteomics 4, 1633–49 (2004). URL
pubmed/15174133. 28
Saha, S. & Raghava, G. P. Prediction of continuous b-cell epitopes in an antigen
using recurrent neural network. Proteins 65, 40–8 (2006). URL http://www.ncbi.nlm. 28
Eftekhar, B., Mohammad, K., Ardebili, H. E., Ghodsi, M. & Ketabchi, E. Comparison of artificial neural network and logistic regression models for prediction
of mortality in head trauma based on initial clinical data. BMC Med Inform
Decis Mak 5, 3 (2005). URL
// 28
Meisler, M. H. Evolutionarily conserved noncoding dna in the human genome: how
much and what for? Genome Res 11, 1617–8 (2001). URL
http://www.ncbi.nlm.nih.gov/pubmed/11591637. 29
McLaren, W. et al. Deriving the consequences of genomic variants with the ensembl
api and snp effect predictor. Bioinformatics 26, 2069–70 (2010). URL http://www. 29
Kumar, P., Henikoff, S. & Ng, P. C. Predicting the effects of coding non-synonymous
variants on protein function using the sift algorithm. Nat Protoc 4, 1073–81 (2009).
URL 29
Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting functional effect of human
missense mutations using polyphen-2. Curr Protoc Hum Genet Chapter 7, Unit7
20 (2013). URL 29
[100] Liu, X., Jian, X. & Boerwinkle, E. dbnsfp: a lightweight database of human nonsynonymous snps and their functional predictions. Hum Mutat 32, 894–9 (2011). URL 29
[101] Gonzalez-Perez, A. & Lopez-Bigas, N. Improving the assessment of the outcome
of nonsynonymous snvs with a consensus deleteriousness score, condel. Am J Hum
Genet 88, 440–9 (2011). URL 29, 44
[102] Cingolani, P. et al. A program for annotating and predicting the effects of single
nucleotide polymorphisms, snpeff: Snps in the genome of drosophila melanogaster
strain w1118; iso-2; iso-3. Fly (Austin) 6, 80–92 (2012). URL http://www.ncbi.nlm. 29
[103] Pabinger, S. et al. A survey of tools for variant analysis of next-generation genome
sequencing data. Brief Bioinform 15, 256–78 (2014). URL
http://www.ncbi.nlm.nih.gov/pubmed/23341494. 29
[104] Sherry, S. T. et al. dbsnp: the ncbi database of genetic variation. Nucleic Acids Res
29, 308–11 (2001). URL 30, 117
[105] Flicek, P. et al. Ensembl 2013. Nucleic Acids Res 41, D48–55 (2013). URL
// 30, 118, 121
[106] Stenson, P. D. et al. The human gene mutation database (hgmd) and its exploitation
in the fields of personalized genomics and molecular evolution. Curr Protoc Bioinformatics Chapter 1, Unit1 13 (2012). URL
22948725. 30, 117
[107] Landrum, M. J. et al. Clinvar: public archive of relationships among sequence
variation and human phenotype. Nucleic Acids Res 42, D980–5 (2014). URL http:
// 30, 117
[108] Khatri, P., Sirota, M. & Butte, A. J. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 8, e1002375 (2012). URL 30, 33
[109] Freeling, M. & Subramaniam, S. Conserved noncoding sequences (cnss) in higher
plants. Curr Opin Plant Biol 12, 126–32 (2009). URL
pubmed/19249238. 30
[110] Bernstein, B. E. et al. An integrated encyclopedia of dna elements in the human genome. Nature 489, 57–74 (2012). URL
22955616. 31
[111] Cirulli, E. T. & Goldstein, D. B. Uncovering the roles of rare variants in common
disease through whole-genome sequencing. Nat Rev Genet 11, 415–25 (2010). URL 31
[112] Ashburner, M. et al. Gene ontology: tool for the unification of biology. the gene
ontology consortium. Nat Genet 25, 25–9 (2000). URL
http://www.ncbi.nlm.nih.gov/pubmed/10802651. 31
[113] Carbon, S. et al. Amigo: online access to ontology and annotation data. Bioinformatics 25, 288–9 (2009). URL 31
[114] Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z. Gorilla: a tool for discovery and visualization of enriched go terms in ranked gene lists. BMC Bioinformatics
10, 48 (2009). URL 31
[115] Zhou, X. & Su, Z. Easygo: Gene ontology-based annotation and functional enrichment analysis tool for agronomical species. BMC Genomics 8, 246 (2007). URL 31
[116] Doniger, S. W. et al. Mappfinder: using gene ontology and genmapp to create a
global gene-expression profile from microarray data. Genome Biol 4, R7 (2003).
URL 31
[117] Huang da, W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis
of large gene lists using david bioinformatics resources. Nat Protoc 4, 44–57 (2009).
URL 31
[118] Salwinski, L. et al. The database of interacting proteins: 2004 update. Nucleic Acids
Res 32, D449–51 (2004). URL 32
[119] Licata, L. et al. Mint, the molecular interaction database: 2012 update. Nucleic
Acids Res 40, D857–61 (2012). URL
[120] Kerrien, S. et al. The intact molecular interaction database in 2012. Nucleic Acids
Res 40, D841–6 (2012). URL 32
[121] Bader, G. D., Betel, D. & Hogue, C. W. Bind: the biomolecular interaction network
database. Nucleic Acids Res 31, 248–50 (2003). URL
pubmed/12519993. 32
[122] Breitkreutz, B. J., Stark, C. & Tyers, M. The grid: the general repository for
interaction datasets. Genome Biol 4, R23 (2003). URL
http://www.ncbi.nlm.nih.gov/pubmed/12620108. 32
[123] Keshava Prasad, T. S. et al. Human protein reference database–2009 update. Nucleic
Acids Res 37, D767–72 (2009). URL
[124] Lage, K. et al. A human phenome-interactome network of protein
complexes implicated in genetic disorders. Nat Biotechnol 25, 309–16 (2007). URL
journal/v25/n3/pdf/nbt1295.pdf. 32
[125] Portales-Casamar, E. et al. Jaspar 2010: the greatly expanded open-access database
of transcription factor binding profiles. Nucleic Acids Res 38, D105–10 (2010). URL 32
[126] Matys, V. et al. Transfac and its module transcompel: transcriptional gene regulation
in eukaryotes. Nucleic Acids Res 34, D108–10 (2006). URL http://www.ncbi.nlm. 32
[127] Lachmann, A. et al. Chea: transcription factor regulation inferred from integrating
genome-wide chip-x experiments. Bioinformatics 26, 2438–44 (2010). URL http:
// 32
[128] Qin, B. et al. Cistromemap: a knowledgebase and web server for chip-seq and
dnase-seq studies in mouse and human. Bioinformatics 28, 1411–2 (2012). URL 32
[129] Ziebarth, J. D., Bhattacharya, A. & Cui, Y. Ctcfbsdb 2.0: a database for ctcf-binding sites and genome organization. Nucleic Acids Res 41, D188–94 (2013).
URL 32
[130] Yang, J. H., Li, J. H., Jiang, S., Zhou, H. & Qu, L. H. Chipbase: a database for
decoding the transcriptional regulation of long non-coding rna and microrna genes
from chip-seq data. Nucleic Acids Res 41, D177–87 (2013). URL http://www.ncbi. 32
[131] Heinemeyer, T. et al. Databases on transcriptional regulation: Transfac, trrd and
compel. Nucleic Acids Res 26, 362–7 (1998). URL
pubmed/9399875. 32
[132] Messeguer, X. et al. Promo: detection of known transcription regulatory elements
using species-tailored searches. Bioinformatics 18, 333–4 (2002). URL http://www. 32
[133] Bailey, T. L. et al. Meme suite: tools for motif discovery and searching. Nucleic
Acids Res 37, W202–8 (2009). URL
[134] Chekmenev, D. S., Haid, C. & Kel, A. E. P-match: transcription factor binding site
search by combining patterns and weight matrices. Nucleic Acids Res 33, W432–7
(2005). URL 32
[135] Fazius, E., Shelest, V. & Shelest, E. Sitar: a novel tool for transcription factor
binding site prediction. Bioinformatics 27, 2806–11 (2011). URL http://www.ncbi. 32
[136] Glazko, G. V. & Emmert-Streib, F. Unite and conquer: univariate and multivariate
approaches for finding differentially expressed gene sets. Bioinformatics 25, 2348–54
(2009). URL 32
[137] Kanehisa, M. & Goto, S. Kegg: kyoto encyclopedia of genes and genomes. Nucleic
Acids Res 28, 27–30 (2000). URL 32
[138] D’Eustachio, P. Reactome knowledgebase of human biological pathways and processes. Methods Mol Biol 694, 49–61 (2011). URL
pubmed/21082427. 32
[139] MacRae, C. A. Action and the actionability in exome variation. Circ Cardiovasc
Genet 5, 597–8 (2012). URL 32
[140] Kel, A. et al. Explain: finding upstream drug targets in disease gene regulatory
networks. SAR QSAR Environ Res 19, 481–94 (2008). URL http://www.ncbi.nlm. 32
[141] Warde-Farley, D. et al. The genemania prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 38,
W214–20 (2010). URL 32
[142] Franceschini, A. et al. String v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 41, D808–15 (2013). URL 32
[143] Chen, E. Y. et al. Enrichr: interactive and collaborative html5 gene list enrichment
analysis tool. BMC Bioinformatics 14, 128 (2013). URL
http://www.ncbi.nlm.nih.gov/pubmed/23586463. 32
[144] Shannon, P. et al. Cytoscape: a software environment for integrated
models of biomolecular interaction networks. Genome Res 13, 2498–504 (2003). URL
content/13/11/2498.full.pdf. 32
[145] Reich, M. et al. Genepattern 2.0. Nat Genet 38, 500–1 (2006). URL 33
[146] Zhao, J., Gupta, S., Seielstad, M., Liu, J. & Thalamuthu, A. Pathway-based analysis
using reduced gene subsets in genome-wide association studies. BMC Bioinformatics
12, 17 (2011). URL 33
[147] Dold, S., Wjst, M., von Mutius, E., Reitmeir, P. & Stiepel, E. Genetic risk for
asthma, allergic rhinitis, and atopic dermatitis. Arch Dis Child 67, 1018–22 (1992).
URL 33
[148] O’Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical
implications for exome and genome sequencing. Genome Med 5, 28 (2013). URL 34
[149] Nabholz, C. E. & von Overbeck, J. Gene-environment interactions and the complexity of human genetic diseases. J Insur Med 36, 47–53 (2004). URL http:
// 35
[150] John, B. & Lewis, K. R. Chromosome variability and geographic distribution in
insects. Science 152, 711–21 (1966). URL
17797432. 35
[151] Mannino, D. M. et al. Surveillance for asthma–united states, 1980-1999. MMWR
Surveill Summ 51, 1–13 (2002). URL
[152] Martinez, F. D. et al. Asthma and wheezing in the first six years of life. the group
health medical associates. N Engl J Med 332, 133–8 (1995). URL http://www.ncbi. 39
[153] Akhabir, L. & Sandford, A. J. Genome-wide association studies for discovery of genes involved in asthma. Respirology 16, 396–406 (2011).
[154] Bisgaard, H., Bonnelykke, K. & Stokholm, J. Immune-mediated diseases and microbial exposure in early life. Clin Exp Allergy 44, 475–81 (2014). URL http:
// 39
[155] Cookson, W. O. & Moffatt, M. F. Asthma: an epidemic in the absence of infection?
Science 275, 41–2 (1997). URL 39, 53
[156] Mantzouranis, E., Papadopouli, E. & Michailidi, E. Childhood asthma: recent
developments and update. Curr Opin Pulm Med 20, 8–16 (2014). URL http:
// 39
[157] Gilliland, F. D. et al. Effects of glutathione s-transferase m1, maternal smoking
during pregnancy, and environmental tobacco smoke on asthma and wheezing in
children. Am J Respir Crit Care Med 166, 457–63 (2002). URL http://www.ncbi. 39
[158] Young, S. et al. The influence of a family history of asthma and parental smoking
on airway responsiveness in early infancy. N Engl J Med 324, 1168–73 (1991). URL 39
[159] Illi, S. et al. Perennial allergen sensitisation early in life and chronic asthma in
children: a birth cohort study. Lancet 368, 763–70 (2006). URL http://www.ncbi. 39
[160] Weitzman, M., Gortmaker, S. & Sobol, A. Racial, social, and environmental risks
for childhood asthma. Am J Dis Child 144, 1189–94 (1990). URL http://www.ncbi. 40
[161] Von Ehrenstein, O. S. et al. Reduced risk of hay fever and asthma among children
of farmers. Clin Exp Allergy 30, 187–93 (2000). URL
pubmed/10651770. 40
[162] Moffatt, M. F. et al. A large-scale, consortium-based genomewide association study
of asthma. N Engl J Med 363, 1211–21 (2010). URL
pubmed/20860503. 40, 55
[163] Potaczek, D. P. et al. Different fcer1a polymorphisms influence ige levels in asthmatics and non-asthmatics. Pediatr Allergy Immunol 24, 441–9 (2013). URL 40
[164] Sleiman, P. M. et al. Variants of dennd1b associated with asthma in children. N
Engl J Med 362, 36–44 (2010). URL
40, 55
[165] Murphy, S. K. & Hollingsworth, J. W. Stress: a possible link between genetics,
epigenetics, and childhood asthma. Am J Respir Crit Care Med 187, 563–4 (2013).
URL 40
[166] Hancock, D. B. et al. Meta-analyses of genome-wide association studies identify
multiple loci associated with pulmonary function. Nat Genet 42, 45–52 (2010).
URL 41
[167] Repapi, E. et al. Genome-wide association study identifies five loci associated with
lung function. Nat Genet 42, 36–44 (2010). URL
pubmed/20010834. 41
[168] Martinez, F. D. Managing childhood asthma: challenge of preventing exacerbations.
Pediatrics 123 Suppl 3, S146–50 (2009). URL
19221157. 43
[169] Schatz, M. et al. Relationships among quality of life, severity, and control measures
in asthma: An evaluation using factor analysis. Journal of Allergy and Clinical
Immunology 115, 1049–1055 (2005). URL <GotoISI>://WOS:000229055100023. 43
[170] Takeichi, M. Cadherins: a molecular family important in selective cell-cell adhesion.
Annu Rev Biochem 59, 237–52 (1990). URL
2197976. 44
[171] Wheelock, M. J. & Johnson, K. R. Cadherins as modulators of cellular phenotype. Annu Rev Cell Dev Biol 19, 207–35 (2003). URL
http://www.ncbi.nlm.nih.gov/pubmed/14570569. 44
[172] Nawijn, M. C., Hackett, T. L., Postma, D. S., van Oosterhout, A. J. & Heijink, I. H.
E-cadherin: gatekeeper of airway mucosa and allergic sensitization. Trends Immunol
32, 248–55 (2011). URL 44
[173] Koppelman, G. H. et al. Identification of pcdh1 as a novel susceptibility gene for
bronchial hyperresponsiveness. Am J Respir Crit Care Med 180, 929–35 (2009).
URL 44
[174] Yanai, I. et al. Genome-wide midrange transcription profiles reveal expression level
relationships in human tissue specification. Bioinformatics 21, 650–9 (2005). URL 44
[175] Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20
regions of the human cns. Neurogenetics 7, 67–80 (2006). URL http://www.ncbi.nlm. 44
[176] Watkins, N. A. et al. A haematlas: characterizing gene expression in differentiated human blood cells. Blood 113, e1–9 (2009). URL
pubmed/19228925. 44
[177] Ross, A. J., Dailey, L. A., Brighton, L. E. & Devlin, R. B. Transcriptional profiling
of mucociliary differentiation in human airway epithelial cells. Am J Respir Cell Mol
Biol 37, 169–85 (2007). URL 44
[178] Grad, R. & Morgan, W. J. Long-term outcomes of early-onset wheeze and asthma.
J Allergy Clin Immunol 130, 299–307 (2012). URL
pubmed/22738675. 53
[179] Nievas, I. F. & Anand, K. J. Severe acute asthma exacerbation in children: a stepwise
approach for escalating therapy in a pediatric intensive care unit. J Pediatr Pharmacol Ther 18, 88–104 (2013). URL
[180] Bisgaard, H. et al. Chromosome 17q21 gene variants are associated with asthma and
exacerbations but not atopy in early childhood. Am J Respir Crit Care Med 179,
179–85 (2009). URL 54, 55
[181] Granell, R. et al. Examination of the relationship between variation at 17q21 and
childhood wheeze phenotypes. J Allergy Clin Immunol 131, 685–94 (2013). URL 54
[182] Mamanova, L. et al. Target-enrichment strategies for next-generation sequencing.
Nat Methods 7, 111–8 (2010). URL 54
[183] Moffatt, M. F. et al. Genetic linkage of t-cell receptor alpha/delta complex to specific
ige responses. Lancet 343, 1597–600 (1994). URL
pubmed/7911920. 54
[184] Moffatt, M. F., Traherne, J. A., Abecasis, G. R. & Cookson, W. O. Single nucleotide
polymorphism and linkage disequilibrium within the tcr alpha/delta locus. Hum Mol
Genet 9, 1011–9 (2000). URL 54
[185] Palmer, C. N. et al. Common loss-of-function variants of the epidermal barrier
protein filaggrin are a major predisposing factor for atopic dermatitis. Nat Genet
38, 441–6 (2006). URL 54
[186] Ferreira, M. A. et al. Identification of il6r and chromosome 11q13.5 as risk loci
for asthma. Lancet 378, 1006–14 (2011). URL
21907864. 55
[187] Hirota, T. et al. Genome-wide association study identifies three new susceptibility
loci for adult asthma in the japanese population. Nat Genet 43, 893–6 (2011). URL 56
[188] Torgerson, D. G. et al. Meta-analysis of genome-wide association studies of asthma
in ethnically diverse North American populations. Nat Genet 43, 887–92 (2011). URL 56
[189] Li, X. et al. The C11orf30-LRRC32 region is associated with total serum IgE levels in
asthmatic patients. J Allergy Clin Immunol 129, 575–8, 578 e1–9 (2012). URL 56
[190] Marenholz, I. et al. The eczema risk variant on chromosome 11q13 (rs7927894) in
the population-based ALSPAC cohort: a novel susceptibility factor for asthma and hay
fever. Hum Mol Genet 20, 2443–9 (2011). URL
21429916. 56
[191] Bonnelykke, K. et al. Meta-analysis of genome-wide association studies identifies
ten loci influencing allergic sensitization. Nat Genet 45, 902–6 (2013). URL http:
// 56
[192] Li, X. et al. Genome-wide association study of asthma identifies RAD50-IL13 and
HLA-DR/DQ regions. J Allergy Clin Immunol 125, 328–335 e11 (2010). URL http:
// 56
[193] Bonnelykke, K. et al. A genome-wide association study identifies CDHR3 as a susceptibility locus for early childhood asthma with severe exacerbations. Nat Genet 46,
51–5 (2014). URL 56, 126
[194] Norgaard-Pedersen, B. & Hougaard, D. M. Storage policies and use of the Danish
Newborn Screening Biobank. J Inherit Metab Dis 30, 530–6 (2007).
MEDLINE:17632694. 56
[195] Hollegaard, M. V. et al. Genome-wide scans using archived neonatal dried blood
spot samples. BMC Genomics 10, 297 (2009). URL
pubmed/19575812. 56
[196] Hollegaard, M. V. et al. Robustness of genome-wide scanning using archived dried
blood spot samples as a DNA source. BMC Genet 12, 58 (2011). URL http://www. 56
[197] Jorgensen, T. J. et al. Hypothesis-driven candidate gene association studies: practical design and analytical considerations. Am J Epidemiol 170, 986–93 (2009). URL 58
[198] Wjst, M. et al. Asthma families show transmission disequilibrium of gene variants
in the vitamin d metabolism and signalling pathway. Respir Res 7, 60 (2006). URL 58
[199] Hwang, S. et al. A protein interaction network associated with asthma. J Theor
Biol 252, 722–31 (2008). URL 58
[200] Liu, Y. & Liu, S. Protein-protein interaction network analysis of children atopic
asthma. Eur Rev Med Pharmacol Sci 16, 867–72 (2012). URL http://www.ncbi.nlm. 58
[201] Indap, A. R., Cole, R., Runge, C. L., Marth, G. T. & Olivier, M. Variant discovery
in targeted resequencing using whole genome amplified DNA. BMC Genomics 14,
468 (2013). URL 59
[202] Longmate, J. A., Larson, G. P., Krontiris, T. G. & Sommer, S. S. Three ways of
combining genotyping and resequencing in case-control association studies. PLoS
One 5, e14318 (2010). URL 60
[203] Rivas, M. A. et al. Deep resequencing of GWAS loci identifies independent rare variants
associated with inflammatory bowel disease. Nat Genet 43, 1066–73 (2011). URL 60
[204] Tabor, H. K., Risch, N. J. & Myers, R. M. Candidate-gene approaches for studying
complex genetic traits: practical considerations. Nat Rev Genet 3, 391–7 (2002).
URL 60
[205] Adeyemo, A. & Rotimi, C. Genetic variants associated with complex human diseases
show wide variation across multiple populations. Public Health Genomics 13, 72–9
(2010). URL 60
[206] Marigorta, U. M. & Navarro, A. High trans-ethnic replicability of GWAS results implies
common causal variants. PLoS Genet 9, e1003566 (2013). URL http://www.ncbi.nlm. 60
[207] Yang, X. Use of functional genomics to identify candidate genes underlying human
genetic association studies of vascular diseases. Arterioscler Thromb Vasc Biol 32,
216–22 (2012). URL 60
[208] Rosenwasser, L. J. & Borish, L. Genetics of atopy and asthma: the rationale behind
promoter-based candidate gene studies (IL-4 and IL-10). Am J Respir Crit Care Med
156, S152–5 (1997). URL 60
[209] Kaimal, V. et al. Integrative systems biology approaches to identify and prioritize
disease and drug candidate genes. Methods Mol Biol 700, 241–59 (2011). URL 60
[210] Middleton, F. A. et al. Integrating genetic, functional genomic, and bioinformatics
data in a systems biology approach to complex diseases: application to schizophrenia.
Methods Mol Biol 401, 337–64 (2007). URL
18368374. 60
[211] Flier, J. S. Obesity wars: molecular progress confronts an expanding epidemic. Cell
116, 337–50 (2004). URL 77
[212] Spiegelman, B. M. & Flier, J. S. Obesity and the regulation of energy balance. Cell
104, 531–43 (2001). URL 77
[213] Friedman, J. M. A war on obesity, not the obese. Science 299, 856–8 (2003). URL 77
[214] O’Rahilly, S. Human genetics illuminates the paths to metabolic disease. Nature
462, 307–14 (2009). URL 77
[215] Rosen, E. & Spiegelman, B. What we talk about when we talk about fat. Cell 156,
20–44 (2014). URL
78, 79
[216] Nedergaard, J., Bengtsson, T. & Cannon, B. Three years with adult human brown
adipose tissue. Ann N Y Acad Sci 1212, E20–36 (2010). URL http://www.ncbi.nlm. 78
[217] Symonds, M. E. Brown adipose tissue growth and development. Scientifica (Cairo)
2013, 305763 (2013). URL 79
[218] Rhodes, P. et al. Adult-onset obesity reveals prenatal programming of glucose-insulin
sensitivity in male sheep nutrient restricted during late gestation. PLoS One 4, e7393
(2009). URL 80
[219] Rankinen, T. et al. The human obesity gene map: the 2005 update. Obesity (Silver
Spring) 14, 529–644 (2006). URL 80
[220] Kunej, T. et al. Obesity gene atlas in mammals. J Genomics 1, 45–55 (2012). 80,
[221] Bell, C. G. et al. Integrated genetic and epigenetic analysis identifies haplotype-specific methylation in the FTO type 2 diabetes and obesity susceptibility locus. PLoS
One 5, e14040 (2010). URL 80
[222] Cordero, P. et al. Leptin and TNF-alpha promoter methylation levels measured by
MSP could predict the response to a low-calorie diet. J Physiol Biochem 67, 463–70
(2011). URL 80
[223] Digel, W. & Lubbert, M. DNA methylation disturbances as novel therapeutic target
in lung cancer: preclinical and clinical results. Crit Rev Oncol Hematol 55, 1–11
(2005). URL 80
[224] Martinez, J. A., Milagro, F. I., Claycombe, K. J. & Schalinske, K. L. Epigenetics in
adipose tissue, obesity, weight loss, and diabetes. Adv Nutr 5, 71–81 (2014). URL 80
[225] Jiang, Y. H., Bressler, J. & Beaudet, A. L. Epigenetics and human disease. Annu
Rev Genomics Hum Genet 5, 479–510 (2004). URL
pubmed/15485357. 80
[226] The International HapMap Project. Nature 426, 789–96 (2003). URL 117
[227] Abecasis, G. R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012). URL
23128226. 117
[228] Forbes, S. A. et al. COSMIC: mining complete cancer genomes in the Catalogue of
Somatic Mutations in Cancer. Nucleic Acids Res 39, D945–50 (2011). URL http:
// 117
[229] Amladi, S. Online Mendelian Inheritance in Man ‘OMIM’. Indian J Dermatol Venereol
Leprol 69, 423–4 (2003). URL 117, 122
[230] Rappaport, N. et al. MalaCards: an integrated compendium for diseases and their
annotation. Database (Oxford) 2013, bat018 (2013). URL http://www.ncbi.nlm.nih.gov/pubmed/23584832. 117, 122
[231] Safran, M. et al. GeneCards version 3: the human gene integrator. Database (Oxford)
2010, baq020 (2010). URL 117
[232] Hindorff, L. A., MacArthur, J. (European Bioinformatics Institute), Morales, J. (European Bioinformatics Institute), Junkins, H. A., Hall, P. N., Klemm, A. K. & Manolio, T. A. A catalog of published genome-wide association studies.
Available at: (2013). 117
[233] Cariaso, M. & Lennon, G. SNPedia: a wiki supporting personal genome annotation,
interpretation and analysis. Nucleic Acids Res 40, D1308–12 (2012). URL http:
// 117, 122
[234] Li, R. et al. Building the sequence map of the human pan-genome. Nat Biotechnol
28, 57–63 (2010). URL 118
[235] Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
arXiv preprint arXiv:1303.3997 (2013). 118
[236] Agresti, A. Approximate is better than “exact” for interval estimation of binomial
proportions. The American Statistician 52, 119 (1998). 119
[237] Hervella, M. et al. The loss of functional caspase-12 in Europe is a pre-Neolithic event.
PLoS One 7, e37022 (2012). URL 119
[238] Blair, D. R. et al. A nondegenerate code of deleterious variants in Mendelian loci
contributes to complex disease risk. Cell 155, 70–80 (2013). URL http://www.ncbi. 119
[239] Ward, L. D. & Kellis, M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants.
Nucleic Acids Res 40, D930–4 (2012). URL
22064851. 120
[240] MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human
protein-coding genes. Science 335, 823–8 (2012). URL
pubmed/22344438. 120
[241] Wheeler, D. A. et al. The complete genome of an individual by massively parallel
DNA sequencing. Nature 452, 872–6 (2008). URL
18421352. 121
[242] Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5,
e254 (2007). URL 121
[243] Rasmussen, M. et al. Ancient human genome sequence of an extinct Palaeo-Eskimo.
Nature 463, 757–62 (2010). URL 121
[244] Rasmussen, M. et al. An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334, 94–8 (2011). URL
21940856. 121
[245] Olalde, I. et al. Derived immune and ancestral pigmentation alleles in a 7,000-year-old Mesolithic European. Nature 507, 225–8 (2014). URL http://www.ncbi.nlm.nih.gov/pubmed/24463515. 121
[246] Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.
Nucleic Acids Res 42, D1001–6 (2014). URL
24316577. 122
[247] Sanghera, D. K. & Blackett, P. R. Type 2 diabetes genetics: beyond GWAS. J Diabetes
Metab 3 (2012). URL 122
[248] Rasmussen, M. et al. The genome of a Late Pleistocene human from a Clovis burial
site in western Montana. Nature 506, 225–9 (2014). URL http://www.ncbi.nlm.nih.gov/pubmed/24522598. 122
[249] Kin, T. & Ono, Y. Idiographica: a general-purpose web application to build idiograms on-demand for human, mouse and rat. Bioinformatics 23, 2945–6 (2007).
URL 122
[250] Howard, B. V. et al. Rising tide of cardiovascular disease in American Indians. The
Strong Heart Study. Circulation 99, 2389–95 (1999). URL http://www.ncbi.nlm.nih.gov/pubmed/10318659. 122
[251] Lee, E. T. et al. Diabetes and impaired glucose tolerance in three American Indian
populations aged 45–74 years. The Strong Heart Study. Diabetes Care 18, 599–610
(1995). URL 122
[252] Sinclair, K. A., Bogart, A., Buchwald, D. & Henderson, J. A. The prevalence of
metabolic syndrome and associated risk factors in Northern Plains and Southwest
American Indians. Diabetes Care 34, 118–20 (2011). URL http://www.ncbi.nlm.nih.gov/pubmed/20864516. 122
[253] Chen, R. et al. Type 2 diabetes risk alleles demonstrate extreme directional differentiation among human populations, compared to other diseases. PLoS Genet 8,
e1002621 (2012). URL 122, 126
[254] Chakravarthy, M. V. & Booth, F. W. Eating, exercise, and “thrifty” genotypes: connecting the dots toward an evolutionary understanding of modern chronic diseases.
J Appl Physiol (1985) 96, 3–10 (2004). URL
14660491. 122
[255] Corbo, R. M. & Scacchi, R. Apolipoprotein E (APOE) allele distribution in the world.
Is APOE*4 a ‘thrifty’ allele? Ann Hum Genet 63, 301–10 (1999). URL http://www. 123
[256] Tinanoff, N. Cleft lip and palate. Nelson Textbook of Pediatrics, 18th edition.
Philadelphia: Saunders Elsevier 1532–1533 (2007). 123
[257] MacArthur, D. G. et al. Loss of ACTN3 gene function alters mouse muscle metabolism
and shows evidence of positive selection in humans. Nat Genet 39, 1261–5 (2007).
URL 123
[258] Belsky, D. W. et al. Development and evaluation of a genetic risk score for obesity. Biodemography Soc Biol 59, 85–100 (2013). URL
pubmed/23701538. 125
[259] Steckel, R. H. & Rose, J. C. The backbone of history: health and nutrition in the
Western Hemisphere, vol. 2 (Cambridge University Press, 2002). ISBN 0521801672.
[260] Corona, E. et al. Analysis of the genetic basis of disease in the context of worldwide
human relationships and migration. PLoS Genet 9, e1003447 (2013). URL http:
// 126
[261] DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium et al. Genome-wide trans-ancestry meta-analysis
provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat
Genet 46, 234–44 (2014). URL 126
[262] Benfey, P. N. & Mitchell-Olds, T. From genotype to phenotype: systems biology
meets natural variation. Science 320, 495–7 (2008). URL http://www.ncbi.nlm.nih.gov/pubmed/18436781. 126
[263] Stergachis, A. B. et al. Exonic transcription factor binding directs codon choice and
affects protein evolution. Science 342, 1367–72 (2013). URL http://www.ncbi.nlm. 131
[264] Ritchie, G. R., Dunham, I., Zeggini, E. & Flicek, P. Functional annotation of noncoding sequence variants. Nat Methods 11, 294–6 (2014). URL http://www.ncbi.nlm. 131
[265] Perez-Llamas, C. & Lopez-Bigas, N. Gitools: analysis and visualisation of genomic
data using interactive heat-maps. PLoS One 6, e19541 (2011). URL http://www. 131
[266] Smoot, M. E., Ono, K., Ruscheinski, J., Wang, P. L. & Ideker, T. Cytoscape 2.8: new
features for data integration and network visualization. Bioinformatics 27, 431–2
(2011). URL 131
[267] Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics.
Genome Res 19, 1639–45 (2009). URL
[268] Brown, K. R. et al. NAViGaTOR: Network Analysis, Visualization and Graphing
Toronto. Bioinformatics 25, 3327–9 (2009). URL
pubmed/19837718. 131