GENETICAL GENOMICS APPLIED TO HAEMOPOIETIC DEVELOPMENT USING MOUSE MODELS
BIOINFORMATICS RESEARCH INTO ILLUMINA MICROARRAY DATA

Name: Danny Arends
Student number: S1276891
Email: [email protected]
Project: Bachelor Thesis: Life science & Technology - GPB2007
Date:

Table of Content

Genetical genomics applied to haemopoietic development using mouse models
Content
Acknowledgment
Abstract
Dutch abstract
Introduction
Materials & Method
    Materials
    Method
Results
Discussion
Future perspectives
Literature
Web resources

Acknowledgment

During the writing of this thesis a lot of support was offered by a group of wonderful people, who helped and guided me. Without them I do not think this work would have been possible. I learned a lot during the last few months that I have been in this group, and I feel challenged and stimulated to continue in this field and to contribute to bioinformatics as a developing field of research.

I would like to thank the following people from the Groningen Bioinformatics Centre:

Ritsert Jansen - For welcoming me into the Bioinformatics group, and giving me the opportunity to do this bachelor thesis.
Yang Li - For supervising me during this project. Also for the open door, and the help on writing this thesis.
Bruno Tesson - For your help in finding errors, pointing out the obvious and useful input.
Morris Swertz - For a listening ear, feedback and enjoyable lunch breaks.
Gonzalo Vera - For help on building R-packages and many laughs in the sun.

I also thank all the other staff and students at the GBIC department who enabled me to do this work and helped me with my questions, or asked questions that made it necessary to think about answers and possible problems.

I would also like to extend my thanks to the Department of Cell Biology, Stem Cell Biology:

Gerald de Haan - For trusting me with the Illumina datasets, and insight into the experiment.
Leonid v. Bystrykh - For the useful feedback and biological focus, and questions.

Abstract

Illumina microarray technology is a novel way of doing whole genome analysis. These arrays are manufactured in a completely new fashion. Oligonucleotide probes are attached to glass beads, pooled together and then distributed across an etched surface, where the beads randomly locate themselves in the etched wells. As a result of this manufacturing process every bead array is unique. To overcome the problem that no two arrays are the same, decoding is needed to identify the bead located at each position. This decoding is done by sequential hybridization. Because Illumina arrays are completely different from conventional (Affymetrix or spotted oligo) microarrays, bioinformatics tools developed for those platforms are not (directly) suited to Illumina arrays. In this thesis a method is proposed (and implemented) to automatically generate data quality control plots. The method also provides pre-processing support and generates the different files used in the follow-up analysis of the data. To develop this method, Illumina array data for 62 samples (Mus musculus) were provided by the Department of Stem Cell Biology at the UMCG.
These 62 samples were drawn from 4 different cell types (stem cell, progenitor, erythrocyte and granulocyte) to allow differential expression of genes to be observed. The biological question in this research, "How do haemopoietic cells differentiate into other cell types?", is studied by many groups all over the world. The new approach to uncovering regulatory elements of differentiation is to sample Recombinant Inbred Lines (RILs). These inbred Mus musculus strains have the unique property that their genome is a mosaic of the parental strains. The animals were genotyped at specific genome locations using molecular markers. Because there are only 2 origins of genetic diversity (the parents), a map can be made for each RIL. On these RIL populations genetical genomics can be performed. Genetical genomics is the mapping of a trait (in our case: expression) onto the most likely molecular marker (expression QTL); this enables researchers to find regulators of gene expression and genetic hotspots governing differentiation. This kind of analysis (together with differential expression) is also provided in the proposed method. The method has been implemented in the R programming language to give novice users easy access, and at the same time to give more experienced users a tool that helps them analyze Illumina data in less time and with greater accuracy.

Dutch abstract (English translation)

Illumina microarrays offer a new way of performing whole genome analysis. Oligonucleotide probes are attached to glass beads, which are then mixed together and spread over a carrier surface that contains very small wells made with etching techniques. The beads from the bead pool fall into the wells and distribute themselves randomly over this surface. Producing microarrays in this way means that every microarray is unique. To solve the problem that no two microarrays are the same, a decoding step has to be applied to establish the identity of each bead. Decoding is done by repeatedly hybridizing the microarray with dye-labeled oligonucleotide decoders. Because Illumina arrays are completely different from ordinary (e.g. Affymetrix or printed oligonucleotide) arrays, bioinformatics tools made for those technologies are not (directly) usable for the analysis of Illumina arrays. In this bachelor thesis a method is proposed (and provided) to automatically perform various data processing steps on Illumina microarrays. These steps comprise data quality control plots, pre-processing of the data and export to the various programs used in further analysis of the data. To develop this method 62 samples (Mus musculus) were analyzed using Illumina Sentrix microarrays. The samples were made available by the Department of Stem Cell Biology at the UMCG. These 62 samples came from 4 different cell types (stem cell, progenitor, erythrocyte and granulocyte) in order to observe differential expression of genes. The biological question behind this research, "How do haemopoietic cells differentiate into other cell types?", is a research question on which groups from all over the world are working. The novelty of this research for finding genetic regulators is the use of recombinant inbred lines (RILs); these inbred mouse strains have the unique property that they are genetically stable and a mosaic of the genomes of their parents. Using molecular markers, the genetic mosaic of these mouse strains has been mapped. Because only two parents contribute to the genetic diversity, this map is extremely well suited to genetical genomics. Genetical genomics is the field in which a trait (in our case expression) is associated with this mosaic of molecular markers. This enables researchers to find regulators of gene expression and/or genetic locations that have a regulatory function in development and differentiation. These kinds of analysis methods (together with differential expression) are also present in the proposed method. The method has been implemented in the R programming language to be easily accessible to the 'biological' user, but also so that experts have a tool to analyze Illumina microarray data in less time and with greater accuracy.

Introduction

Expression profiling provides researchers with a valuable tool for understanding the life sciences. With the complete sequence of the Mus musculus genome available, and variation of the genome still under active investigation, it is possible for researchers to find and map individual genetic variation, and to use this information to associate genetic differences with diseases or to unravel the genetics underlying the development of the diverse tissues that we see in eukaryotes. This field of research is currently called genetical genomics[1-3]. Mapping of these complex traits is done using a specific kind of animal model, recombinant inbred lines[4] (RILs). RILs are homozygous inbred mouse strains. They are genetically stable and a mosaic of the parental genomes. This mosaic can be unraveled by using molecular markers to determine the parental origin of each marker. When a quantitative trait is compared with these marker maps, correlation between the map and the trait can indicate that the mapped location is 'responsible' for the trait. This is called QTL mapping. QTL mapping provides researchers with a valuable tool for unraveling the genetic variation responsible for numerous traits seen in nature.
With the ability to measure thousands of genes at the same time, due to improvements in microarray technology, another kind of trait mapping can now take place: expression QTL mapping. Mapping gene expression data onto the marker data of a RIL can then be used to draw biologically important conclusions[2]. Because the quality of the data influences the results of statistical analysis, pre-processing and data handling should be the highest priority of every researcher currently involved in the field of theoretical biology.

To uncover the genetics behind the developmental stages of haemopoietic stem cells in Mus musculus, the Department of Stem Cell Biology at the UMCG in Groningen set up a genetical genomics experiment. Four different types of cells were extracted from different RILs; these cells represent four critical developmental stages in haemopoietic differentiation (see Fig 1). A stem cell differentiates into a progenitor cell, which then differentiates into either a granulocyte or an erythrocyte.

Fig 1. Differentiation model. From stem cell to either erythroid or granulocyte.

Cells were extracted from Mus musculus of the BXD recombinant inbred line. The BXD strain is a recombinant inbred line derived by crossing C57BL/6J (B6) and DBA/2J (D2) and then inbreeding the progeny for many years. Around 100 different RIL strains of this cross have been made from the late 1970s until now. One of the advantages of this strain is that both parental strains have been sequenced. Cells extracted from several of these RIL strains were separated using a cell sorter based on expressed surface proteins. Next, mRNA was extracted from these cells and prepared for analysis on Illumina[5] microarrays. After preparation and hybridization the arrays were scanned, and the images were analyzed to obtain raw intensity values (bead level). The datasets from this experiment are the focus of this thesis.
The goal of this experiment was to find regulatory switches that govern differentiation into either granulocyte or erythrocyte. Such a switch can be a master gene controlling differentiation (e.g. HOX genes), but it could also be a complex trait governed by many genes and their interactions. When it is a complex trait, genetical genomics research will focus on the genes involved in differentiation and also on the pathways in which they participate.

Data quality control will be done with standard bioinformatics tools, adapted to suit the needs of Illumina microarrays. The results from this analysis will be the basis for further research into the development of a protocol for processing Illumina array data. A genetical genomics approach will be used to find regulatory switches governing differentiation.

Other research in this field has been done by the UMCG at the level of stem cells using Affymetrix technology[1, 6]. Groups around the world are working on haemopoietic stem cells, studying not only their differentiation into other cell types but also their ability to proliferate without exhaustion[7]. Research into differentiation using all four cell types and genetical genomics is a novel approach. The genetical genomics approach has been used successfully before on other cell types[2] to find regulatory locations.

Research into the decoding of bead arrays and the data processing afterwards is still an active field[8-12], because Illumina is a new technology of which not every detail is fully known and understood. The design of microarray experiments for this kind of research is also an active field, including research into optimal design[13], because much analytical power is to be gained from better microarray design.
Materials & Method

Materials

AriseSoft Winsyntax 2.0 - Freeware text editor
RGui 2.5.0 - R programming language interpreter
Illumina Sentrix beadarray - 13 pieces
Illumina summary data on 62 samples:
    20 Stem cell (BXD SCA+ KIT+)
    14 Progenitor (BXD SCA- KIT+)
    14 Erythrocyte (BXD TER119+)
    14 Granulocyte (BXD GR1+)
Annotation file (provided by G. de Haan)
Array/Sample description files

The following analysis tools / R-packages were used during analysis of the data:

Illumina GUI - Graphical user interface for creating Bead Summary Files
Bioconductor affy - Standard bioinformatics toolkit
GO - Gene ontology annotation
goCluster - Gene ontology clustering
gplots - Advanced plotting options for R
IlluminaMousev1p1 - Annotation provided by Illumina as an R-package
matrix2png - Used for some graphics
Cytoscape - Visualization of networks
BINGO - Plug-in for Cytoscape: GO annotation
BioNetBuilder - Plug-in for Cytoscape: gene interaction networks

Method

Different techniques for manufacturing a microarray exist; a new and emerging technology is the Sentrix beadchip[14] developed by Illumina. One of the main advantages of Illumina Sentrix beadarrays is the level of miniaturization, which enables researchers to monitor more genes/probes. Manufacturing of Illumina Sentrix arrays starts with the synthesis of long oligonucleotides. These oligonucleotides (probes) are 50 bases long and serve as targets for cDNA synthesized from mRNA extracted from the biological samples. A 13-base 'barcode' sequence is also attached to each oligonucleotide. The oligonucleotides are then attached to glass beads[15]. With this technique each bead is loaded with hundreds of thousands of covalently bound oligonucleotide probes, of which ~90 % are available for hybridization[16]. The beads with the probes attached are pooled together in bead pools and then distributed across an etched microwell substrate.
This is a carrier surface with several thousand (up to millions of) etched wells, made using the latest technologies from different industries (fiber optics, the computer industry and silicon manufacturing). The beads randomly assemble themselves in the wells, creating a unique microarray.

Because it is not known which probe is at a certain coordinate [x,y] on the microarray, decoding has to take place. This is done using dye-labeled oligonucleotide decoders. Sequential hybridization of these oligonucleotide decoders with the probes on the microarray reveals the 'bead type'. When the decoder DNA binds to the barcode on the Illumina probe it gives a signal. Illumina decoding uses 2 color signals (Cy3 and Cy5) and a not-hybridized signal. By using multiple rounds of hybridization, scanning and de-hybridization, a signal library can be built for each probe. The order of ON and OFF signals (and the colors of the ON signals) determines which type of bead is positioned at a certain coordinate [x,y]. A more detailed introduction to decoding randomly ordered beads is given by Gunderson et al. 2004[12]. Possible decoding problems and solutions are also discussed there in more detail.

After decoding, a file with the location and identity of all the probes on that array is delivered with the array. The array can now be used for hybridization with the sample of interest. After hybridization the array is scanned, and the image obtained is decoded into data files which contain spot intensity data, spot probe type (annotated from the data file obtained during decoding) and detection probability values. Statistical analysis can be done on these files to uncover genetic regulatory networks.

STATISTICAL ANALYSIS: DIFFERENTIALLY EXPRESSED GENES

From the normalized dataset a list of differentially expressed probes was made by applying a one-way analysis of variance (ANOVA) testing procedure to the Illumina intensity data.
The outcome $y_i$ in this model is the log2 transformed intensity data, and there are four levels in our model (each cell type is an ANOVA level). The ANOVA model for detecting differential expression:

$y_i = \mu + C_i + \varepsilon_i$

Here $\mu$ is the mean, $C_i$ is the cell type effect and $\varepsilon_i$ is the residual error. The resulting test statistic from the ANOVA model follows an F-distribution, and thus we can test for differences between group means. The ANOVA analysis shows which probes are significantly differently expressed between the groups (Stem, Progenitor, Erythrocyte or Granulocyte). The assumptions under which this analysis is valid are independent cases, normally distributed data and approximately equal variance in the groups. Because of the pre-processing steps (which include normalization of the data) the variance is about equal for all groups.

The probes suspected of being differentially expressed were selected, and a T-test was used to determine specificity for a certain group. So for each probe found in the ANOVA analysis four T-tests are performed to find group specificity. The T-test used:

$T = \dfrac{\bar{X} - \bar{Y}}{S \sqrt{\frac{1}{n} + \frac{1}{m}}}$

In this model $T$ is the calculated test statistic, $\bar{X}$ and $\bar{Y}$ are the means of the groups, and $n$ and $m$ are the numbers of observations in those groups. $S$ is the weighted (pooled) standard deviation; the data are assumed to be normally distributed with approximately equal variance in both groups.

STATISTICAL ANALYSIS: EXPRESSION QTL MAPPING

Genetical genomics is the study of gene expression combined with the mapping of this expression onto positions on the chromosome. The basic model of genetical genomics is simple: $P = G + E$. A trait is composed of 2 parts, a genetic part and an environmental part. The genetic part is the part of interest to researchers. Genetic markers were used to genotype the RILs; these markers divide the chromosome, and QTLs can be associated with them. This is done by fitting a linear model to the data using ANOVA.
The ANOVA model used for mapping eQTLs:

$y_i = \mu + Q_i + \varepsilon_i$

In this model $y_i$ is again the log2 transformed value of the gene expression, $\mu$ is the mean, $Q_i$ is the genotype effect (-1, 0, +1) and $\varepsilon_i$ is the residual error. This model is used to find expression QTLs from the genetic map and the expression levels. The genetic sequence of the BXD strain was obtained from GeneNetwork.

CREATING AN R-PACKAGE

The R programming language is a valuable tool in the statistical analysis of large data sets. When publishing a method to pre-process Illumina Sentrix array data files and to automatically generate quality control files and input files for goCluster, BINGO and BioNetBuilder, the usability of the created methods depends on how other researchers can access them. Because of the widespread use of R as the language for processing and analyzing biological data, an R-package containing the developed functions and the pipeline was created. This package can be easily downloaded and installed into the R programming environment; after installation the pipeline can be executed (with default settings) by a simple text command. An R-package enables novice users to easily do pre-processing themselves, while more experienced users are able to fully utilize the package and customize it to their own needs. Using R-packages has further advantages. The need to document code and functions is a major one: it enables end-users to better understand what the functions do and how they should be used. It also forces the programmer to think through every step made when creating the package.

Results

An R-package was made which supplies the used and created methods to other researchers.

DATA PREPARATION

Files were made from the summarized genetic profiles (SGP) data file provided by Illumina. This SGP file contained intensities, p-values and averaged numbers of beads for all samples used during the experiment. Each category (intensities, p-values and number of beads) was used to build a new file named after the category.
These files are smaller than the original dataset and are easier to handle in terms of processing power and memory usage. The created files are saved as CSV files for further use with R. Probe annotation was provided by the Department of Stem Cell Biology at the UMCG. These annotation files contained Illumina probe IDs as well as the corresponding Affymetrix probe IDs. The modified probe annotation file also contained an Illumina to Affymetrix probe ID conversion. Annotation of the probes could thus be done by merging two files, the annotation file and the intensity file.

QUALITY CONTROL: HISTOGRAMS AND BOXPLOTS

Analysis started with the generation of histograms to estimate the threshold for signal-to-noise ratios. Histograms also give information about the quality of hybridization of the arrays. The summarized histograms (Fig 2a) before and after normalization resemble a normal distribution, as expected for log2 transformed microarray data. To detect any faulty arrays (in terms of over/under hybridization), data distributions were made and plotted for each array, grouped by cell type. The data distributions of the different arrays showed no clear difference between samples or cell types. All these histogram plotting options have been added to the created R-package.

Boxplots were made of the log2 transformed intensities; the transformation reduces the range over which the intensities are spread and provides a better picture of the state of the data. Boxplots were made for raw data as well as normalized data. Boxplots per Sentrix array can reveal a batch effect, which can be detected by eye. Boxplots were also made of the biological replicates to see how much variation these contained before normalization. The boxplots of the raw data (Fig 3) show only a difference in distribution; the means of the intensity values seem to be fairly equal (across Sentrix arrays).
After quantile normalization the box plot shows less variation between arrays.

Fig 2a. Histograms summarized over all samples: left, log2 transformed before normalization; right, after normalization. No over/under hybridization is seen and the data look like a normal distribution (with more high values than low ones).

Fig 2b. Histograms (bar graphs) of all samples grouped by cell type. The x-axis of these 4 plots ranges from 0 to 20 (step size 2). No over-hybridization is seen in any of the arrays and the distributions are what would be expected of microarray data. The bar graphs show no clear outlying arrays, so all samples were used in the follow-up analysis.

Fig 3. Box plots of biological replicates (panels: Raw Data and Normalized Data). Numbers 1 and 2: BXD28 1 SKA-Kit+. Numbers 3 and 4: BXD28 2 SKA-Kit+. Numbers 5 and 6: BXD33 2 SKA-Kit+. Between-array variation of the samples is smaller after quantile normalization.

DATA PROCESSING: NORMALIZATION

There are numerous normalization methods available, each with its own merits and drawbacks. For Affymetrix arrays, normalization is usually done with the RMA method. This method is built around the Affymetrix probe-set design and uses the perfect match probe intensities to normalize the data. The data received from Illumina have already been corrected against the background, so our normalization method of choice was to do only a quantile normalization. This normalization method is implemented in the normalize.quantiles function from the Bioconductor affy[17] package. It was used because it was best suited to the data distributions seen when boxplotting the data (Fig 3: Raw Data). As stated earlier, many normalization schemes are available, and there is ongoing debate about the different normalization methods[18-21] and which to use in which cases. Discussing normalization methods in detail is beyond the scope of this thesis.
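The idea behind quantile normalization can be illustrated with a small, self-contained sketch. It is written in plain Python rather than R and is not the `normalize.quantiles` implementation from the affy package; the function name `quantile_normalize` and the toy intensities are assumptions made purely for illustration, and ties are resolved by sort order instead of being averaged.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize a list of equally long columns (one column per array).

    Hypothetical helper for illustration: every array is forced onto the same
    empirical distribution, namely the rank-wise mean across all arrays.
    """
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all arrays must contain the same number of probes")

    # Sort each array on its own, then average across arrays rank by rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [mean(col[r] for col in sorted_cols) for r in range(n_rows)]

    normalized = []
    for col in columns:
        # Probe indices ordered from lowest to highest intensity in this array.
        order = sorted(range(n_rows), key=lambda i: col[i])
        out = [0.0] * n_rows
        for rank, probe_index in enumerate(order):
            # The probe with a given rank receives the mean value of that rank.
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy example: three arrays (columns) with four probes each (made-up values).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

After this step every array contains exactly the same set of values, only assigned to different probes, which is why the box plots of normalized arrays line up.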
DIFFERENTIALLY EXPRESSED GENES

After normalization, differentially expressed probes were identified as described in the method section of this thesis. This function has also been implemented in the created R-package. Because all probes show some variation, a threshold was set by choosing an arbitrary cut-off p-value when searching for differentially expressed genes. A summary of threshold vs. hits is shown in Table 1.

Threshold      1.0E-02   1.0E-08   1.0E-10   1.0E-14
Stem cells       15176      3340      2088       815
Progenitor       10416      2234      1383       501
Erythrocyte      18523      5769      4200      2128
Granulocyte      15028      5030      3460      1655

Table 1. Threshold and hits. This table contains the number of differentially expressed genes found at a certain threshold (probability value). A threshold of 10^-14 was chosen for follow-up analysis.

With a p-value of 10^-14 there are enough hits to do the follow-up analysis, but not so many that it becomes (computationally) impossible. Lists of differentially expressed genes were constructed using a p-value of 10^-14.

HEATMAP ANALYSIS

These differentially expressed probes were then plotted as heat maps using the heatmap.2 function from the gplots package. With the heat map of the selected differentially expressed probes it is possible to check by eye whether the intensities of the selected probes differ from those in the other groups. This indicates whether probes suspected of being differentially expressed really are differentially expressed. Clustering analysis can also be done on heat maps. It uses distance (differences between the intensities) to generate a dendrogram, which can then be inspected by eye to see whether clustering of the differentially expressed genes recovers the four cell types used during the experiment. The expectation is that the differentially expressed genes cluster back to their cell type. Heatmap clustering analysis shows that the differentially expressed genes do cluster back to the cell types (1 = Stem cell, 2 = Progenitor, 3 = Erythrocyte & 4 = Granulocyte).
To confirm these results, clustering of all the genes was also performed. Clustering all genes also resulted in the groups defined by the four cell types.

Fig 4a: Heatmap of stem cell specific genes. Distance clustering shows 4 distinct groups (corresponding to each cell type). Specificity for cell type 1 (stem cells) can also be seen, because the distance between this cell type and the other three is largest.

Fig 4b: Heatmap of progenitor specific genes. Distance clustering shows distinct groups (corresponding to each cell type), with 1 exception: stem cells aren't clustered back into their original group. Specificity of the genes for the progenitor cell type seems to overlap with that of cell type 1 (stem cells). This is observed because of the small(er) distance between cell types 1 and 2.

Fig 4c: Heatmap of erythroid specific genes. Distance clustering shows the four distinct groups (corresponding to each cell type). Specificity for only the erythroid cell type is also observed.

Fig 4d: Heatmap of granulocyte specific genes. Distance clustering shows the four distinct groups (corresponding to each cell type). The distance between cell type 4 and the other three cell types indicates specificity for only the granulocyte cell type.

Fig 5: Heat map of all genes. Distance clustering shows the four distinct groups (corresponding to each cell type) when clustered. The stem cells (cell type 1) seem to consist of 2 groups that are somewhat different from each other, but still the variance within a group is higher than between groups. The progenitor group (cell type 2) has an outlier (left side); this outlier has a relatively large distance to the progenitor cell group. Cell types 1 and 2 seem more related (less distance between them), which is logical because progenitor cells are the descendants of stem cells.

GENE ONTOLOGY[22]: ANNOTATION AND CLUSTERING

Gene Ontology (GO) annotation was retrieved using the GO R-package.
A function was made to convert the differentially expressed gene lists into lists of Entrez GenBank[23, 24] identifiers. Each Entrez ID is translated into a GO object using GOENTREZID2GO() from the GO package. The output contains 3 vectors: GOID, the Gene Ontology IDs associated with this Entrez ID; GO ontology, the parent ontology group; and a GO Evidence vector. These 3 lists describe in which ontologies the gene is found and what evidence led researchers to annotate the gene with this gene ontology annotation.

Creating input for goCluster[25] is also handled by the created package. goCluster uses the lists of differentially expressed genes, modified to suit its input requirements. After loading the input, goCluster retrieves the matching ontology from the online GO database. The ontologies found are clustered and compared to the known ontology groups. If one (or more) clusters are overrepresented in the data supplied, this is reported as an enriched GO pathway. After the input has been submitted, goCluster needs more information: other parameters have to be set (e.g. clustering algorithm, false discovery rate and taxonomy), which gives users many options to tune. It is, for example, possible to choose from four different clustering algorithms and six distance measures. The goCluster output shown was made using the HClust clustering algorithm, Euclidean distance and a false discovery rate, set to control the number of false positives, at P = 0.05.

Fig 6a and 6b: Typical output of goCluster. Fig 6a: Molecular function in progenitor cells (left). Fig 6b: Biological processes in granulocytes (right). For each cell type 2 plots were made: biological process and molecular function. In each of the 2 plots shown here 3 gene ontology groups were found to be enriched.

These ontology codes were then translated into group names by using the gene ontology website.
The overrepresented ontology groups give insight into which molecular functions or biological processes could be involved in differentiation. The genes that make up these groups can be selected for further analysis. Because GO clustering is not an exact science (it is based on statistical analysis of 'associated groups' defined by the Gene Ontology Consortium), the results from goCluster were verified using BINGO[26], a plug-in for Cytoscape[27]. The input for the BINGO plug-in differs somewhat from that for goCluster: BINGO does not take the expression levels of the genes into account, but finds enrichment in lists of entrezIDs. The BINGO plug-in enables users of Cytoscape to do GO clustering. BINGO has fewer clustering options than the goCluster package, but as an advantage it generates a gene ontology network. This network can then be viewed and analyzed in Cytoscape, enabling users to visually inspect the overrepresentation of certain GO groups and their relationships.

Table 2. Summary of goCluster. This table summarizes the goCluster analysis. For each of the two main categories in the GO database, the pathways found to be enriched are given.
Cell types: Stem cell, Progenitor
Biological process: ubiquitin cycle; cellular macromolecule metabolic process; immune system process
Molecular function: none found; unfolded protein binding; monovalent inorganic cation transporter activity; phosphoric monoester hydrolase activity
Cell types: Granulocyte, Erythrocyte
Biological process: nucleobase, nucleoside, nucleotide and nucleic acid metabolic process; cellular carbohydrate metabolic process; methylation-dependent chromatin silencing; macromolecule metabolic process; M phase; cell communication; cell cycle; biogenic amine catabolic process; autophagy
Molecular function: nucleic acid binding; cobalt ion transporter activity; RNA binding; cytoskeletal protein binding; signal transducer activity; urea transporter activity; phospholipid-translocating ATPase activity; microtubule motor activity

The GO annotation confirms that the cell types extracted and analyzed are indeed the correct ones. Pathways can also be identified that could have an impact on the development of haemopoietic cells.

In the BINGO generated networks the node color denotes the significance of enrichment (darker is more significant). The results from BINGO are shown in Fig 7a, 7b and 7c. Only the parts of interest are shown; they are highlighted and explained in the caption of each picture. All of the GO pathways found in the goCluster analysis, except the ubiquitin cycle, were also overrepresented in the BINGO analysis. In BINGO the ubiquitin cycle is overrepresented in progenitor cells, so perhaps the transition from stem cell to progenitor cell is associated with changes in this ontology group.

Fig 7a: Cytoscape visualization of the BINGO generated stem cell ontology network. The outlined groups are development and intracellular signaling. These gene ontology groups are expected to be overrepresented in stem cells. No trace of the ubiquitin cycle can be found in the BINGO analysis.

Fig 7b: Cytoscape visualization of the BINGO generated granulocyte ontology network. The outlined groups are lytic vacuole and lysosome. These gene ontology localization groups are expected to be overrepresented in granulocytes, which are known to have organelles with a very low pH. There is also high concordance between goCluster and BINGO in the biological functions found to be enriched (data not shown).

Fig 7c: Cytoscape visualization of the BINGO generated erythroid ontology network. The outlined groups are heme biosynthesis and protein modification. These gene ontology groups are expected to be overrepresented in erythroid cells, because of their function as oxygen carriers.
There is also high concordance between goCluster and BINGO in the biological processes found to be enriched (data not shown).

INTERACTION NETWORKS

Gene interactions are important when analyzing the data further: they give insight into which genes in the differentially expressed groups are known to be associated with each other. These known interactions are available in a number of databases, from which they can be retrieved (see Web resources & Public databases used). For building the network itself the BioNetBuilder[28] plug-in for Cytoscape can be used. This plug-in can search several databases for known interactions between genes.

Before a network could be generated, the differentially expressed probes were annotated using the provided annotation file. From this file a list was made as input for BioNetBuilder, which uses GI (GenInfo Identifier) numbers to find known interactions. Different databases provide different information, which leads to different types of interaction that can all be included in the network, e.g. protein-protein interactions or protein-gene interactions. These interactions are then visualized by Cytoscape and give insight into what is going on biologically in the observed clusters of genes. Different manipulations can be applied to the networks, such as subtracting networks, merging them and finding differences. Fig 8 is a typical output from the BioNetBuilder plug-in; different line colors denote different types of evidence or source database for an interaction. The package creates input for the plug-in and thus gives researchers an easy way to visualize the interactions in selected gene groups.

Striking results from the interactions between differentially expressed genes were a proteasome group in stem cells (perhaps related to the ubiquitin cycle) and a ubiquinone group in progenitor cells; in both cell types protein kinases can also be seen.
In granulocytes, ubiquitin and ubiquinone genes were seen among the differentially expressed genes.

Fig 8: Cytoscape view of known biological interactions. This picture shows the interactions between the differentially expressed genes of the stem cell group (only sub-networks with >2 edges are shown). This kind of interaction plot shows which genes are known to interact, and can be used to search for new interactions. It also gives insight into which kinds of interaction are present between differentially expressed genes.

EXPRESSION QTL MAPPING

The expression of each gene was associated with molecular markers. These molecular markers divide up the entire genome of the Mus musculus strain BXD. Information about the positions of the markers and their origin was taken from the online GeneNetwork database. When the location of the eQTL of a gene is plotted against the location of that gene in a two-dimensional plot, two phenomena are seen:

Cis-acting genes: genes for which the highest likelihood of regulation falls in the same region as where the gene itself is located on the genome. These form the diagonal of the plot, which is normally the most striking feature of an eQTL plot.

Trans-acting genes: genes whose regulation maps to a location elsewhere on the genome. These trans-acting genes usually fall inside transbands, seen as horizontal (or vertical) lines in the eQTL plot: a single location on a chromosome that regulates the expression of many genes at other locations. The cis-acting genes in these transbands could be biologically important in the differentiation from stem cell to erythrocyte or granulocyte.

Expression QTL plots (Fig 9) were made; the algorithm for the eQTL analysis can be found in the created R-package. The eQTL plots made for each cell type show clear transbands, which can be analyzed for possible candidate genes regulating differentiation.
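The distinction between cis-acting and trans-acting genes, and the idea of a transband, can be sketched in a few lines. The example below is in Python and is not the algorithm from the created R-package. The cis window of 5 Mb, the minimum of three genes per transband and all names are assumptions made purely for illustration.

```python
from collections import Counter

CIS_WINDOW_MB = 5.0  # assumed window size, not a value from the thesis


def classify_eqtl(gene_chr, gene_mb, marker_chr, marker_mb,
                  window_mb=CIS_WINDOW_MB):
    """Label an eQTL 'cis' when the best marker lies on the same chromosome
    as the gene and within window_mb of it, otherwise 'trans'."""
    if gene_chr == marker_chr and abs(gene_mb - marker_mb) <= window_mb:
        return "cis"
    return "trans"


def transband_candidates(eqtls, min_genes=3, window_mb=CIS_WINDOW_MB):
    """Count trans-acting genes per marker and keep markers that regulate
    at least min_genes genes (a crude stand-in for a visible transband).

    eqtls -- iterable of tuples
             (gene, gene_chr, gene_mb, marker, marker_chr, marker_mb)
    """
    counts = Counter()
    for gene, g_chr, g_mb, marker, m_chr, m_mb in eqtls:
        if classify_eqtl(g_chr, g_mb, m_chr, m_mb, window_mb) == "trans":
            counts[marker] += 1
    return {marker: n for marker, n in counts.items() if n >= min_genes}


# Hypothetical example data: one cis eQTL and four genes mapping to marker mB.
example_eqtls = [
    ("g1", "1", 10.0, "mA", "1", 12.0),
    ("g2", "2", 50.0, "mB", "7", 30.0),
    ("g3", "3", 5.0, "mB", "7", 30.0),
    ("g4", "X", 80.0, "mB", "7", 30.0),
    ("g5", "7", 90.0, "mB", "7", 30.0),
]
print(transband_candidates(example_eqtls))
```

A marker that collects many trans-acting genes corresponds to a horizontal (or vertical) line in the eQTL plot, while genes labelled cis fall on the diagonal.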
Normally the cis-acting genes show up on the diagonal of the plots, and all the cis-acting genes combined give the clearest line in the plot. The eQTL plots for stem cells, erythrocytes and granulocytes show mainly cis-acting genes and only some faint transbands. For the progenitor cell type there is a strange looking transband, which could have biological importance, e.g. as a master switch controlling gene expression in that cell type. It could also be an artifact of the experiment; this transband (and the other transbands) can be analyzed in further research.

Fig 9: Expression QTL plots for all cell types. For each cell type transbands are seen; these locations on the genome have a statistical influence on the expression of genes genome wide. The very clear transband in the progenitor cell type is still under active investigation, not because of its biological interest but because it is more visible than the cis-acting genes. Other transbands identify regions that are biologically interesting; the genes at those locations can be listed and analyzed with GO or through interaction networks. This will decrease the number of possible candidate genes and help researchers focus their resources.

Discussion

Pathway information can be reconstructed from a genetical genomics experiment using bioinformatics. This can help researchers in finding novel drug targets or in uncovering new metabolic routes or differentiation pathways. Microarray driven research will continue to be used in medicine and fundamental research. Novel array designs and analysis methods will continue to speed up experiments and increase their size, thus increasing our knowledge of the genetics underlying complex traits. With the increase in complexity that comes with larger experiments, new methods also have to be developed to cope with these amounts of data.
The data used during this bachelor thesis was supplied by Illumina on a hard drive. The raw signal data was 17.7 gigabytes in size, and a summary of this data still takes around 1 gigabyte. Compressing this much data into a thesis or another comprehensible format (a Word document of about 1.5 megabytes) is only possible due to the advances made in bioinformatics in the past.

Because of these huge amounts of data, quality control should have the utmost priority in any genetical genomics experiment. In this thesis the obtained data was relatively good; however, we cannot verify which processing steps were undertaken by Illumina before the data was supplied. Because the data provided by Illumina was not the raw bead level data but a summary of all the beads on one array (per bead type), no conclusions can be drawn about Illumina's decoding steps.

Normalization of microarray data always presents a challenge; the normalization method chosen should suit the data. We would expect the data to follow (in general terms) a somewhat normal distribution, and when more information is available in the dataset (e.g. Affymetrix technology, which contains match and mismatch probes) this should be used in the normalization step during pre-processing. Because there are numerous normalization methods, and each method has at least several variants, it is perhaps better to let the user of the created package choose which method should be used. With an R package it is relatively easy to add other normalization methods and let the user choose which one should be applied. This does not mean that normalization should be taken too lightly: choosing an unsuitable normalization approach will corrupt the data, and this will inevitably lead to wrong conclusions.

The detection of differentially expressed genes relies on a relatively simple statistical test (t-test).
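As an illustration of such a test, the sketch below computes a t statistic per probe between two groups of arrays and keeps the probes that exceed a fixed cutoff. It is written in Python and is not the code of the created R-package. It assumes the Welch (unequal variance) form of the t statistic, a user-chosen cutoff on the absolute t value rather than on a p-value, and hypothetical function and argument names.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch t statistic for two groups of intensities (each needs >= 2 values)."""
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a)
              + variance(group_b) / len(group_b))
    if se == 0.0:
        # Both groups are constant: no evidence if equal, infinite otherwise.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def differential_probes(expr_a, expr_b, t_threshold):
    """Return the sorted probe ids whose |t| reaches t_threshold.

    expr_a, expr_b -- hypothetical mappings: probe id -> list of intensities
                      for the two conditions being compared
    """
    return sorted(
        probe for probe in expr_a
        if probe in expr_b
        and abs(welch_t(expr_a[probe], expr_b[probe])) >= t_threshold
    )
```

Turning the t statistic into a p-value would additionally require the t distribution, which is left out here to keep the sketch free of external dependencies.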
Several other statistical tests can be used to detect significant changes in intensity between conditions, and as with normalization the method of choice usually depends on the type of experiment and on the size of the groups or conditions involved. The detection of differentially expressed genes requires a threshold; currently the package uses a static threshold set by the user. It would be better to calculate this threshold from the data itself, which would simplify the selection of differentially expressed probes.

Another problem is multiple testing and the resulting false positives: genes that are not truly changed but are nevertheless picked up by the statistical test. This problem is hard to circumvent, because every statistical test has a certain false positive rate. Setting a false discovery rate (FDR) to correct for multiple comparisons will decrease the number of false positives. However, at an FDR of 0.05 about 5% of the probes called significant are still expected to be false positives, and with 46,000 probes analyzed this leaves a lot of room for errors. In addition, a lack of discriminating power means that not every differentially expressed gene will be picked up, and when FDR control is used power suffers even more. All of these problems affect gene ontology clustering (which is also a statistical process), which could lead to a pathway being reported as overrepresented when it is not, and vice versa.

Expression QTL mapping is a relatively new technique, and few problems are known to be associated with it other than false positive and false negative QTLs caused by the statistical tests. These can lead to false transbands, which instead of helping researchers find locations of interest steer them towards uninteresting regions. The transbands found in this thesis should be compared to the transbands found in the earlier research into stem cells done by the UMCG.
When these match, more will be known about the interactions within each cell and how these interactions change between cell types. In this way eQTL mapping will help in understanding the basics of haemopoietic development.

Future perspectives

To continue this research and find out which genes are involved in the complex trait of differentiation, several steps should be taken. These steps include scaling up the experiment to increase sensitivity, but also analysis of the 'suspected' genes using recombinant techniques. These suspected genes can be found at the locations where transbands are seen. Combined analysis of gene ontology, interaction networks and expression QTL maps can narrow down the suspected genes in those regions. The analysis of the progenitor cells also has to be checked, and an explanation has to be found for why the progenitor cell type has such a clear transband. Identifying whether this is a source of error or another effect is important for further analysis. Implementing a view that makes it possible to see shifts in QTL transbands is also necessary if biological users are to actively use this software.

The protocol for analysis and the created package can be extended to include more normalization methods or other annotation schemes; this will depend on future requirements and questions from biologists and life science researchers. Further analysis will reveal which genes are responsible for differentiation and proliferation, and this knowledge can then be used in many fields associated with stem cell biology. Parallels can also be drawn to humans, because Mus musculus has long been used as a model animal for human disease and development. Identifying which pathways, genes or transbands are associated with erythrocyte and granulocyte development will help the researchers of tomorrow develop novel applications in several fields.
Another step that has to be taken in the future is the use of bead level data to check probe results. Some probes will not always function as expected, or will show cross hybridization from multiple sources that interferes with the results of the analysis. By studying those effects in more detail, other sources of variation can be excluded, increasing the power of detection and yielding a cleaner dataset. This approach also includes checking the barcodes used by Illumina to see whether they could interfere with the experiment.

A comparison with the Affymetrix GeneChips, to validate the cross-platform compatibility of experiments and results, will also have to be done in the future. This can be done by comparing the QTLs obtained from the Illumina experiment to those from the Affymetrix experiment. If similar genes are found and they show similar behavior (cis/trans and up/down regulation) on both platforms, this will validate the results obtained from the Illumina experiment.

In the future, a protocol for handling Illumina data that is more general than the method proposed here should perhaps be implemented, or the created package should be redesigned to offer more advanced analyses to the user. End users of the created R-package should be able to do more with it if it is ever going to be used for pre-processing and analyzing Illumina data. A graphical user interface (GUI) could be added for more biologically oriented users, or a web interface to make the package easier to use. The package and the functions it provides should also be checked and submitted to CRAN (Comprehensive R Archive Network). Acceptance by CRAN will make the package easily accessible from all over the world.

Combining eQTL analysis with gene ontology or interaction networks would help biological users to understand more quickly what is going on in the analyzed cell.
This combination allows gene ontology analysis to use the information generated during QTL analysis to increase knowledge about pathways and their regulation and regulators. The same holds for protein interaction networks: information from the QTL analysis could be integrated into the interaction network to find new interaction partners of genes, or to identify regions controlling changes in interaction during haemopoietic development.

Literature

1. Jansen RC, Nap JP: Genetical genomics: the added value from segregation. Trends Genet 2001, 17(7):388-391.
2. de Koning DJ, Haley CS: Genetical genomics in humans and model organisms. Trends Genet 2005, 21(7):377-381.
3. Shen R, Fan JB, Campbell D, Chang W, Chen J, Doucet D, Yeakley J, Bibikova M, Wickham Garcia E, McBride C et al: High-throughput SNP genotyping on universal bead arrays. Mutat Res 2005, 573(1-2):70-82.
4. Jansen RC: Studying complex biological systems using multifactorial perturbation. Nat Rev Genet 2003, 4(2):145-151.
5. Steemers FJ, Gunderson KL: Illumina, Inc. Pharmacogenomics 2005, 6(7):777-782.
6. Bystrykh L, Weersing E, Dontje B, Sutton S, Pletcher MT, Wiltshire T, Su AI, Vellenga E, Wang J, Manly KF et al: Uncovering regulatory pathways that affect haemopoietic stem cell function using 'genetical genomics'. Nat Genet 2005, 37(3):225-232.
7. Kamminga LM, Bystrykh LV, de Boer A, Houwer S, Douma J, Weersing E, Dontje B, de Haan G: The Polycomb group gene Ezh2 prevents haemopoietic stem cell exhaustion. Blood 2006, 107(5):2170-2179.
8. Steemers FJ, Gunderson KL: Whole genome genotyping technologies on the BeadArray platform. Biotechnol J 2007, 2(1):41-49.
9. Eggle D, Schultze J: IlluminaGUI: Graphical User Interface for analyzing gene expression data generated on the Illumina platform. Bioinformatics 2007.
10. Fan JB, Gunderson KL, Bibikova M, Yeakley JM, Chen J, Wickham Garcia E, Lebruska LL, Laurent M, Shen R, Barker D: Illumina universal bead arrays. Methods Enzymol 2006, 410:57-73.
11. Kuhn K, Baker SC, Chudin E, Lieu MH, Oeser S, Bennett H, Rigault P, Barker D, McDaniel TK, Chee MS: A novel, high-performance random array platform for quantitative gene expression profiling. Genome Res 2004, 14(11):2347-2356.
12. Gunderson KL, Kruglyak S, Graige MS, Garcia F, Kermani BG, Zhao C, Che D, Dickinson T, Wickham E, Bierle J et al: Decoding randomly ordered DNA arrays. Genome Res 2004, 14(5):870-877.
13. Bueno Filho JS, Gilmour SG, Rosa GJ: Design of microarray experiments for genetical genomics studies. Genetics 2006, 174(2):945-957.
14. Verdugo RA, Medrano JF: Comparison of gene coverage of mouse oligonucleotide microarray platforms. BMC Genomics 2006, 7:58.
15. Steinberg G, Stromsborg K, Thomas L, Barker D, Zhao C: Strategies for covalent attachment of DNA to beads. Biopolymers 2004, 73(5):597-605.
16. Joos B, Kuster H, Cone R: Covalent attachment of hybridizable oligonucleotides to glass supports. Anal Biochem 1997, 247(1):96-101.
17. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307-315.
18. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185-193.
19. Boes T, Neuhauser M: Normalization for Affymetrix GeneChips. Methods Inf Med 2005, 44(3):414-417.
20. Stoyanova R, Querec TD, Brown TR, Patriotis C: Normalization of single-channel DNA array data by principal component analysis. Bioinformatics 2004, 20(11):1772-1784.
21. Schuchhardt J, Beule D, Malik A, Wolski E, Eickhoff H, Lehrach H, Herzel H: Normalization strategies for cDNA microarrays. Nucleic Acids Res 2000, 28(10):E47.
22. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.
23. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res 2007, 35(Database issue):D21-25.
24. McEntyre J: Linking up with Entrez. Trends Genet 1998, 14(1):39-40.
25. Wrobel G, Chalmel F, Primig M: goCluster integrates statistical analysis and functional interpretation of microarray expression data. Bioinformatics 2005, 21(17):3575-3577.
26. Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 2005, 21(16):3448-3449.
27. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498-2504.
28. Avila-Campillo I, Drew K, Lin J, Reiss DJ, Bonneau R: BioNetBuilder: automatic integration of biological networks. Bioinformatics 2007, 23(3):392-393.

Web resources & Public databases

WEB RESOURCES
http://www.arisesoft.com
http://www.Illumina.com
http://www.bioconductor.org
http://www.ncbi.nlm.nih.gov/entrez
http://www.geneontology.org
http://www.genenetwork.org
http://r-project.org
http://cran.r-project.org
http://www.cytoscape.org
http://err.bio.nyu.edu/cytoscape/bionetbuilder

PUBLIC DATABASES
http://www.genome.jp/kegg - KEGG - Kyoto encyclopedia of genes and genomes.
http://bond.unleashedinformatics.com - BIND - Biomolecular interaction network database.
http://www.thebiogrid.org - BioGRID - General repository for interaction datasets.
http://dip.doe-mbi.ucla.edu - DIP - Database of interacting proteins.
http://128.97.39.94/cgi-bin/functionator/pronav - ProLINK - Protein interactions.