"Analysis of Expression Data: An Overview". In: Current Protocols in Bioinformatics
Analysis of Expression Data: An Overview

Transcriptome profiling, the simultaneous measurement and analysis of expression changes in thousands of RNAs, has been enabled largely by microarray technology (Schena et al., 1995). Technologies such as serial analysis of gene expression (SAGE) or, more recently, massively parallel signature sequencing (MPSS; Brenner et al., 2000) provide alternative means to measure mRNA abundance. While all three technologies generate high-volume data that require serious consideration of methods of analysis, this overview focuses on analysis of data from the most widely used platform, microarrays.

Since their inception in 1989, the underlying technology and DNA probe content applied to microarrays have diversified considerably. Probes may be deposited or directly synthesized on a silica chip, on a glass slide, in a gel matrix, or on a bead. Probes vary in length from as short as 20 nucleotides (often synthesized in situ) to as long as several thousand bases (using spotted cDNA technology). In fact, the DNA does not even have to correspond exclusively to transcribed sequences, as when microarrays are used for comparative genome hybridization or single nucleotide polymorphism (SNP) analysis.

One-Color Versus Two-Color Arrays

Microarray technologies can generally be divided into two categories: one-color arrays and two-color arrays. In one-color arrays, one RNA sample is processed, labeled with fluorescent dye, and applied to one microarray. Thus, one raw intensity score is generated from each feature or spot on the array (a feature or spot refers to the immobilized DNA at one physical location on a microarray that corresponds to one sequence). With two-color arrays, two RNA samples are applied to each microarray (Shalon et al., 1996). A separate raw intensity score can still be obtained for each sample from each feature because the RNA samples are labeled with dyes that fluoresce across different spectral ranges.
One advantage of two-color arrays is that one of the two labeling fluors can serve as an internal standard. Conversely, one-color arrays are often easier to use in large cohort studies, where obtaining a common reference sample for all subjects is difficult. Ultimately, the choice of technology is very often influenced by the experimental design.

EXPERIMENTAL DESIGN

The term "experimental design" refers to the structure of the experiment and dictates the structure of the resulting data as well as the types of analyses that can be performed. Because of this, the design phase takes place before any experimentation is carried out, and should be carefully considered to maximize the usefulness of the experiment. The primary consideration in designing an experiment should be what question or questions the experiment will attempt to answer. Typical goals of a microarray experiment are to find genes that are differentially expressed between two treatments, or to study how expression levels in a set of genes change across a set of treatments. Because baseline gene expression levels are not usually known beforehand, most microarray experiments involve comparing gene expression levels between treatment samples and control samples.

Once treatment and control conditions have been decided upon, the next step is specifying the units or samples to which the treatments will be applied and the rules by which these treatments are to be allocated. In the simplest case, mRNA samples are prepared for both treatment and control conditions, and then hybridized either to separate one-color chips or together on a two-color chip. More sophisticated designs are also possible, such as dye-swap and loop designs (Kerr and Churchill, 2001); these have the advantage of increased control over non-biological sources of variation, such as labeling efficiencies, dye effects, hybridization differences, and other sources of error arising from the measurement process.
Since conventional statistical techniques depend on the existence of replicates, the number and structure of replicates need to be determined during the design phase. Replication structure will typically be dictated by the goals of the experiment; the number of replicates to use can be estimated by deciding which statistical analyses will be used and performing a priori power calculations using data from a pilot experiment. The goal of these calculations is to estimate the number of replicates that will be needed to detect specified differences in gene expression levels.

Contributed by Anoop Grewal, Peter Lambert, and Jordan Stockton. Current Protocols in Bioinformatics (2007) 7.1.1-7.1.12. Copyright 2007 by John Wiley & Sons, Inc. UNIT 7.1.

Frequently, the cost of microarrays or tissue availability prohibits the researcher from conducting a pilot study or using the optimal number of biological replicates recommended by a statistician. While using more biological replicates is always better, most reviewers currently accept experimental designs with at least three biological replicates per experimental condition. Bear in mind that variability between replicate samples will depend on the model system. For instance, an in vitro yeast study employing cells with identical genetic backgrounds is likely to yield more significant results with three replicates per group than a study measuring expression from blood samples of humans, whose genetic backgrounds and daily routines add variability between samples. The latter study will almost certainly necessitate many more biological replicates per treatment to yield significant results. The end goal of any microarray experiment is to precisely measure some type of biological variation; an optimal design will facilitate this while also controlling for as many nonrelevant sources of variation as possible.
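The a priori power calculation described above can be sketched with a standard normal-approximation formula for a two-sample comparison. This is a minimal illustration, not the procedure a statistician would necessarily use; the function name is hypothetical, and the effect size must come from pilot data.

```python
from math import ceil
from statistics import NormalDist

def replicates_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-sample comparison.

    effect_size: expected difference in mean log expression divided by
    the standard deviation estimated from a pilot experiment.
    Uses the normal approximation n = 2 (z_alpha/2 + z_power)^2 / d^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g., 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # e.g., 0.84 for 80% power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)
```

For example, detecting a one-standard-deviation difference at 80% power and alpha = 0.05 requires roughly 16 replicates per group, which is consistent with the observation that highly variable model systems (such as human cohorts) need far more than three replicates.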
To this end, early interaction and communication between biologists and statisticians is desirable, and can result in higher-quality data that are more suitable for answering the questions of interest.

RAW DATA OUTPUT

Scanner Data Processing

Virtually all array scanners include software that converts a scanned image into a tabular set of signals, each corresponding to a feature on the array. In the case of Affymetrix GeneChips, each transcript is measured by multiple features consisting of 25-mer sequences that map to different regions of the same transcript (Lipshutz et al., 1999). To provide some additional information about nonspecific hybridization and background, each of these features is paired with a control mismatch feature that matches the 'perfect match' 25-mer at all positions except for a single nucleotide difference, or 'mismatch,' at the central or 13th position. A collection of 11 to 20 pairs of such features mapping to a common transcript is referred to as a probeset. Thus, the raw data gathered at this stage need to be processed further, in a step referred to as 'probe-level analysis,' to yield a single expression measurement for a given RNA transcript. Probe-level analysis is discussed in greater detail below (see Data Normalization).

In addition to the summary expression score obtained for each RNA transcript, scanner software may produce many additional statistics to allow the data analyst to validate the overall RNA sample quality, to measure sample-to-sample consistency, to assess spot-to-spot variability, and to assess other sources of technical variation. Scanner software may also include normalization options to correct for technical variations across dyes (for two-color arrays), features, and arrays.
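The perfect-match/mismatch pairing described above is easy to make concrete: the mismatch probe is the 25-mer with its central (13th) base replaced by its complement. A minimal sketch, with a hypothetical function name:

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm_probe):
    """Return the control mismatch for a 25-mer perfect-match probe:
    identical at all positions except the central (13th) base, which is
    replaced by its complement."""
    if len(pm_probe) != 25:
        raise ValueError("Affymetrix probes are 25-mers")
    center = 12  # 13th position, 0-based indexing
    return (pm_probe[:center]
            + COMPLEMENT[pm_probe[center]]
            + pm_probe[center + 1:])
```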
Flags

One type of data value designed to indicate the quality of a probe score or spot is referred to as the "flag." Most, but not all, scanners will generate at least one flag call, usually a nominal value, per feature. Earlier-generation Affymetrix software (MAS 4.0 and MAS 5.0) reports a detection call that may be treated as a 'flag' for each probeset (i.e., RNA transcript) by some software, but this call is meant to be interpreted differently from other scanner vendors' flags. Here, the flag is a call assessing whether a gene is expressed at a detectable level (flag value = Present) or not (flag value = Absent). A third possible value, 'Marginal,' designates a situation where a definitive call of 'Present' or 'Absent' cannot be made. Because of the scanner-to-scanner variability in what a flag value means, microarray data analysis software supporting multiple platforms makes limited use of flag information. When flag values are imported, they can be used in quality control assessment to remove associated feature measurements when the flag indicates spot quality problems. In the case of Affymetrix data, flag values can be used at a first pass to restrict the genes of interest to those that are 'Present' in at least one, or some other minimal number, of the samples in the study.

Microarray Data Analysis Software

Software specialized for microarray data is typically used to analyze the raw microarray data output. These software solutions are preferred to more generic statistical software packages because of the inclusion of normalization methods, clustering algorithms, and tools for biological interpretation, such as gene ontology analysis, gene annotation management, and pathway analysis. Thus, as data analysis techniques are discussed below, companion tables provide a guide to software solutions that offer the described techniques.
Note that products are listed by features that are current to the publication of this unit and that not all product features are discussed below. Additional features are likely to be added to products, so check the product Website for current capabilities. Table 7.1.1 lists commonly used software solutions for microarray data analysis according to the general microarray platforms (Affymetrix, other one-color, and two-color) they support.

Table 7.1.1 Microarray Platforms Supported by Common Data Analysis Software Packages(a)

Software (manufacturer or citation)                 Affymetrix(b)   Other one-color   Two-color
Acuity (Molecular Devices)                          X               X                 X
ArrayAssist (Stratagene)                            X               X                 X
ArrayStat (Imaging Research)                        X               X                 X
Avadis (Strand Genomics)                            X               X                 X
BioConductor (Gentleman et al., 2004)               X               X                 X
ChipInspector (Genomatix)                           X
d-Chip (Li and Wong, 2001)                          X
Expressionist (Genedata)                            X               X                 X
GeneLinker Gold (Improved Outcomes Software)        X               X                 X
GeneSifter (VizX Labs)                              X               X                 X
GeneSight (BioDiscovery)                            X               X                 X
GeneSpring (Agilent)                                X               X                 X
J-Express Pro (Molmine)                             X               X                 X
Partek Genomics Suite (Partek)                      X               X                 X
Rosetta Resolver System (Rosetta Biosoftware)       X               X                 X
S+ArrayAnalyzer (Insightful)                        X               X                 X
TeraGenomics (IMC)                                  X
TM4 (Saeed et al., 2003)                            X               X                 X

(a) Check with software vendor for automated recognition of specific scanner software output files.
(b) Affymetrix is listed separately from other one-color arrays because some software only supports Affymetrix data.

DATA NORMALIZATION

Gene expression data come from a variety of sources and are measured in a variety of ways. The measurements are usually in arbitrary units, so normalization is necessary to compare values between different genes, samples, or experiments. The goal of normalization is usually to produce a dimensionless number that describes the relative abundance of each gene in each sample or experiment.
Ideally, any difference in raw expression scores for a gene across two conditions could be interpreted as having biological significance. However, technical variation can be introduced at many steps of experimental sample processing. Thus, normalization methods are required to minimize the impact of technical variation. Affymetrix probe-level analysis methods usually apply normalization steps during the course of calculating a summary score and are therefore discussed in detail below.

One-Color Arrays

A gene expression score corresponding to the fluorescence intensity from an array feature cannot be easily converted to a measurement with biologically relevant units (such as transcripts per cell or per microgram RNA). Since the goal of gene expression studies is generally to identify differentially expressed genes between or among conditions, biologically relevant units are not necessary. Rather, it is necessary to apply normalizations, mathematical techniques that minimize bias introduced by technical variation, so that measurements can nevertheless be compared across arrays and even across genes. Common global normalization methods employed to correct for chip-to-chip variation include median-centering normalization (dividing all measurements on an array by the median feature measurement), mean-centering normalization, Z-score normalization, quantile normalization, median polishing, and selected-gene-set (i.e., housekeeping gene)-based normalizations.

Affymetrix Data Pre-Processing and Normalization: Probe-Level Analysis

Because multiple methods exist for calculating a summary score for the probes that correspond to a single transcript on an Affymetrix GeneChip, and because such methods can produce different results when the output is used to perform secondary analyses, an overview of the most common available methods is provided.
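The simplest of the global normalizations listed above, median-centering, can be sketched in a few lines. This is an illustration only (the function name and matrix layout, features in rows and arrays in columns, are my own conventions):

```python
import numpy as np

def median_center(intensities):
    """Per-chip median-centering normalization: divide every feature
    measurement on an array by that array's median intensity.
    Rows = features, columns = arrays."""
    x = np.asarray(intensities, dtype=float)
    return x / np.median(x, axis=0)
```

After this step, each array has a median of 1.0, so chip-to-chip differences in overall brightness no longer dominate comparisons between arrays.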
Table 7.1.2 indicates the probe-level processing capabilities of data analysis software packages. The original method for probe-level analysis, the average difference method (Affymetrix Microarray Analysis Suite, MAS version 4.0, Santa Clara), was applied to most Affymetrix data dating from before 2001. MAS 4.0 calculated a robust average from the set of differences between the perfect matches (PM) and their respective mismatches (MM) over a probeset. When assessed for precision, consistency, specificity, and sensitivity using data from a GeneLogic study (2002), in which several different cRNA sequences were spiked into RNA samples at known amounts, as well as a companion study employing RNA samples applied to Affymetrix arrays in a dilution series, results indicate that the summary scores calculated for a single gene do not correlate linearly over a range of spike-in or dilution concentrations. To reduce the dependence on variance, which is not equal over the intensity range (a problem with the original Affymetrix model), and to prevent the generation of meaningless negative signal statistics, Affymetrix MAS 5.0 probe-level analysis (Affymetrix Microarray Analysis Suite, version 5.0, Santa Clara) includes an adjustment that avoids calculating negative numbers and computes a Tukey-biweight-derived robust mean over log-transformed values. In both cases, array-to-array technical variation is further reduced by scaling all measurements to a trimmed mean over the array. Li and Wong (2001) have since developed the Model Based Expression Index (MBEI) for calculating summary scores, and Irizarry et al. have proposed the Robust Multi-array Average method (RMA; Irizarry et al., 2003). The RMA method is notable for employing quantile normalization, which forces the distributions of probe-level measurements to be equal across multiple arrays before probe-set summaries are calculated.
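The quantile normalization step used by RMA can be expressed compactly: replace the k-th smallest value in every array with the mean of the k-th smallest values across arrays, so that all arrays end up with the identical distribution. A minimal sketch (function name is my own; ties are broken arbitrarily, whereas production implementations handle them more carefully):

```python
import numpy as np

def quantile_normalize(x):
    """Quantile normalization as used by RMA: force every array (column)
    to share the same distribution. Rows = probes, columns = arrays."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)                  # per-column sort order
    ranks = np.argsort(order, axis=0)              # rank of each entry in its column
    mean_sorted = np.sort(x, axis=0).mean(axis=1)  # reference distribution
    return mean_sorted[ranks]                      # map each entry to the reference
```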
A modification of RMA, GC-RMA (Wu et al., 2004), uses the mismatch data that RMA ignores to model the effects of GC content on nonspecific binding. Affymetrix has introduced the PLIER algorithm, which applies corrections similar to those of RMA and GC-RMA. The results of these algorithms and other algorithmic variations, as applied to the GeneLogic spike-in and dilution series datasets, can be viewed at the Affycomp Website of Cope et al. (2004): http://affycomp.biostat.jhsph.edu.

Table 7.1.2 Software Solutions Offering Affymetrix Probe-Level Analysis

Software (manufacturer or citation)               GC-RMA   MAS5   MBEI   PLIER   RMA
Acuity (Molecular Devices)                                 X
ArrayAssist (Stratagene)                          X        X      X      X       X
ArrayAssist Lite (Affymetrix/Stratagene)          X        X      X      X       X
Avadis (Strand Genomics)                          X        X      X      X       X
Bioconductor                                      X        X      X      X       X
dChip (Li and Wong, 2001)                                         X
DecisionSite for Microarray Analysis (Spotfire)
GeneSpring (Agilent Technologies)                 X        X             X       X
Expressionist (GeneData)                          X        X      X      X
Genowiz (Ocimum Biosolutions)
S+ArrayAnalyzer (Insightful)                                                     X
TeraGenomics (IMC)                                X        X      X      X
RMAExpress 0.4.1 (Bolstad, 2006)                                                 X

Two-Color Arrays

In a two-color array, two measurements are obtained for each feature. Typically, these measurements are derived from the hybridization of control and experimental samples, each labeled with a different-colored dye. Because of differences in the rates of dye incorporation, and different detection efficiencies of the scanners at different wavelengths, it is often worthwhile to normalize with respect to the overall expression of each dye, in the same way as the per-chip normalization described above for one-color measurements. A normalization that uses Lowess curve fitting (Yang et al., 2002) works well in this situation.
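The intensity-dependent dye correction can be illustrated on the usual M-A coordinates (M = log2 ratio, A = mean log2 intensity). The sketch below uses a running median along A as a crude stand-in for the Lowess fit of Yang et al. (2002); the function name and window size are my own choices, and a real analysis would use a proper Lowess smoother.

```python
import numpy as np

def ma_normalize(red, green, window=101):
    """Intensity-dependent normalization for one two-color array.
    Computes M = log2(R/G) and A = mean log2 intensity, then subtracts a
    running median of M along A (a simplified stand-in for Lowess) so the
    corrected log ratios are centered on zero at every intensity."""
    red, green = np.asarray(red, float), np.asarray(green, float)
    m = np.log2(red) - np.log2(green)
    a = 0.5 * (np.log2(red) + np.log2(green))
    idx = np.argsort(a)                 # order features by intensity
    half = window // 2
    corrected = np.empty_like(m)
    for pos, i in enumerate(idx):
        lo, hi = max(0, pos - half), min(len(m), pos + half + 1)
        corrected[i] = m[i] - np.median(m[idx[lo:hi]])
    return corrected
```

A constant dye bias (e.g., the red channel uniformly twice the green) is removed entirely, since the local median of M equals the bias at every intensity.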
DATA ANALYSIS

Identifying Differentially Expressed Genes

Once normalizations similar to those described above have been applied to gene expression data, the results are ratios that can be analyzed using standard statistical techniques. While earlier microarray data analyses focused on using fold change as the sole criterion for differential expression, the limitations of this approach have become apparent. Traditional data-driven statistical techniques, such as ANOVA and t tests, base the significance of differential expression on the variation across replicate groups versus that within replicate measurements. Most of these techniques depend upon the existence of replicate measurements to estimate variability and error. When replicates are not available, there are alternative methods that can be used to estimate error; these are described below.

When analyzing ratios resulting from microarray data, it is generally a good idea to first apply a log transform. This is because treatment effects on gene expression levels are generally believed to be multiplicative; the log transform converts them to an additive scale, placing the data on a linear scale with the resulting values symmetric about zero. The most straightforward method of identifying differential expression is to apply a series of t tests on the log ratios, on a gene-by-gene basis. For two-color data, where the ratios contain expression values for both the control and treatment, a one-sample t test can be performed. When comparing across more than two treatments, e.g., in a time-series experiment, this approach can be generalized by using ANOVA instead of simple t tests. For each gene, the means in all treatment conditions are compared simultaneously and a single p value is generated. If the p value falls below the threshold, the gene being tested can be considered differentially expressed.
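The gene-by-gene one-sample t test on log ratios can be written directly with standard statistical routines. A minimal sketch (function name and matrix layout are my own): under the null hypothesis of no differential expression, the mean log2(treatment/control) ratio for a gene is zero.

```python
import numpy as np
from scipy import stats

def one_sample_gene_tests(log_ratios):
    """Gene-by-gene one-sample t tests on log2(treatment/control) ratios
    from replicate two-color arrays. H0 for each gene: mean log ratio = 0.
    Rows = genes, columns = replicate arrays; returns (t statistics, p values)."""
    t, p = stats.ttest_1samp(np.asarray(log_ratios, float), 0.0, axis=1)
    return t, p
```

A gene whose replicate log ratios cluster tightly around 1 (a consistent two-fold induction) yields a small p value, while a gene whose ratios scatter around 0 does not.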
More sophisticated ANOVA models that analyze the variance across an entire experiment at once have also been suggested (Kerr et al., 2000). One particular statistical method, Significance Analysis of Microarrays (SAM), was devised by Tusher et al. (2001). This popular method for finding genes exhibiting statistically significant differences in expression involves comparing mean differences and standard deviations in defined groups to those obtained when the same data are randomly permuted.

Microarray experiments typically result in expression information for thousands of genes. When performing univariate tests for every gene in an experiment, inflated experiment-wise error rates and false positives become issues that need to be addressed. In this situation, it is generally a good idea to apply some sort of multiple testing correction (MTC). A method for controlling the false discovery rate, such as Benjamini-Hochberg, represents a reasonable approach to balancing the yield of false positives and false negatives (Benjamini and Hochberg, 1995).

Error models

In cases where replicate measurements are not available, standard statistical formulas for variance and standard error cannot be used. It is still possible to estimate variances under these circumstances through the technique of pooling residuals from many different genes. It has been observed that gene expression variability is a function of the "normal" expression level. This quantity can be measured using control samples, or through normalization procedures similar to those detailed above (see Data Normalization). Because of this dependence, the pooling of error information is usually done locally. Another approach is to apply a variance-stabilizing transformation to the entire sample or experiment.
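The Benjamini-Hochberg false discovery rate procedure mentioned above is simple enough to implement directly: rank the p values, scale each by (number of tests)/(rank), and enforce monotonicity. A minimal sketch (function name is my own):

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values. Genes whose adjusted value
    falls below the chosen rate (e.g., 0.05) are reported as
    differentially expressed, controlling the false discovery rate."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adjusted = np.empty(n)
    adjusted[order] = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, working back from the largest p value
    adjusted[order] = np.minimum.accumulate(adjusted[order][::-1])[::-1]
    return np.clip(adjusted, 0.0, 1.0)
```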
Once such a variance-stabilizing transformation has been applied, all genes can be assumed to have similar variance, and thus all measurements in a given sample can be used to compute a common variance (Durbin et al., 2002). These techniques can also be used in cases where very few replicates are available, where they can lead to more reliable estimates of error. In all cases, true replicate measurements are the best source of information about error and variance.

Volcano Plots

A popular graphic for displaying results from a pairwise statistical comparison between conditions is the volcano plot. By plotting the p value from a statistical test, such as the two-sample t test, against the average fold change, usually with both axes on logarithmic scales, the volcano plot summarizes analysis results by the two desired criteria: statistical significance and fold change. This technique provides an easy way for users to identify differentially expressed genes that are both statistically significant and exhibit differences large enough to be biologically meaningful in the mind of the user. Table 7.1.3 provides a non-exhaustive list of statistical features offered by microarray data analysis software providers.

Identifying Expression of Splice Variants

Exon arrays attempt to include sets of features that can differentiate between specific transcripts where a gene is known to have multiple splice variants. In this case, additional analysis is required to determine whether differential expression takes the form of expression of different splice variants. Software such as Exon Array Computation Tool (Affymetrix), ChipInspector Exon Array (Genomatix), and Partek Genomics Suite (Partek) offer exon array analysis to investigate this specific case of differential expression.
Table 7.1.3 Software Solutions Offering Statistical Analysis Features for Microarray Data

Software (manufacturer or citation)                          Statistical features(a)
ArrayAssist (Stratagene)                                     ANOVA, MTC, t tests, volcano plots
ArrayStat (Imaging Research)                                 ANOVA, MTC, EM, power analysis, t tests
Avadis (Strand Genomics)                                     ANOVA, paired t tests, volcano plots
Bioconductor (Gentleman et al., 2004)                        ANOVA, MTC, power analysis, SAM, t tests
ChipInspector (Genomatix)                                    SAM-like t test
d-Chip (Li and Wong, 2001)                                   ANOVA, MTC, t tests
Expressionist (Genedata)                                     ANOVA, t tests
GeneLinker Gold/Platinum (Improved Outcomes Software)        ANOVA, MTC, t tests
GeneSifter (VizX Labs)                                       ANOVA, MTC, t tests
GeneSight (BioDiscovery)                                     ANOVA, MTC, t tests
GeneSpring (Agilent)                                         ANOVA, EM, MTC, t tests, volcano plots
J-Express Pro (Molmine)                                      ANOVA, SAM
Partek Genomics Suite (Partek)                               ANOVA, MTC, t tests, volcano plots
Rosetta Resolver System (Rosetta Biosoftware)                ANOVA, EM, MTC, t tests
S+ArrayAnalyzer (Insightful)                                 ANOVA, EM, MTC, paired t test, volcano plots
SAM (Tusher et al., 2001)                                    SAM
SAS Microarray (SAS)                                         ANOVA, MTC, power analysis, t tests
Spotfire's DecisionSite for Microarray Analysis (Spotfire)   ANOVA, t tests
TeraGenomics (IMC)                                           t tests
TM4 (Saeed et al., 2003)                                     ANOVA, SAM

(a) Abbreviations: ANOVA, analysis of variance; EM, error model(s) to estimate variance; MTC, multiple testing corrections; SAM, significance analysis of microarrays.

Other techniques involve using arrays with probes that represent both exons and exon-exon junctions. These techniques have the advantage that they can unambiguously identify the presence of exon-skipping events and other post-transcriptional modifications (Fehlbaum et al., 2005). The Blencowe and Frey groups have developed the GenASAP algorithm to work with data generated from similar arrays, and are able to identify the frequency of excluded exons with remarkably high fidelity (Pan et al., 2004).
Clustering

Clustering is a generic name applied to the idea of grouping genes, usually based upon expression profiles. The general idea is that genes with similar expression profiles are likely to have a similar function or share other properties. To do this, the concept of "similarity" of expression profiles needs to be defined; the objective is to define a function that scores the similarity of the expression patterns of two genes. There are various ways to do this; the most common are distance formulas and various correlations.

The simplest method for finding similar genes is to compare the expression pattern for a single gene against all the other genes in the experiment. This finds genes that have an expression profile similar to the gene of interest; ideally, the similar genes will prove to be related in some way. Often, the goal is instead to find distinct groups of genes that share a similar pattern. When one has no idea what to look for in advance, all the genes can be divided up according to how similar they are to each other. There are many clustering algorithms; two of the most common are k-means and self-organizing maps. In both algorithms, the number of groups desired is roughly specified, and the genes are divided into approximately that number of distinct expression patterns. These algorithms are computationally intensive and are typically performed using software (e.g., UNIT 7.3).

Another common method for clustering expression data is called "hierarchical clustering" or "tree building." When a phylogenetic tree (see Chapter 6) is constructed, organisms with similar properties are clustered together. A similar structure can be used to make a tree of genes, such that genes with similar expression patterns are grouped together; the more similar the expression patterns, the lower in the tree those genes are joined.
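The two ingredients described above, a correlation-based similarity score and the tree-building step, can be combined in a short sketch. Here 1 minus the Pearson correlation serves as the distance, with average linkage; the function name and the choice of linkage are my own (other linkage methods are equally valid and, as noted below, give nonidentical results).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_genes(expression, n_groups):
    """Hierarchical clustering of genes by expression profile.
    Distance = 1 - Pearson correlation; average linkage.
    Rows = genes, columns = conditions; returns a group label per gene."""
    corr = np.corrcoef(np.asarray(expression, float))
    dist = 1.0 - corr                       # 0 for identical profiles, 2 for opposite
    iu = np.triu_indices_from(dist, k=1)    # condensed form expected by linkage
    tree = linkage(dist[iu], method="average")
    return fcluster(tree, t=n_groups, criterion="maxclust")
```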
A similar tree can be made for experiments or samples, where experiments or samples that affect genes in similar ways can be clustered together. This technique has an advantage over the abovementioned methods in that the number of groups does not need to be specified in advance; groups of genes can then be extracted as branches of the tree. Table 7.1.4 lists software solutions according to whether each of the three most common clustering methods for microarray gene clustering is offered; multiple other equally valid clustering algorithms are also offered. Note, too, that even when the same clustering algorithm is provided, parameters such as the linkage method in the case of hierarchical clustering can vary, leading to nonidentical results from a common data input.

Principal Components Analysis

Principal Components Analysis (PCA) attempts to reduce the dimensionality of high-dimension microarray data by finding the dominant trends among genes or samples and expressing each gene or sample as a sum of a small number of profiles. PCA is more commonly performed on samples than genes, to assess whether samples of a common class or treatment cluster together when plotted according to the first two or three principal components. PCA may also help in identifying suspect samples that do not group with their biological replicate cohorts. Table 7.1.4 summarizes software solutions according to the availability of common clustering algorithms and principal components analysis.

Classification

Classification of tumors and other tissues is another potentially useful application of gene-expression data. These techniques can be used to find genes that are good predictors for cancers and other conditions, to verify tissue classifications obtained by other means, and for diagnostic purposes.
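The PCA projection of samples described above amounts to centering each gene and taking the leading directions of a singular value decomposition. A minimal sketch (function name and matrix layout are my own):

```python
import numpy as np

def principal_components(samples, n_components=2):
    """Project samples onto their first principal components.
    Rows = samples, columns = genes. Each gene is centered across samples,
    and the SVD supplies the dominant trends; replicates of the same
    treatment should fall near one another in the resulting plot."""
    x = np.asarray(samples, float)
    x = x - x.mean(axis=0)                         # center each gene
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # sample coordinates
```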
The clustering techniques mentioned previously (see Clustering) can be applied to samples instead of genes; when used in this context, they are examples of unsupervised learning, i.e., the identification of new or unknown classes using gene-expression profiles. The term supervised learning refers to the classification of samples or tissues into known classes. In this setting, a set of tissue samples whose classification is previously known, e.g., cancerous tumors, is analyzed in a microarray experiment. The resulting gene-expression data can then be used to classify or predict the class of new samples based on their gene-expression levels.

Table 7.1.4 Software Solutions Offering Tools for Clustering and PCA(a)

Software (manufacturer or citation)                          HC   k means   PCA   SOMs
Acuity (Molecular Devices)                                   X    X         X     X
ArrayAssist (Stratagene)                                     X    X         X     X
Avadis (Strand Genomics)                                     X    X         X     X
Bioconductor (Gentleman et al., 2004)                        X    X         X
d-Chip (Li and Wong, 2001)                                   X
Expressionist (Genedata)                                     X    X         X     X
GeneLinker Gold (Improved Outcomes Software)                 X    X         X     X
GeneSifter (VizX Labs)                                       X    X         X
GeneSight (BioDiscovery)                                     X    X         X     X
GeneSpring (Agilent)                                         X    X         X     X
Genowiz (Ocimum Biosolutions)                                X    X         X     X
J-Express Pro (Molmine)                                      X    X         X     X
Partek Genomics Suite (Partek)                               X    X         X
Rosetta Resolver System (Rosetta Biosoftware)                X    X         X     X
S+ArrayAnalyzer (Insightful)                                 X              X
SAS Microarray (SAS)                                         X              X
Spotfire's DecisionSite for Microarray Analysis (Spotfire)   X    X         X     X
TM4 (Saeed et al., 2003)                                     X    X         X     X

(a) Tools are tabulated for the three most common clustering methods; some software solutions contain additional clustering algorithm options. Hierarchical clustering (abbreviated HC) is most often available for both genes and samples, while k means and self-organizing maps (SOMs) are usually only available for genes. PCA may be provided for genes and/or samples.
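Supervised class prediction as just described can be reduced to its simplest form, a single-nearest-neighbor rule: assign a new sample the class of the known sample with the most similar expression profile. This sketch (function name is my own, and real studies use more robust variants over a selected set of predictor genes) illustrates the idea:

```python
import numpy as np

def nearest_neighbor_predict(train_profiles, train_classes, new_profile):
    """Single-nearest-neighbor class prediction: assign the new sample
    the class of the known sample with the smallest Euclidean distance
    over the chosen predictor genes."""
    train = np.asarray(train_profiles, float)
    distances = np.linalg.norm(train - np.asarray(new_profile, float), axis=1)
    return train_classes[int(np.argmin(distances))]
```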
There are a number of statistical techniques and algorithms that can be applied to perform class prediction. They include various types of discriminant analyses, nearest-neighbor techniques, classification trees, and others that fall under the general heading of machine learning (Dudoit et al., 2000). In all of these techniques, the basic steps followed are similar. First, predictor genes are chosen based on their expression profiles in the samples with known classification. These tend to be genes whose expression profiles differ significantly between the classes of interest, and thus are good discriminators between those classes. Next, the expression profiles for these genes in the samples of unknown classification are examined. This information is then used to place the new or unknown samples into the appropriate classes. If the set of samples being classified has already been classified by alternative clinical methods, this result can be used as a validation or verification of those methods. For samples not yet classified, this information is potentially valuable for diagnostic purposes. The following software packages contain one or more methods for class prediction: ArrayAssist (Stratagene), Avadis (Strand Genomics), PAM (Tibshirani et al., 2002), dChip (Li and Wong, 2001), Expressionist (GeneData), GeneLinker Platinum (Improved Outcomes), GeneSpring (Agilent Technologies), Genowiz (Ocimum Biosolutions), Partek Genomics Suite (Partek), Rosetta Resolver (Rosetta Inpharmatics), S+ArrayAnalyzer (Insightful), SAS Microarray (SAS), and TM4 (Saeed et al., 2003).

Sequence Analysis

Genes with similar expression profiles may be regulated by common transcription factors. For organisms whose genomes are completely sequenced and mapped (e.g., S. cerevisiae and C.
elegans), high-throughput computation now enables the search for candidate DNA binding sites in the upstream regions of genes clustered by expression-profile similarity (Wolfsberg et al., 1999). Software solutions offering some form of sequence analysis include AlignACE (Hughes et al., 2000), Gene2Promoter (Genomatix), GeneSpring (Agilent Technologies), MotifSampler (Thijs et al., 2002), and TOUCAN2 (Stein et al., 2005).

Pathway and Ontology Analysis
Once genes are identified as strong candidates for differential expression across the conditions of interest, the results need to be interpreted in the context of known biology to extend molecular data to an understanding of higher-level biological effects. Comparing the list of gene profiles of interest against previously assembled lists of genes grouped by function, pathway of action, or cellular localization can provide useful insights. Facilitating the effort, the Gene Ontology (GO) Consortium (UNIT 7.2) has established standard hierarchical classifications for genes grouped by biological process, cellular localization, or molecular function, with a fixed and controlled vocabulary for class names (Ashburner et al., 2000). Furthermore, the group has embarked on gene curation efforts to assign genes to the defined classes. Investigators can now mine NCBI's LocusLink (UNIT 1.3) for gene-classification information and effectively set up classifications based on GO annotations.

Table 7.1.5 Comprehensive Microarray Data Analysis Solutions Offering Gene Ontology and/or Pathway Analysis

Software                                              Manufacturer or citation
ArrayAssist                                           Stratagene
d-Chip                                                Li and Wong (2001)
GeneSifter                                            VizX Labs
Genowiz                                               Ocimum Biosolutions
Rosetta Resolver System                               Rosetta Biosoftware
Avadis                                                Strand Genomics
Expressionist                                         Genedata
GeneSpring                                            Agilent Technologies
J-Express Pro                                         Molmine
Spotfire's Decision Site for Microarray Analysis      Spotfire
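The comparison of an interesting gene list against a predefined functional category is commonly scored with a hypergeometric tail probability: the chance of observing at least the seen overlap if genes were drawn at random from the assayed universe. A minimal sketch, with all gene sets invented for illustration:

```python
# Hypergeometric tail probability for gene-set overlap: the chance
# that n randomly chosen genes out of a universe of N include at
# least k members of a category of size K. Gene sets are invented.

from math import comb

def enrichment_p(N, K, n, k):
    """P(overlap >= k) under sampling without replacement."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

universe    = {f"g{i}" for i in range(1, 21)}   # 20 assayed genes
go_category = {"g1", "g2", "g3", "g4", "g5"}    # invented category, K = 5
hits        = {"g1", "g2", "g3", "g10"}         # interesting list, n = 4

k = len(hits & go_category)                     # observed overlap
p = enrichment_p(len(universe), len(go_category), len(hits), k)
```

With 3 of 4 interesting genes falling in a 5-gene category out of 20, p is about 0.032, small enough to flag the category as potentially enriched; real tools apply the same calculation across many categories and correct for multiple testing.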
Table 7.1.5 lists comprehensive microarray data analysis solutions that offer gene ontology analysis. An extensive list of the many other dedicated tools for Gene Ontology analysis can be found on the Web site of the Gene Ontology Consortium at http://www.geneontology.org/GO.tools.shtml. Several online pathway databases have also come to the fore, specifically, the Kyoto Encyclopedia of Genes and Genomes (KEGG; Kanehisa et al., 2002; also see Internet Resources), Biocarta (see Internet Resources), and GenMAPP, which includes analysis software (Dahlquist et al., 2002). Finally, statistics relating the expected probability of overlap to the observed overlap between gene sets can further be brought to bear to examine the significance of potential relationships. See Table 7.1.6 for a list of software solutions that use data from these resources, their own manually curated data, and/or results from natural-language-processing (NLP)-based algorithms to derive gene-interaction information that can be used to make sense of the results from microarray statistical analyses.

Table 7.1.6 Dedicated Pathway-Analysis Software and Resources for Microarray Data Analysis

Software                          Manufacturer or citation
BiblioSphere PathwayEdition       Genomatix
GenMAPP                           Dahlquist et al. (2002)
PathwayArchitect                  Stratagene
Cytoscape                         Shannon et al. (2003)
Ingenuity Pathway Analysis        Ingenuity
PathwayStudio                     Ariadne Genomics

INFORMATICS AND DATABASES
The primary challenges of archiving and retrieving gene-expression data are a result of the speed at which such data can be generated. The cost of performing array-based expression experiments has dropped significantly in the past few years, so that even a medium-sized microarray facility can produce data from hundreds of arrays each month. For each of these arrays, there exist clinical and experimental parameters that are invaluable for interpreting the resulting expression data.
In such an environment, it becomes necessary to be able to query the data based on these experimental parameters as well as on actual measurements of gene expression. A typical query might ask to find all of the genes that are significantly up-regulated in any sample treated with compound X. Such a question is nontrivial because it asks both a statistical question (significantly up-regulated) and a historical question (sample treatment). In general, there are different types of tools to answer these two different types of questions. The ability to integrate the archival tools with the analysis tools is key to building a truly useful informatics system.

When performing many experiments, especially in a large organization, a laboratory information management system (LIMS) database is useful to keep track of who has done what to which experiments, among other useful information. This database is often custom built to fit the workflow of a particular laboratory, but several commercial suppliers offer preconfigured LIMS databases. In a high-throughput work environment, a LIMS can help by tracking a sample from its creation/isolation through to the data collected from a hybridized microarray. Often these data are useful for quality control, e.g., for tracking contaminated reagents. LIMS systems are often connected directly to array scanners so that sample annotation, data collection, and subsequent data analysis are directed from a single platform.

Archiving Data
In any database for microarray data, the actual results from microarray experiments should be stored in association with parameters that describe the experiments (i.e., the differences between the experiments). In addition, it is necessary to archive the results of statistical analyses (e.g., lists of genes with interesting behaviors across specific experimental parameters).
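The "compound X" query described above, combining a historical parameter with a statistical result, can be sketched as a single SQL join. The two-table schema below (sample annotation in one table, per-gene analysis results in another) is entirely hypothetical and does not reflect any commercial LIMS; it is shown in SQLite via Python's standard sqlite3 module.

```python
# A sketch of an integrated archival query. The schema, table names,
# column names, and data are invented for illustration only.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE samples (sample_id TEXT PRIMARY KEY, treatment TEXT);
    CREATE TABLE results (sample_id TEXT, gene TEXT,
                          log_ratio REAL, p_value REAL);
""")
db.executemany("INSERT INTO samples VALUES (?, ?)",
               [("s1", "compound_X"), ("s2", "vehicle")])
db.executemany("INSERT INTO results VALUES (?, ?, ?, ?)",
               [("s1", "geneA", 2.3, 0.001),   # up-regulated, significant
                ("s1", "geneB", 0.1, 0.800),   # unchanged
                ("s2", "geneA", 0.0, 0.950)])  # treated with vehicle only

# "Find all genes significantly up-regulated in any sample treated
# with compound X": the join answers the historical part (treatment)
# and the statistical part (log-ratio, p-value) in one statement.
rows = db.execute("""
    SELECT DISTINCT r.gene
    FROM results r JOIN samples s ON r.sample_id = s.sample_id
    WHERE s.treatment = 'compound_X'
      AND r.log_ratio > 1 AND r.p_value < 0.05
""").fetchall()
```

Only geneA satisfies both conditions. The point of the sketch is the join itself: an archive that stores experimental parameters and analysis results in separate, linked tables can answer mixed historical/statistical questions that neither flat files nor a results-only store can.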
There are two common techniques for this, as described in the following sections (also see UNIT 9.1).

Text
The data may be stored as text files produced by the image-analysis software. This is an inexpensive, fast, and convenient method for individual scientists; however, it can make working in groups difficult and can increase the chances of losing historical data. Storing data in text files lacks the facility of a LIMS to provide a detailed description of the experiment associated with the data. For the laboratory that is confined to using flat-file data storage, archiving can be improved by using a document management system like Pharmatrix Base4 or a flat-file data repository like GeneSpring WorkGroup from Agilent Technologies (see Internet Resources).

Access via Structured Query Language (SQL)
Data may be stored in an SQL-compliant database (UNIT 9.2), preferably associated with the LIMS for tracking production if one exists. A variety of analysis tools can then extract data from the database. This tends to be slow and expensive, but it makes backing up and archiving more reliable. The AADM database from Affymetrix is such a database, and it integrates with the Affymetrix LIMS system. If parameters of the experiments are stored in this database, then they can be retrieved automatically by a number of data-analysis packages. However, such databases provide little functionality for storing the results of statistical analyses. A dedicated enterprise-level expression repository like Agilent Technologies' GeneSpring WorkGroup can both store raw expression data and display the results of statistical analyses (e.g., hierarchical clustering dendrograms) in a single package.

Making Data Globally Accessible
Many people want to publish data on the Web, and a growing number of journals mandate that expression-analysis data included in published papers be made available to all readers on the Internet.
There are several methods for making such results globally accessible, as described in the following sections.

FTP
Raw data files can be placed on an FTP server. This method is simple for the experimenter but hard for others to use, as it requires a detailed description of the data structure for the data to be useful.

Public databases
The NCBI and EBI have created similar public databases (see Internet Resources). These solutions make the data available to anyone on the Web, and so are reasonable for academics. In addition, Agilent's GeneSpring WorkGroup provides users with a Web-accessible repository that can be placed outside the firewall of a particular institution, so that guest users can access selected data via a Web browser.

CONCLUSION
Over the past few years, the hardware and technologies underlying microarray experiments have become more readily available to scientists interested in working with gene-expression data, and have matured to the point at which data acquisition and quality control are no longer the limiting factors. The focus is increasingly on analysis and interpretation, along with data management, storage, and accessibility. Contributions from statistics and computational biology have led to the availability of a wide variety of models and analyses for scientists working with microarray data. At the same time, the ongoing development of specialized software solutions that combine all these aspects of the microarray experimental process is allowing scientists to investigate the basic biological questions that this technology was designed to address.

LITERATURE CITED
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., and Sherlock, G. 2000.
Gene Ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 25:25-29.
Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B 57:289-300.
Bolstad, B.M. 2006. RMAExpress. URL: http://rmaexpress.bmbolstad.com/.
Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht, G., Vermaas, E., Williams, S.R., Moon, K., Burcham, T., Pallas, M., DuBridge, R.B., Kirchner, J., Fearon, K., Mao, J., and Corcoran, K. 2000. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18:630-634.
Cope, L.M., Irizarry, R.A., Jaffee, H.A., Wu, Z., and Speed, T.P. 2004. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 20:323-331.
Dahlquist, K.D., Salomonis, N., Vranizan, K., Lawlor, S.C., and Conklin, B.R. 2002. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat. Genet. 31:19-20.
Dudoit, S., Fridlyand, J., and Speed, T. 2000. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Tech. Rep. 576, Dept. of Statistics, University of California, Berkeley.
Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. 2002. A variance-stabilizing transformation for gene expression microarray data. Bioinformatics 18:S105-S110.
Fehlbaum, P., Guihal, C., Bracco, L., and Cochet, O. 2005. A microarray configuration to quantify expression levels and relative abundance of splice variants. Nucleic Acids Res. 33:e47.
GeneLogic. 2002. Datasets. http://www.genelogic.com/newsroom/studies/index.cfm.
Gentleman, R.C., Carey, V.J., Bates, D.J., Bolstad, B.M., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A.J., Sawitzki, G., Smith, C., Smyth, G.K., Tierney, L., Yang, Y.H., and Zhang, J. 2004. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 5:R80.
Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296:1205-1214.
Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., and Speed, T.P. 2003. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31:e15.
Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30:42-46.
Kerr, M.K. and Churchill, G.A. 2001. Statistical design and the analysis of gene expression microarrays. Genet. Res. 77:123-128.
Kerr, M.K., Martin, M., and Churchill, G.A. 2000. Analysis of variance for gene expression microarray data. J. Comput. Biol. 7:819-837.
Li, C. and Wong, W. 2001. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. U.S.A. 98:31-36.
Lipshutz, R., Fodor, S., Gingeras, T., and Lockhart, D. 1999. High density synthetic oligonucleotide arrays. Nat. Genet. 21:20-24.
Pan, Q., Shai, O., Misquitta, C., Zhang, W., Saltzman, A.L., Mohammad, N., Babak, T., Siu, H., Hughes, T.R., Morris, Q.D., Frey, B.J., and Blencowe, B.J. 2004. Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Mol. Cell 16:929-941.
Saeed, A.I., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., Klapa, M., Currier, T., Thiagarajan, M., Sturn, A., Snuffin, M., Rezantsev, A., Popov, D., Ryltsov, A., Kostukovich, E., Borisovsky, I., Liu, Z., Vinsavich, A., Trush, V., and Quackenbush, J. 2003. TM4: A free, open-source system for microarray data management and analysis. Biotechniques 34:374-378.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467-470.
Shalon, D., Smith, S.J., and Brown, P.O. 1996. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Res. 6:639-645.
Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., and Ideker, T. 2003. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498-2504.
Stein, A., Van Loo, P., Thijs, G., Mayer, H., de Martin, R., Moreau, Y., and De Moor, B. 2005. TOUCAN2: The all-inclusive open source workbench for regulatory sequence analysis. Nucleic Acids Res. 33:W393-W396.
Thijs, G., Marchal, K., Lescot, M., Rombauts, S., De Moor, B., Rouze, P., and Moreau, Y. 2002. A Gibbs sampling method to detect overrepresented motifs in upstream regions of coexpressed genes. J. Comput. Biol. 9:447-464.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. U.S.A. 99:6567-6572.
Wolfsberg, T.G., Gabrielian, A.E., Campbell, M.J., Cho, R.J., Spouge, J.L., and Landsman, D. 1999. Candidate regulatory sequence elements for cell cycle–dependent transcription in Saccharomyces cerevisiae. Genome Res. 9:775-792.
Wu, Z., LeBlanc, R., and Irizarry, R.A. 2004.
Stochastic Models Based on Molecular Hybridization Theory for Short Oligonucleotide Microarrays. Technical report, Johns Hopkins University, Dept. of Biostatistics Working Papers.
Tusher, V.G., Tibshirani, R., and Chu, G. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 98:5116-5121.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., and Speed, T.P. 2002. Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30:e15.

INTERNET RESOURCES
http://www.ncbi.nlm.nih.gov/geo/
The Gene Expression Omnibus (GEO) is a public database of expression data derived from a number of different expression analysis technologies.
http://www.ebi.ac.uk/arrayexpress/
ArrayExpress is a public repository for gene expression data, focused on providing a rich source of experimental background for each experiment set.
http://www.biocarta.com/genes/index.asp
Web site for Biocarta Pathways: interactive graphic models of molecular and cellular pathways.
http://www.genome.ad.jp/kegg/
Kyoto Encyclopedia of Genes and Genomes.

Contributed by Anoop Grewal and Peter Lambert
NextBio, Cupertino, California
Jordan Stockton
Agilent Technologies, Santa Clara, California

The Gene Ontology (GO) Project: Structured Vocabularies for Molecular Biology and Their Application to Genome and Expression Analysis
UNIT 7.2
Judith A. Blake, The Jackson Laboratory, Bar Harbor, Maine
Midori A. Harris, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, United Kingdom

ABSTRACT
Scientists wishing to utilize genomic data have quickly come to realize the benefit of standardizing descriptions of experimental procedures and results for computer-driven information retrieval systems. The focus of the Gene Ontology project is three-fold.
First, the project goal is to compile the Gene Ontologies: structured vocabularies describing domains of molecular biology. Second, the project supports the use of these structured vocabularies in the annotation of gene products. Third, the gene product-to-GO annotation sets are provided by participating groups to the public through open access to the GO database and Web resource. This unit describes the current ontologies and what is beyond the scope of the Gene Ontology project. It addresses the issue of how GO vocabularies are constructed and related to genes and gene products. It concludes with a discussion of how researchers can access, browse, and utilize the GO project in the course of their own research. © 2008 by John Wiley & Sons, Inc. Curr. Protoc. Bioinform. 23:7.2.1-7.2.9.
Keywords: Gene Ontology; functional annotation; bio-ontology

INTRODUCTION
With the age of whole genome analysis, systems biology, and modeling of whole cells upon us, scientists continue to work towards the integration of vast amounts of biological information. The goal, of course, is not the integration itself, but the ability to traverse this information space in the quest for knowledge. We want to construct knowledge systems so that we can infer new knowledge from existing and emerging information. With technological advances permitting expression analysis for tens of thousands of genes at a time, researchers seek clarity in finding and validating information. Recently, much interest has focused on the semantics used by information systems to report on biological knowledge, such as molecular function, or the parameters of experimental systems, such as with microarray experiments. The problem has been the multiplicity of ways that the same phenomena can be described in the literature or in database annotations.
While it is difficult to persuade laboratory scientists to employ standardized descriptions of experimental procedures and results in their publications, those wishing to utilize genomic data have quickly come to realize the significance and utility of such standards to computer-driven information retrieval systems.

Published online September 2008 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/0471250953.bi0702s23. Copyright © 2008 John Wiley & Sons, Inc. All rights reserved.

WHAT ARE ONTOLOGIES AND WHY DO WE NEED THEM?
Ontologies, in one sense used today in the fields of computer science and bioinformatics, are "specifications of a relational vocabulary" (Gruber, 1993; http://www-ksl.stanford.edu/kst/what-is-an-ontology.html). Simply put, ontologies are vocabularies of terms used in a specific domain, definitions for those terms, and defined relationships between the terms. Ontologies provide a vocabulary for representing and communicating knowledge about some topic, and a set of relationships that hold among the terms of the vocabulary. They can be structurally very complex or relatively simple. There is a rich field of study in ontologies in computer science and philosophy (Schulze-Kremer, 1998; Jones and Paton, 1999). Most importantly, ontologies capture domain knowledge in a computationally accessible form (Stevens et al., 2000). Because the terms in an ontology and the relationships between the terms are carefully defined, the use of ontologies facilitates making standard annotations, improves computational queries, and can support the construction of inference statements from the information at hand.

Ontology-Based Enhancement of Bioinformatics Resources
Bioinformatics systems have long employed keyword sets to group and query information.
Journals typically provide keywords, which subsequently permit indexing of the published articles. Hierarchical classifications (e.g., taxonomies, Enzyme Commission classification) have been used extensively in biology, and molecular function classifications started to appear with the work of Monica Riley in the early 1990s (Riley, 1993, 1998; Karp et al., 1999). The Unified Medical Language System (UMLS) incorporates multiple vocabularies in the area of medical informatics. In recent years, bio-information providers have increasingly focused on the development of bio-ontologies for the capture and sharing of data (Baker et al., 1999; Stevens et al., 2000; Sklyar, 2001). Bio-ontologies support a shared understanding of biological information. The development of these ontologies has paralleled the technological advances in data generation. Genomic sequencing projects and microarray experiments alike produce electronically generated data flows that require computer-accessible systems to work with the information. As systems that make domain knowledge available to both humans and computers, bio-ontologies are essential to the process of extracting biological insight from enormous sets of data.

WHAT IS THE GENE ONTOLOGY (GO) CONSORTIUM?
The Gene Ontology Consortium includes people from many of the model organism database groups and from other bioinformatics research groups who have joined together to build the GOs and to support their use. The GOs, annotations to GO, and tools to support the use of GO are in the public domain. Information, documentation, and access to various components of GO can be found at the GO Web site (http://www.geneontology.org) or in supporting publications (The Gene Ontology Consortium, 2000, 2001, 2008).

WHAT ARE THE OBJECTIVES OF THE GO PROJECT?
The focus of the GO project is three-fold. The first project goal is to compile and provide the GOs: structured vocabularies describing domains of molecular biology.
The three domains under development were chosen as ones that are shared by all organisms: Molecular Function, Biological Process, and Cellular Component. These domains are further described below. A later ontology developed by the GO Consortium is the Sequence Ontology (Eilbeck et al., 2005). Second, the project supports the use of these structured vocabularies in the annotation of gene products. Gene products are associated with the most precise GO term supported by the experimental evidence. Structured vocabularies are hierarchical, allowing both attributions and queries to be made at different levels of specificity. Third, the gene product-to-GO annotation sets are provided by participating groups to the public through open access to the GO database and Web resource. Thus, the community can access standardized annotations of gene products across multiple species and resources. The GO Consortium supports the development of GO tools to query and modify the vocabularies, to provide community access to the annotation sets, and to support data exploration. WHAT ARE THE CURRENT ONTOLOGIES SUPPORTED BY THE GO PROJECT? The current ontologies of the GO project are Molecular Function, Biological Process, and Cellular Component. These three areas are considered orthogonal to each other, i.e., they are treated as independent domains. The ontologies are developed to include all terms falling into these domains without consideration of whether the biological attribute is restricted to certain taxonomic groups. Therefore, biological processes that occur only in plants (e.g., photosynthesis) or mammals (e.g., lactation) are included. Molecular Function Molecular Function refers to the elemental activity or task performed, or potentially performed, by individual gene products. Enzymatic activities such as “nuclease,” as well as structural activities such as “structural constituent of chromatin” are included in Molecular Function. 
An example of a broad functional term is "transporter activity" (enabling the directed movement of substances, such as macromolecules, small molecules, and ions, into, out of, or within a cell). An example of a more detailed functional term is "protein-glutamine gamma-glutamyltransferase activity," which cross-links adjacent polypeptide chains by the formation of the N6-(L-isoglutamyl)-L-lysine isopeptide; the gamma-carboxamide groups of peptide-bound glutamine residues act as acyl donors, and the 6-amino groups of peptidyl- and peptide-bound lysine residues act as acceptors, to give intra- and inter-molecular N6-(5-glutamyl)lysine cross-links.

Biological Process
Biological Process refers to the broad biological objective or goal in which a gene product participates. Biological Process includes the areas of development, cell communication, physiological processes, and behavior. An example of a broad process term is "mitosis" (the division of the eukaryotic cell nucleus to produce two daughter nuclei that, usually, contain the identical chromosome complement to their mother). An example of a more detailed process term is "calcium-dependent cell-matrix adhesion" (the binding of a cell to the extracellular matrix via adhesion molecules that require the presence of calcium for the interaction).

Cellular Component
Cellular Component refers to the location of action for a gene product. This location may be a structural component of a cell, such as the nucleus. It can also refer to a location as part of a molecular complex, such as the ribosome.

WHY DOES THE GO PROJECT REFER TO GENE PRODUCTS?
GO vocabularies are built to support annotation of particular attributes of gene products. Gene products are physical things, and may be transcripts, proteins, or RNAs. The term "gene product" covers the suite of biological (physical) objects that are being associated with GO terms.
Gene products may be polypeptides that associate into complex entities, or "gene product groups." These gene product groups may be relatively simple, e.g., a heterodimeric enzyme, or very complex assemblies of a variety of different gene products, e.g., a ribosome. In addition, in most of the model organism database systems, the biological object being annotated is a loosely defined "gene" object with the potential of producing a protein or other molecule that could engage in a molecular function or be located in or at a particular cellular component. The use of the term "gene product" encompasses all these physical objects. Further development of biological databases and information systems will support more precise descriptions of gene products, such as splice variants or modified proteins. GO vocabularies can be used to assign attributes to any of them.

WHAT IS BEYOND THE SCOPE OF THE GO PROJECT?
Almost as important as understanding the scope of the GO project is understanding what the GO project is not. The most common misapprehensions are (1) that GO is a system for naming genes and proteins and (2) that GO attempts to describe all of biology. GO names neither genes nor gene products, and it does not attempt to provide structured vocabularies beyond the three domains described above.

GO is Not a Nomenclature for Genes or Gene Products
The vocabularies describe molecular phenomena, not biological objects (e.g., proteins or genes). Sharing gene product names would entail tracking evolutionary histories and reflecting both orthologous and paralogous relationships between gene products. Different research communities have different naming conventions. Different organisms have different numbers of members in gene families. The GO project focuses on the development of vocabularies to describe attributes of biological objects, not on the naming of the objects themselves.
This point is particularly important to understand because many genes and gene products are named for their function. For example, enzymes are often named for their function; the protein DNA helicase is a physical object that exerts the function "DNA helicase activity," a term in the GO molecular function ontology (GO:0003678).

GO is Neither a Dictated Standard Nor a Means to Unify Biological Databases
The members of the GO Consortium have chosen to work cooperatively to define and implement the GO system in their databases. However, the commitment is to the development of GO, the use of a common syntax for constructing GO annotation datasets, and the support of tools and the GO database for community access to GO and GO association files. Model organism databases and others using GO do so within the context of their own informatics systems. While GO was not developed to unify biological databases, it is true that the more GO is used in annotation systems, the easier it will be to navigate bioinformation space and to harness the power and potential of computers and computational systems.

GO Does Not Define Evolutionary Relationships
Shared annotation of gene products to GO terms reflects shared association with a defined molecular phenomenon. Multiple biological objects (proteins) can share a function, cellular location, or involvement in a larger biological process without being evolutionarily related in the sense of shared ancestry. That said, many proteins that share molecular function attributes, in particular, do share ancestry. However, the property of shared ancestry is separate from the property of function assignment and is not reflected explicitly in GO associations to gene products.

Other Ontologies Under Development Complement GO
GO vocabularies do not describe attributes of sequence such as intron/exon parameters, protein domains, or structural features.
They do not model protein-protein interactions. They do not describe mutant or disease phenotypes. There are efforts under way to develop ontologies for each of these domains. The GO Consortium has played a leading role in the Open Biomedical Ontologies (OBO) effort (http://obofoundry.org/), which aims to support the development of ontologies in the biomedical domain, with particular emphasis on a core set of interoperable ontologies, the "OBO Foundry," that meet a number of inclusion criteria. The OBO Foundry requirements, detailed on the Web (http://www.obofoundry.org/crit.shtml), include that the ontology be orthogonal to existing ontologies, that the terms and relationships be defined, that they be publicly available, and that they be structured in a common syntax, such as OWL or OBO format.

The Gene Ontology (GO) Project

HOW ARE GO VOCABULARIES CONSTRUCTED?
GO vocabularies are updated and modified on a regular basis. A small number of GO curators are empowered to make additions to and deletions from GO. Currently, a Concurrent Versions System (CVS) is employed to regulate and track changes. Those interested can request e-mail notification of any changes. Each committed set of changes is versioned and archived. Suggestions from the community for additional terms or for other improvements are welcome (details below).

Properties of GO Vocabularies

GO vocabularies are DAGs
GO vocabularies are structured as directed acyclic graphs (DAGs), wherein any term may have one or more parents as well as zero, one, or more children (Fig. 7.2.1). Within each vocabulary, terms are defined, and parent-child relationships between terms are specified. A child term is a subset of its parent(s). Thus, for example, the fact that the nucleolus is part of the nuclear lumen, which in turn is part of the (eukaryotic) cell, can be captured; further, the DAG structure permits GO to represent "endoribonuclease" as a subcategory of both "endonuclease" and "ribonuclease."

Figure 7.2.1 The GO vocabularies are sets of defined terms and specifications of the relationships between them. As indicated in this diagram, the GO vocabularies are directed acyclic graphs: there are no cycles, and "children" can have more than one "parent." In this example, germ cell migration has two parents; it is "part of" gamete generation and "is a" (is a subtype of) cell migration. The GO uses these elementary relationships in all vocabularies.

GO terms with their definitions are accessioned
The accession ID is tracked by GO. The accession ID more precisely belongs with the definition. Thus, if a term changes (e.g., from "chromatin" to "structural component of chromatin") but the definition for the term does not change, the accession ID will remain the same. Terms can become obsolete. Obsolete terms continue to be maintained and tracked in the GO database system.

True-path rule
The multiple parentage allowed by the DAG structure is critical for accurately representing biology. GO developers impose an additional constraint on the parent-child relationships specified in the vocabularies: every possible path from a specific node back to the root (most general) node must be biologically accurate. Because some functions, processes, and cellular components are not found in all species, many terms will not be used to annotate gene products from a given organism. The general working rule is that terms are included if they apply to more than one taxonomic class.
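To make the DAG structure concrete, here is a minimal, illustrative sketch (not GO software; the tiny graph reuses the Figure 7.2.1 example) showing multiple parentage and the enumeration of every path from a term back to the root — the paths that the true-path rule requires to be biologically accurate:

```python
# Toy child -> [(parent, relationship)] mapping from the Figure 7.2.1 example.
# In a real GO file, each child may have several parents via "is_a" or "part_of".
parents = {
    "germ cell migration": [("gamete generation", "part_of"),
                            ("cell migration", "is_a")],
    "gamete generation": [("biological_process", "is_a")],
    "cell migration": [("biological_process", "is_a")],
}

def paths_to_root(term, root="biological_process"):
    """Enumerate every path from a term up to the root node."""
    if term == root:
        return [[term]]
    result = []
    for parent, _rel in parents.get(term, []):
        for path in paths_to_root(parent, root):
            result.append([term] + path)
    return result

# Two parents -> two distinct paths to the root; the true-path rule
# requires each of them to be biologically accurate.
for p in paths_to_root("germ cell migration"):
    print(" -> ".join(p))
```

Because the graph is acyclic, the recursion always terminates; a term with several parents simply contributes one path per parent.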
In accordance with the true-path rule, however, relationships between terms must be specified such that the path from any term leads to the root only via parent terms that are relevant to the organism in question. A parent term must never be specific to a narrower taxon than any of its children.

Relationship types
At present, GO vocabularies define two semantic relationships between parent and child terms: "is a" and "part of." The is a relationship is used when the child is a subclass of the parent, e.g., "endonuclease activity" is a subcategory of "nuclease activity." The part of relationship is used when the child is a component of the parent, such as a subprocess ("DNA replication initiation" is part of "DNA-dependent DNA replication") or a physical part ("nucleolus" is part of "nuclear lumen"). Further, the relationship's meaning is restricted in one direction but not the other: part of means "necessarily part of" but not "necessarily has part." In other words, the parent need not always encompass the child, but the child exists only as part of the parent. For example, in the cellular component ontology, "prereplicative complex" is part of "nucleoplasm," although it is only present in the nucleoplasm at particular times during the cell cycle. Whenever the prereplicative complex is present, however, it is in the nucleoplasm. In addition, any term may be a subtype of one parent and part of another; e.g., "nuclear membrane" is part of "nuclear envelope" (and therefore also part of the nucleus) and is a subtype ("is a") of "organelle membrane."

Initially, GO curators used the part of relationship to link the regulation terms with the processes being regulated. We are now implementing the replacement of these "part of" relationships with a new relationship type called "regulates." We have also created a regulates hierarchy in the graph, with "regulation of biological process," "regulation of molecular function," and "regulation of biological quality" as the parent nodes.
These terms have been used to create the appropriate subsumption hierarchy for terms that describe regulation of biological processes, molecular functions, and measurable biological attributes. In the cases of biological process and molecular function, automated reasoning has been used to ensure that the regulates portion of the graph and the portion of the graph describing the processes being regulated are consistent. The introduction of this new relationship type better reflects the underlying biology. Users can now choose to exclude or include in their analyses gene products that play a regulatory role in a biological process. One of the limitations of GO is the paucity of relationship types. As noted above, the is a and part of relationships can be seen to contain several sub-relationships. Further development and formalization of GO should result in more robust analysis and representation of relationships among the terms; GO will use relationships drawn from the OBO Relations Ontology (http://www.obofoundry.org/ro/; Smith, 2005).

HOW DO GO VOCABULARIES RELATE TO OTHER RESOURCES SUCH AS INTERPRO?
Various other classification schemes have been indexed to GO, including the SwissProt keyword set and MetaCyc Pathways and Reactions (UNIT 1.17). These mappings are provided to the public at the GO Web site (http://www.geneontology.org/GO.indices.shtml). They are reviewed and updated as needed.

HOW ARE GENES AND GENE PRODUCTS ASSOCIATED WITH GO TERMS?
Genes and gene products can, of course, be associated with GO terms by anyone who wishes to do so. For the groups participating in the GO Consortium, some general rules concerning gene associations to GO have been formulated. A gene product may be annotated to zero or more nodes of each ontology, and may be annotated to any level within the ontology.
A well-characterized RNA or protein might be annotated using very specific terms, whereas a little-studied gene product might be annotated using only general terms. All GO terms associated with a gene product should refer to its normal activity and location. Functions, processes, or localizations observed only in mutant or disease states are therefore not included. Participating databases contribute sets of GO annotations to the GO site, providing a set of data in a consistent format. Details of these conventions can be found in the GO Annotation Guide (http://www.geneontology.org/GO.annotation.html).

Evidence Codes and Citations
Every association made between a GO term and a gene product must be attributed to a source, and must indicate the evidence supporting the annotation. A simple controlled vocabulary is used to record evidence types; it is described in the GO Evidence Codes document (http://www.geneontology.org/GO.evidence.shtml). For a single gene product, there may be strong evidence supporting annotation to a general term, and less reliable evidence supporting annotation to a more specific term. Many of the evidence codes represent certain types of experimental data, such as inferred from mutant phenotype (IMP) or inferred from direct assay (IDA), which might be found in the literature describing a gene product. One evidence code, inferred from electronic annotation (IEA), is distinguished from the rest in that it denotes annotations made by computational methods, the results of which are not usually checked individually for accuracy. Annotations using the IEA code are therefore generally less reliable than those that have other types of evidence supporting them.

HOW DO I BROWSE GO AND FIND GO ANNOTATIONS FOR "MY" GENES?
Several browsers have been created for browsing GO and finding GO associations for genes and gene products. These can be accessed at the GO Web site.
The AmiGO browser, as an example, allows searches by both GO term (or portion thereof) and gene products. The results include the GO hierarchy for the term, the definition and synonyms for the term, external links, and the complete set of gene product associations for the term and any of its children (Fig. 7.2.2; http://amigo.geneontology.org/).

CAN I DOWNLOAD GO?
GO vocabularies, association tools, and documentation are freely available and have been placed in the public domain. GO is copyrighted to protect the integrity of the vocabularies, which means that changes to the GO vocabularies need to be made by GO developers. However, anyone can download GO and use the ontologies in their annotation or database system. The GO vocabularies are available in several formats, including OBO (GO, 2004), OWL (Horrocks, 2003), OBO XML, and RDF-XML. Monthly snapshots of the OBO v1.0 format GO file are also saved and posted on the GO Web site; these provide other information systems with a stable version of GO and the ability to plan for regular updates of GO in their systems. GO is also available as a MySQL database (UNIT 9.2); the database schema accommodates both vocabulary and gene association data, and downloads with and without the gene associations are available. More information about downloading GO can be found on the Web site (http://www.geneontology.org/GO.downloads.shtml), as can the citation policy (http://www.geneontology.org/GO.cite.shtml).

WHERE CAN I ACCESS AND/OR OBTAIN THE COMPLETE GENE PRODUCT/GO ASSOCIATION SETS?
As with the vocabularies, the gene product/GO association sets from contributing groups are available at the GO Web site. Tab-delimited files of the associations between gene products and GO terms made by the member organizations are available from their individual FTP sites, or from a link on the Current Annotations table.
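As an illustration of how simple the OBO flat file format is to consume, the following sketch parses [Term] stanzas from a fragment of OBO-style text. The stanzas are abridged (real files carry many more tags, definitions, and cross-references), and only the common "key: value" lines are handled; GO:0003678 is from the text above, while the parent ID used here is an assumption for illustration:

```python
# Abridged OBO-style fragment. GO:0003678 ("DNA helicase activity") is
# from the text; linking it to GO:0004386 as its is_a parent is an
# assumption made for this example.
sample = """\
[Term]
id: GO:0003678
name: DNA helicase activity
namespace: molecular_function
is_a: GO:0004386

[Term]
id: GO:0004386
name: helicase activity
namespace: molecular_function
"""

def parse_obo(text):
    """Collect [Term] stanzas into a dict keyed by GO accession ID."""
    terms = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}                      # start a new stanza
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current.setdefault(key, []).append(value)
            if key == "id":
                terms[value] = current
    return terms

terms = parse_obo(sample)
print(terms["GO:0003678"]["name"])
```

A tag maps to a list because OBO allows a tag (e.g., is_a) to repeat within one stanza, which is exactly how multiple parentage is recorded.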
The "gene association" file format is described in the Annotation Guide (http://www.geneontology.org/GO.format.annotation.shtml). These files store IDs for the objects (genes/gene products) in the database that contributed the file (e.g., FlyBase IDs, or SwissProt accession IDs for proteins), as well as the citation and evidence data described above. The FTP directory is found at ftp://ftp.geneontology.org/pub/go/gene-associations/. There are also files containing SwissProt/TrEMBL protein sequence identifiers for gene products that have been annotated using GO terms; they are available via FTP from ftp://ftp.geneontology.org/pub/go/gp2protein/.

Figure 7.2.2 The AmiGO browser provides access to the GO and to contributed gene association sets. Queries can initiate with GO terms or gene product terms; results can be filtered in various ways. The AmiGO browser was developed by the Berkeley Drosophila Genome Project.

WHERE CAN I FIND GO ANNOTATIONS FOR TRANSCRIPTS AND SEQUENCES?
Gene objects in a model organism database typically have multiple nucleotide sequences from the public databases associated with them, including ESTs and one or more protein sequences. There are two ways to obtain sets of sequences with GO annotations: (1) from the model organism databases or (2) from the annotation sets for transcripts and proteins contributed to GO by Compugen and Swiss-Prot.

Obtaining GO Annotations for Model Organism Sequence Sets
In gene association files, GO terms are associated with an accession ID for a gene or gene product from the contributing data resource. Usually, files associating these gene IDs with sequence IDs are also available from the contributing model organism database.
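Because a gene-association file is plain tab-delimited text, extracting annotations is straightforward. The sketch below assumes the standard column order of the format (database, object ID, symbol, qualifier, GO ID, reference, evidence code, further columns) and uses an invented record; it also shows the common step of setting aside IEA-supported annotations, which, as noted above, are generally less reliable than experimentally supported ones:

```python
# One invented GAF-style record; real files have additional columns
# (aspect, names, synonyms, object type, taxon, date, assigned-by).
record = "FB\tFBgn0000001\thypoGene\t\tGO:0003678\tPMID:0000000\tIEA\t"

def parse_gaf_line(line):
    """Pull the most frequently used fields from one annotation line."""
    cols = line.rstrip("\n").split("\t")
    return {"db": cols[0], "object_id": cols[1], "symbol": cols[2],
            "go_id": cols[4], "evidence": cols[6]}

def drop_iea(records):
    """Keep only annotations backed by non-IEA evidence."""
    return [r for r in records if r["evidence"] != "IEA"]

ann = parse_gaf_line(record)
print(ann["go_id"], ann["evidence"])
print(drop_iea([ann]))  # the only record is IEA-supported, so it is dropped
```

In practice one would also skip comment lines (those beginning with "!") and inspect the qualifier column, which can negate an association.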
For example, the Mouse Genome Informatics FTP site (ftp://ftp.informatics.jax.org/pub/informatics/reports) includes the gene association files contributed to GO, as well as other reports that include official mouse gene symbols and names and all curated gene-sequence ID associations.

Obtaining GO Annotations for Transcripts and Proteins in General
Large transcript and protein sequence datasets are annotated to GO by SwissProt/TrEMBL. These files can be downloaded directly from the GO Web site. The species of origin for each sequence is included in the association files.

HOW CAN GO BE USED IN GENOME AND EXPRESSION ANALYSIS?

Using Gene Association Sets in Annotation of New Genes
Genome and full-length cDNA sequence projects often include computational (putative) assignments of molecular function based on sequence similarity to annotated genes or sequences. A common tactic is to use a computational approach to establish some threshold sequence similarity to a Swiss-Prot sequence. GO associations to the Swiss-Prot sequence can then be retrieved and associated with the gene model. Under GO guidelines, the evidence code for this event would be IEA. For example, various permutations of this approach were used in the functional annotation of 21,000 mouse cDNAs (The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM Consortium, 2001). One aspect of the use of GO for annotation of large datasets is the ability to group gene products under some high-level term. While a gene product may be precisely annotated as having a particular function in carbohydrate metabolism (e.g., glucose catabolism), in the summary documentation of the dataset all gene products functioning in carbohydrate metabolism could be grouped together as being involved in the more general phenomenon of carbohydrate metabolism. Various sets of GO terms have been used to summarize experimental datasets in this way.
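This grouping of precise annotations under high-level terms can be sketched as a walk up a child-to-parent mapping until a term from a chosen high-level set is reached. The toy graph below reuses the process names from the example above and is not GO software:

```python
# Toy child -> [parents] mapping (process names from the example above).
parents = {
    "glucose catabolism": ["carbohydrate metabolism"],
    "carbohydrate metabolism": ["metabolism"],
    "metabolism": [],
}

def map_to_slim(term, slim_terms):
    """Walk up the graph until a term in the high-level set is reached;
    return None if no ancestor is in the set."""
    if term in slim_terms:
        return term
    for parent in parents.get(term, []):
        hit = map_to_slim(parent, slim_terms)
        if hit:
            return hit
    return None

print(map_to_slim("glucose catabolism", {"carbohydrate metabolism"}))
```

With a real DAG a term can reach the high-level set along several routes, so practical implementations usually collect all matching ancestors rather than stopping at the first.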
The published sets of high-level GO terms used in genome annotations and publications can be archived at the GO site.

Using the Gene Association Sets in Annotation of Expression Information
The inclusion of GO annotation in microarray datasets can often reveal why a particular group of genes shares similar expression patterns. Sets of co-expressed genes can encode products that are involved in a common biological process, and may be localized to the same cellular component. In cases where a few uncharacterized genes are co-expressed with well-characterized genes annotated to identical or similar GO process terms, one can infer that each "unknown" gene product is likely to act in the same process. Software for manipulating and analyzing microarray gene expression data that incorporates access to GO annotations for genes is now becoming available. For example, the Expression Profiler is a Web-based set of tools for the clustering and analysis of gene expression data developed by Jaak Vilo at the European Bioinformatics Institute (EBI; for review, see Quackenbush, 2001). One of the tools in this set is EP:GO, which allows users to search GO vocabularies and extract genes associated with various GO terms to assist in the interpretation of expression data. The GO Consortium provides a Web presence where developers can provide access to their GO tools (http://www.geneontology.org/GO.tools.shtml).

HOW CAN I SUGGEST ADDITIONAL TERMS OR CONTRIBUTE TO THE GO PROJECT?
For changes to the ontologies, a page at the SourceForge site allows GO users to submit suggestions to GO curators (http://sourceforge.net/projects/geneontology). This system allows submitters to track the status of their suggestions, both online and by e-mail, and allows other users to see what changes are currently under consideration.
GO also welcomes biologists to join Curator Interest Groups or to participate in meetings devoted to specific areas of the vocabularies. Both interest groups and meetings provide mechanisms to bring GO curators and community experts together to focus on areas of the ontology that may require extensive additions or revisions. Curator Interest Groups are listed on the GO Web site (http://www.geneontology.org/GO.interests.shtml); ontology content meetings are organized as the need arises and as biological experts become available to participate. GO also has several mailing lists, covering general questions and comments, the GO database and software, and summaries of changes to the ontologies. The lists are described at http://www.geneontology.org/GO.mailing.lists.shtml. Any questions about contributing to the GO project should be directed to the GO helpdesk at gohelp@geneontology.org.

SUMMARY
The development of GO is a practical and ongoing approach to the need for consistent, defined, structured vocabularies for biological annotation. Originating from the biological community, the project continues to be enhanced through the involvement of ontology engineers and through the availability of software tools for access to GO and to GO association datasets. GO is one example of several emerging bio-ontology and biological standards projects, which include the work of the MGED group (http://www.cbil.upenn.edu/Ontology/MGEDontology.html), various species-specific anatomies (Bard and Winter, 2001), and structured vocabularies for phenotypes and disease states. This work facilitates both research in comparative genomics and proteomics and the interconnection of bioinformatics and medical informatics systems. The GO project continues to provide a vital and illuminating example of community development of an information resource that benefits all biological research.
ACKNOWLEDGEMENTS
We thank Martin Ringwald, Carol Bult, and Jane Lomax for careful reading and useful suggestions. This work summarizes the efforts of all the people working together as part of the Gene Ontology Consortium. The Gene Ontology Consortium is supported by a grant to the GO Consortium from the National Institutes of Health (HG02273) and by donations from AstraZeneca Inc. and Incyte Genomics.

LITERATURE CITED
Baker, P.G., Goble, C.A., Bechhofer, S., Paton, N.W., Stevens, R., and Brass, A. 1999. An ontology for bioinformatics applications. Bioinformatics 15:510-520.
Bard, J. and Winter, R. 2001. Ontologies of developmental anatomy: Their current and future roles. Brief. Bioinformatics 2:289-299.
Eilbeck, K., Lewis, S.E., Mungall, C.J., Yandell, M., Stein, L., Durbin, R., and Ashburner, M. 2005. The Sequence Ontology: A tool for the unification of genome annotations. Genome Biol. 6:R44.
The Gene Ontology Consortium. 2000. Gene Ontology: Tool for the unification of biology. Nat. Genet. 25:25-29.
The Gene Ontology Consortium. 2001. Creating the Gene Ontology resource: Design and implementation. Genome Res. 11:1425-1433.
The Gene Ontology Consortium. 2008. The Gene Ontology project in 2008. Nucleic Acids Res. 36 (Database issue):D440-D444.
GO. 2004. The OBO Flat File Format Specification, version 1.2. Available at http://www.geneontology.org/GO.format.obo-1_2.shtml.
Gruber, T.R. 1993. A translation approach to portable ontology specifications. Knowl. Acquis. 5:199-220.
Horrocks, I., Patel-Schneider, P.F., and van Harmelen, F. 2003. From SHIQ and RDF to OWL: The making of a Web ontology language. Web Semant. 1:7-26.
Jones, D.M. and Paton, R.C. 1999. Toward principles for the representation of hierarchical knowledge in formal ontologies. Data Knowl. Eng. 31:102-105.
Karp, P.D., Riley, M., Paley, S.M., Pellegrini-Toole, A., and Krummenacker, M. 1999. EcoCyc: Encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Res. 27:55-58.
Quackenbush, J. 2001.
Expression Profiler: A suite of Web-based tools for the analysis of microarray gene expression data. Brief. Bioinform. 2:388-404.
The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM Consortium. 2001. Functional annotation of a full-length mouse cDNA collection. Nature 409:685-690.
Riley, M. 1993. Functions of the gene products of Escherichia coli. Microbiol. Rev. 57:862-952.
Riley, M. 1998. Systems for categorizing functions of gene products. Curr. Opin. Struct. Biol. 8:388-392.
Schulze-Kremer, S. 1998. Ontologies for molecular biology. Pacific Symp. Biocomput. 3:695-706.
Sklyar, N. 2001. Survey of existing bio-ontologies. Technical Report 5/2001, Department of Computer Science, University of Leipzig, Germany.
Smith, B., Ceusters, W., Klagges, B., Köhler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A.L., and Rosse, C. 2005. Relations in biomedical ontologies. Genome Biol. 6:R46.
Stevens, R., Goble, C.A., and Bechhofer, S. 2000. Ontology-based knowledge representation for bioinformatics. Brief. Bioinform. 1:398-414.

INTERNET RESOURCES
http://www.nlm.nih.gov/research/umls/
UMLS Unified Medical Language System.
http://www.geneontology.org/
The Gene Ontology Web site.
http://www.mged.sourceforge.net/ontologies/
The Microarray Gene Expression Data (MGED) Society Ontology Working Group (OWG) Web site.
http://dol.uni-leipzig.de/pub/2001-30/en
A survey of existing bio-ontologies (Sklyar, 2001).
http://www.w3.org/TR/owl-guide
OWL Web Ontology Language Guide.
Analysis of Gene-Expression Data Using J-Express
UNIT 7.3
Anne Kristin Stavrum,1,2 Kjell Petersen,3 Inge Jonassen,1,3 and Bjarte Dysvik4
1 University of Bergen, Bergen, Norway
2 Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
3 Computational Biology Unit, BCCS, University of Bergen, Bergen, Norway
4 MolMine AS, Thormoehlens, Bergen, Norway

ABSTRACT
The J-Express package has been designed to facilitate the analysis of microarray data with an emphasis on efficiency, usability, and comprehensibility. The J-Express system provides a powerful and integrated platform for the analysis of microarray gene expression data. It is platform-independent in that it requires only the availability of a Java virtual machine on the system. The system includes a range of analysis tools and a project management system supporting the organization and documentation of an analysis project. This unit describes the J-Express tool, emphasizing central concepts and principles, and gives examples of how it can be used to explore gene expression data sets. Curr. Protoc. Bioinform. 21:7.3.1-7.3.25. © 2008 by John Wiley & Sons, Inc.

Keywords: gene expression; J-Express; microarray; spot intensity quantitation

INTRODUCTION
The J-Express package has been designed to facilitate the analysis of microarray data with an emphasis on efficiency, usability, and comprehensibility. An early version of J-Express was described in an article in Bioinformatics in 2001 (Dysvik and Jonassen, 2001). This unit describes the J-Express tool, emphasizing central concepts and principles. Examples show how it can be used to explore gene-expression data sets. The J-Express system provides a powerful and integrated platform for the analysis of microarray gene-expression data. It is platform-independent in that it requires only the availability of a Java virtual machine on the system.
The system includes a range of analysis tools and, importantly, a project-management system supporting the organization and documentation of an analysis project. The package can be used not only for analysis of microarray gene-expression data, but also to analyze any set of objects where each measurement is represented by a multidimensional vector. For example, it has been used to analyze data from 2-D gel experiments. J-Express allows the user to import output files from spot-quantitation programs such as GenePix and ScanAlyze and to take the data through filtering and normalization procedures to produce log-ratio data (see Basic Protocol 1). Alternatively, the user can input externally processed gene-expression data. These data can be log-ratio type data (relative quantitation of mRNA abundances) or more direct mRNA quantitations produced, for example, using Affymetrix technology. The program offers a choice of different unsupervised analysis methods, including clustering and projection methods (see Basic Protocol 2). Supervised analysis methods include differential expression analysis and gene-set enrichment approaches. For a discussion of supervised and unsupervised analysis methods, see Background Information.

Current Protocols in Bioinformatics 7.3.1-7.3.25, March 2008. Published online March 2008 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/0471250953.bi0703s21. Copyright © 2008 John Wiley & Sons, Inc.

J-Express automatically keeps track of the processing and analysis steps through which the user takes the data. This helps users keep track of their own projects and allows documentation of produced results and visualizations.

BASIC PROTOCOL 1
CREATE A GENE-EXPRESSION MATRIX FROM SPOT INTENSITY DATA WITH J-EXPRESS
In order to analyze microarray data, J-Express creates a gene-expression matrix from spot-intensity data.
The program accepts multiple spot-quantitation file formats, and then filters and normalizes the data before creating the final gene-expression matrix. This protocol discusses loading and filtering raw data, as well as the normalization options.

Necessary Resources

Hardware
Suggested minimum requirements for a PC system: 1-GHz processor and 1 GB of RAM. Large data sets may, however, require more resources.

Software
The J-Express software can be obtained from the Web site http://www.molmine.com/. A license key needs to be installed; such licenses are available from the MolMine Web site. Version 2.7 is free for academic users; see the Web site for more detailed information. The software runs on top of a Java environment, which can be downloaded for most operating systems (e.g., Linux, Windows, and Solaris). Most operating systems are shipped with Java installed, and upgrading or installing new versions is very easy.

Files
A number of spot-quantitation file formats are accepted, and readers for new file formats are easily added. As long as the files are tab-delimited text files with a number of measured and calculated quantities for each spot, customized loaders can be manually set up in a data-format tool included in the framework. As an example, the following protocol uses a synthetic data set, after an idea from Quackenbush (2001). The data were generated by creating seven seed profiles and applying noise to these. The sources and actual data are shown in Figure 7.3.1.

1. Download the software from http://www.molmine.com. Install and start the program.
An installation program is downloaded and executed. The installation program unpacks the J-Express program and places the files in a directory that can be chosen by the user. The procedure is self-explanatory and straightforward.

Load spot intensity data
2. Load "raw data (SpotPix Suite)" using a flexible data-import wizard (see Fig. 7.3.2). Add all your arrays to the experiment by dragging them into the experiment table.
Rename each row to correspond to the sample value. More documentation on setting up your experiment can be found in the documentation and as additional PDF files installed together with J-Express.
The wizard allows the user to set up the experiment by adding virtual arrays in a table representing the experiment. Each row in the table represents a sample measurement, and for each sample measurement, a number of replicate arrays can be added (these will become additional columns in the experiment table). The first column in each row contains the name of the measurement. This is also the name that will appear at the top of each sample column in the final expression matrix. When the experiment has been defined by adding arrays and linking them to data files, the next step is to perform a quality control of the data associated with each array.

Figure 7.3.1 Synthetic data were generated from seven seed profiles by addition of (white) noise. The seed profiles are shown to the left and the resulting synthetic data to the right. The color of each profile is that of the seed profile from which it was generated. If the profiles are thought of as generated from a time-series experiment, the x axis corresponds to the time points. The y axis gives the log-ratio of a gene's expression level (the logarithm of the expression level of a gene at a certain time point divided by its expression level in a reference sample). For example, the "black genes" have an expression level that does not change much during the time course, whereas the "red genes" are unchanged during the first few time steps (but below the reference level), then increase through a number of time steps, and stay the same for the last few time steps.
The data were derived by defining the seven template profiles and generating profiles by adding noise, specifically by adding random numbers between –0.5 and 0.5 (uniform probability) to each gene at each time point. For the color version of this figure go to http://www.currentprotocols.com.

Figure 7.3.2 Data-import pipeline. Spot-intensity data are loaded from a file. A subset of the genes is selected through a filtering step, the intensity values for the remaining genes are normalized, and log-ratios are calculated. The prepared data set is a gene-expression data matrix that can be analyzed using, e.g., clustering methods.

Simple quality control of the array
3. To check the quality of the physical array, it is possible to use the "quality control" tool in J-Express. The user can choose various fields in the array output file and plot these according to their spatial location. For instance, a popular control is to plot the background intensity to see whether there is a correlation between background contribution and spatial location.

Perform preprocessing of each array
The Process tab enables customized routines for preprocessing each array in the experiment. Although it is possible to create an individual processing procedure for each array, it is recommended that all arrays receive the same treatment during low-level preparation. An easy way of doing this is to define a certain sequence (stack) of processes, try them on a single array, and then use the "copy to all" option. What we refer to here as processes are a number of routines available in the framework that can be added to a list (the process stack). An example of such a process is the filter process, which can read any statistic in the array output file and compare it to a value. If the comparison is valid (or invalid, as defined by the user), the measurement (spot) is tagged as filtered.
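The behavior of such a filter process can be sketched as follows. The field names and the threshold are invented for illustration; J-Express itself lets the user pick any statistic from the array output file and define the comparison:

```python
# Invented per-spot records; real array output files carry many statistics.
spots = [
    {"id": "spot1", "signal": 5200.0, "background": 300.0},
    {"id": "spot2", "signal": 310.0, "background": 295.0},
]

def tag_filtered(spots, min_ratio=1.5):
    """Tag spots whose signal/background ratio falls below the threshold.
    Tagged spots are kept in the data but excluded from later analysis."""
    for s in spots:
        s["filtered"] = s["signal"] / s["background"] < min_ratio
    return spots

tag_filtered(spots)
print([s["id"] for s in spots if s["filtered"]])  # -> ['spot2']
```

Note that filtering only tags a measurement rather than deleting it, which mirrors the process-stack design: a later plot or script process can still inspect the tagged spots.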
Another process is the plot process, which can, for instance, be added to the end of each process stack to view a scatter plot of the whole processing procedure. Normalization processes can be added when two-channel arrays are used (see below for inter-array normalization procedures). By adding a plot before and after a normalization process (or moving the same plot from before the normalization process to after), it is easy to see in which way the normalization has changed the data. Another important process worth mentioning is the scripting process. This provides a Jython (Python in Java) interface, which can be used to manually manipulate the data or tag measurements as filtered. The scripting interface can also be used as a programming interface and enables users to develop their own Java classes for data manipulation (e.g., data transformation, filtering, or normalization) and plug them directly into the preprocessing framework.

4. Go to the Process tab and add filtering processes to remove all unwanted measurements. Use the String filter process to remove spots that are not a part of your experiment (e.g., control spots, spikes, empty spots, etc.). If this is a two-channel experiment, add the global lowess normalization process. When satisfied with the process stack, add a plot (one- or two-channel plot) to the end and click "run to" on the plot. Check the distribution of the measurements. If the plot looks OK, click the "copy to all" button to copy this process stack to all arrays.

Further information about normalization
5. For two-channel arrays, normalization consists of a transformation of channel 1 to make it comparable to the second channel; for one-channel arrays, the second channel can be a reference array. The transformation corrects for unequal quantities of hybridization material with each of the two dyes, or for unequal labeling or hybridization properties of the dyes.
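The simplest such transformation, the median method described below, can be sketched as follows (intensity values are invented; this is an illustration, not the J-Express implementation):

```python
import math

# Invented two-channel intensities for four spots.
ch1 = [120.0, 450.0, 80.0, 950.0]
ch2 = [100.0, 400.0, 100.0, 800.0]

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def median_normalize(ch1, ch2):
    """Multiply channel-1 intensities so its median equals channel 2's."""
    factor = median(ch2) / median(ch1)
    return [x * factor for x in ch1]

# After normalization, log2-ratios are centered rather than offset by
# a global dye effect.
norm1 = median_normalize(ch1, ch2)
log_ratios = [math.log2(a / b) for a, b in zip(norm1, ch2)]
```

The single multiplicative factor removes only a global intensity difference between the dyes; the intensity-dependent biases that motivate lowess and spline normalization remain untouched.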
J-Express currently offers the user a choice between four alternative normalization transformations for two-channel arrays (Fig. 7.3.3) and one alternative for one-channel arrays. All normalization methods (substeps a to d) available in J-Express can be instructed to use only a subset of genes in finding the normalization transformation to be applied to the data. This subset is chosen by the user and may contain genes that are expected to change little in expression through the experiment.

Analysis of Gene-Expression Data Using J-Express

a. The first and simplest, the median transformation, multiplies all intensities in channel 1 by a number that makes the medians of the intensities in the two channels identical. The underlying assumption is that most genes have similar expression levels in both samples.

b. The second normalization is a linear-regression method termed MPI, since it was supplied by the collaborating group of Martin Vingron, now at the Max Planck Institute in Berlin (Beissbarth et al., 2001). This method also assumes that most genes have unchanged expression levels between the two samples. First, a percentile (i.e., a value that is above exactly x% of the intensity values) is subtracted from each channel to correct for unequal global background. Second, a multiplicative factor is found to scale the first channel so that most of the highly expressed genes are transformed to lie near the diagonal in a plot of intensity values.

c. The third and most popular method is lowess normalization, which makes the mean intensity within any window along the intensity axis equal in the two channels (Cleveland, 1979). In addition, a procedure to account for outliers is applied, often in several iterations.

d.
The fourth and final normalization option is to use splines (short curved segments) to find a flexible mean between the two channels and process the data so that the splines are as straight as possible (Workman et al., 2002). This method is similar to the lowess method, but the implementation gives more control to the user. Try several normalization methods including lowess, change the parameters, and inspect the results visually. The normalization window includes plots visualizing the input and the output of the normalization algorithm (see Figure 7.3.3). The plots also provide an indication of the quality of the intensity values and their normalization.

Figure 7.3.3 Screen shot of the SpotPix suite in J-Express including a visualization of a lowess normalization. The SpotPix suite shows the experimental design linking data files to samples. For each data file, a sequence of processes (including filters, plots, and normalization procedures) can be edited and executed. The figure shows a process batch including a plot that has been executed so that the plot is included in the screen shot. The plot shows the “regression line” defined by the lowess normalization procedure just above the plot in the process batch.

Global normalization
6. The normalization procedures described above are typical single-array procedures and perform normalization only on two comparable data sources (measurements minus reference measurements, or the two channels of a two-channel array). Another class of normalization methods can normalize sets of data sources, using batches of arrays instead of single pairs. Examples are the RMA procedure (Irizarry et al., 2003) for Affymetrix arrays and quantile normalization (quantile normalization is actually included in the RMA procedure). RMA is short for Robust Multichip Average and is available in the main J-Express file menu.
It lets the user select a batch of Affymetrix chips (as CEL files) and a chip definition file (CDF). It then performs a background correction to correct for nonspecific binding, a quantile normalization step, and finally a “polishing” step using a probe-set summary of the log-normalized probe-level data. The quantile normalization method unifies the expression distribution across all arrays in the experiment.

Compiling and post-processing the expression matrix
7. When a suitable process stack has been created for all the arrays (e.g., by constructing it for one array and copying it to the others), the next step is to compile the expression files into a single expression matrix. Before starting this process, we must decide what to do with measurements removed by our filtering processes.

8. Click the “Post compilation” tab and choose an imputation method. The simplest methods are the row or column mean, but studies show that they can have unwanted effects on the expression matrix. A better approach is to use the LSImpute or KNNimpute methods (Troyanskaya et al., 2001; Bø et al., 2004).

9. Finally, go back to the data tab and select in what form you want the resulting data. For two-channel arrays, log-ratio matrices are mostly used; for one-channel data, it is a good idea to use NONE, or log ratio if a reference sample has been chosen. When clicking “compile,” array replicates and (if chosen by the user) within-array replicates will automatically be combined by the chosen method (the combine method). Each sample will be added as a column and each measurement as a row of an expression matrix. This matrix will be added to a data set wrapper and put into the main project tree.
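The quantile normalization step mentioned above can be sketched in a simplified form that ignores ties (illustrative only, not the RMA implementation):

```python
def quantile_normalize(arrays):
    """Give all arrays the same distribution: each value is replaced by
    the mean of the values at the same rank across all arrays.
    Simplified sketch; ties are not handled specially.
    """
    n = len(arrays[0])
    # Mean of the k-th smallest value across arrays, for each rank k.
    rank_means = [sum(sorted(a)[k] for a in arrays) / len(arrays)
                  for k in range(n)]
    result = []
    for a in arrays:
        # Rank of each element within its own array.
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = rank_means[rank]
        result.append(normalized)
    return result

arrays = [[2.0, 4.0, 6.0], [3.0, 9.0, 5.0]]
norm = quantile_normalize(arrays)
```

After this step every array has exactly the same set of values; only the assignment of values to genes differs, which is what makes intensities comparable across arrays.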
All information about the preprocessing is stored together with the expression matrix, so that opening the raw-data loader (SpotPix Suite) while this data set is selected will bring up the low-level project. The various processing procedures used can also be viewed in the meta info window.

BASIC PROTOCOL 2
ANALYZE A GENE-EXPRESSION MATRIX USING J-EXPRESS
The J-Express program can be used to explore a gene-expression data set contained in a gene-expression data matrix in the J-Express system. For example, one may find sets of genes behaving in a similar manner through a time-series experiment. Most of the methods below analyze and compare gene-expression profiles. A profile signifies the list of expression measurements associated with one gene (a row in the gene-expression matrix).

Profile similarity search window
This window allows the user to select one expression profile (the expression measurements for one gene through a set of experiments or time steps) and to find other genes with similar expression profiles using any of the defined dissimilarity measures (see Background Information). Figure 7.3.4 shows the window and illustrates the difference between two dissimilarity measures.

User-defined profile search
J-Express also allows the user to define a search profile and to search with it to find all matching expression profiles in a gene-expression data matrix. The search profile simply defines lower and upper bounds on the expression level for each array. The user defines a search profile by using the mouse to move the lower and upper limits on the allowed expression levels for each array. The search returns the list of genes for which all expression values fall within the specified limits. A special feature of the profile search is that it allows the user to “cycle” the expression profile, that is, to shift the lower/upper bounds cyclically.
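The bounds test and the cyclic shift of the profile search can be sketched as follows (the list-based bounds format is an assumption made for illustration):

```python
def matches(profile, lower, upper):
    """True if every expression value falls within [lower, upper] per array."""
    return all(lo <= v <= hi for v, lo, hi in zip(profile, lower, upper))

def cycle(bounds, shift):
    """Shift the bounds cyclically, as in a time-shifted pattern search."""
    return bounds[-shift:] + bounds[:-shift]

lower = [0.5, -0.5, -2.0]
upper = [2.0, 0.5, -0.5]
gene_a = [1.0, 0.0, -1.0]   # peaks first, then falls
gene_b = [-1.0, 1.0, 0.0]   # same shape, one time step later

hit_direct = matches(gene_b, lower, upper)               # no match
hit_shifted = matches(gene_b, cycle(lower, 1), cycle(upper, 1))  # match
```

Here gene_b misses the original bounds but matches once the bounds are cycled by one step, the situation the “cycle” feature is designed to reveal.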
This is primarily designed for time-series experiments, where it can be interesting to see sets of genes behaving similarly but with a time difference. Figure 7.3.5 illustrates the Profile Search window.

Exploring the data using clustering and projection methods
Given a gene-expression matrix, one natural question to ask is whether the genes (rows) and/or the arrays (columns) form groups. In other words, one can search for gene sets having similar expression profiles under a given set of conditions. Such genes may be hypothesized to participate in the same biological processes in the cell, for example, taking part in the same metabolic pathway. It is also interesting to identify sets of arrays that have similarities in gene-expression measurements, for example, to identify relationships between different tumor (cancer) samples and potentially identify subtypes represented in a cancer study. In general, given a set of objects and a measure of their dissimilarity, it is reasonable to ask whether the set can be divided into groups so that objects within each group are relatively similar to each other, while there is less similarity between the groups. Partitional clustering methods such as the K-means algorithm will create nonoverlapping groups, which together include the complete set of objects. Alternatively, one may want to organize the objects in a tree. In the tree, very similar objects are grouped together in tight subtrees. As one moves to larger and larger subtrees (up to and including the whole tree), more and more dissimilar objects are included. The tree structure is relatively easy to interpret, and many biologists are used to looking at trees, e.g., phylogenetic trees. However, one should remember that the algorithm imposes a tree structure on the data set even though the data may be better explained using other structures. An alternative to using a clustering method is to project the objects into a two- or three-dimensional space and allow the user to visually analyze the objects in this space. Projection methods include principal component analysis and multidimensional scaling. The main objective of projection is to preserve as much as possible of the information in the lower-dimensional space. Self-organizing maps (SOMs) provide an intermediate between clustering and projection. SOMs group similar objects together, and at the same time the groups are organized in a structure (e.g., a grid) so that groups close to each other in the structure (e.g., neighbor nodes on the grid) contain similar objects.

Figure 7.3.5 The user-defined Profile Search window in J-Express allows the user to define a search profile consisting of a lower and upper limit on the expression values and to find all profiles matching that profile. The search profile is defined by the red/green barred boxes, and the matching expression profiles are shown in black. For the color version of this figure go to http://www.currentprotocols.com.

Figure 7.3.4 The profile similarity search in J-Express allows the user to find the profiles most similar to a query profile when a particular dissimilarity measure is used. The figure illustrates the difference between (A) Euclidean distance and (B) a Pearson correlation-based dissimilarity measure (mathematically, the dissimilarity is 1 minus the correlation coefficient). See Background Information for more about dissimilarity measures.

Hierarchical clustering
This is a conceptually simple and attractive method. An early application of hierarchical clustering to microarray gene-expression data was provided by Eisen et al. (1998).
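The contrast drawn in Figure 7.3.4 can be made concrete: two profiles with the same shape but different amplitude are far apart in Euclidean terms, yet have near-zero correlation-based dissimilarity. A small sketch:

```python
import math

def euclidean(x, y):
    """Ordinary Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_dissimilarity(x, y):
    """1 minus the Pearson correlation coefficient: 0 for identical shape."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]            # same shape, double the amplitude
d_euc = euclidean(x, y)        # large
d_cor = pearson_dissimilarity(x, y)  # essentially zero
```

Which behavior is desirable depends on the question: correlation-based measures group genes by the shape of their response, while Euclidean distance also distinguishes magnitude.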
That work introduced an intuitive way of visualizing the expression profiles of the genes along the edge of the resulting dendrogram. In J-Express, the user can perform hierarchical clustering on a data set by choosing (clicking) the data set of interest in the project tree and then choosing hierarchical clustering from the Methods pull-down menu (alternatively, a button with a tree icon can be clicked). The user then selects which distance measure to use to calculate the tree (see Background Information). Additionally, the user can choose which linkage rule to apply. The alternatives are single linkage, average linkage, and complete linkage. Average linkage also comes in two variants, weighted and unweighted, corresponding, respectively, to the WPGMA and UPGMA methods well known in clustering. Additionally, the user can choose whether only the rows or both the rows and the columns of the gene-expression matrix are to be clustered. The user is also given a high level of control in defining how the results should be displayed on the screen (or in the file, if the graphics are saved to file). Figure 7.3.6 shows the results of hierarchical clustering of the synthetic data set using J-Express and three different linkage rules.

K-means clustering
K-means clustering is a very simple algorithm for dividing a set of objects into K groups. The parameter K needs to be defined prior to clustering. The algorithm chooses K points as initial center points (centroids), one for each cluster. It then alternates between two operations. First, given a set of centroids, allocate each data point to the cluster associated with the closest centroid. Then, given the sets of data points allocated to each of the K clusters, calculate a new centroid for each cluster. If, in two consecutive iterations, the same points are allocated to each of the clusters, the algorithm has converged.
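The two alternating operations can be sketched for one-dimensional data (a toy illustration; J-Express clusters full expression profiles):

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stabilize."""
    assignment = None
    for _ in range(max_iter):
        # Operation 1: allocate each point to the nearest centroid.
        new_assignment = [min(range(len(centroids)),
                              key=lambda k: abs(p - centroids[k]))
                          for p in points]
        if new_assignment == assignment:   # converged
            break
        assignment = new_assignment
        # Operation 2: recompute each centroid as the mean of its points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return centroids, assignment

centroids, labels = kmeans_1d([1.0, 2.0, 10.0, 11.0], centroids=[0.0, 5.0])
```

The max_iter cap anticipates the point made next: convergence is not guaranteed in every case.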
The algorithm may not converge in all cases, and it is convenient to define a maximum number of iterations. While the K-means algorithm is conceptually simple, it does have certain weaknesses. One is that the user needs to define the number of clusters beforehand, and in most cases the user will not have sufficient information to do so. Another weakness is the initialization, since the final result depends strongly on it. As a remedy for this second problem, different heuristic methods have been proposed to find “good starting points,” including the random approach, the Forgy approach, the MacQueen approach, and the Kaufman approach (Peña et al., 1999). In J-Express, the user starts a K-means analysis by choosing it from the Methods pull-down menu (or alternatively by clicking a short-cut button). The user needs to specify the number of clusters, and may choose between a range of distance measures and initialization methods. The most natural distance measure to use is the Euclidean, since the centroids are calculated under the assumption of a Euclidean space. If one seeks clusters of genes with correlated expression profiles, one should, instead of using a correlation-based distance measure, perform mean and variance normalization and then use a Euclidean distance measure in the K-means analysis. Figure 7.3.7 shows the menu allowing the user to start a K-means analysis in J-Express, including control over all the parameters discussed.

Principal component analysis
Principal component analysis (PCA) involves mathematical procedures that transform a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components (Jolliffe, 1986). This approach has been popular for analyzing gene-expression data (Holter et al., 2000; Raychaudhuri et al., 2000).
The main principle is to linearly transform a set of data points in a space of high dimensionality to a set of data points in a space of lower dimensionality, while keeping as much of the variance information in the data set as possible. Conceptually, the one axis through the original space that explains most of the variation in the data set is found first.

Figure 7.3.6 Example showing hierarchical clustering of the synthetic data set using: (A) single linkage, (B) average linkage, and (C) complete linkage. To the right of each clustering is shown from which seed each profile was generated (using the gene-group visualization functionality in the dendrogram window of J-Express).

The variance explained by an axis can be calculated by projecting all data points onto the axis and calculating the variance of this set of (one-dimensional) numbers. Next, one removes the contribution of this axis to the data points (by subtracting the component along the first axis) and repeats the analysis on this new data set. This is continued until the data points end up in one point. The axes identified in each of these analyses constitute the principal components, and each explains a maximal amount of variance while being orthogonal (independent).

Figure 7.3.7 (A) K-means dialog box; (B) K-means result when clustering the synthetic data set. Each cluster is represented by its mean profile and by bars showing the variation within the cluster at each data point.

The PCA functionality in J-Express allows the user to project the expression profiles of interest down to two or three dimensions in order to get a visual impression of the similarity relationships in the data set.
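The idea of the axis explaining most of the variance can be illustrated with a small power-iteration sketch for the first principal component of two-dimensional points (a simplification of full PCA):

```python
import math

def first_pc(points, iters=200):
    """First principal component of 2-D points via power iteration
    on the covariance matrix. Illustrative sketch only."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        # Multiply by the covariance matrix and renormalize.
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v

# Points scattered roughly along the diagonal y = x.
pc = first_pc([(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)])
```

For points lying near the diagonal, the first component comes out close to the direction (1, 1) normalized, the axis along which the cloud varies most.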
Flexible two- and three-dimensional visualization functions allow the user to visually study the data points and to interactively select objects to study. In this way, the user can access the expression profiles of any subset of data points. For example, a two-dimensional view can give an impression of existing clusters as well as outliers in the data set. Inspection of the shape of the principal components themselves can also be informative.

Figure 7.3.8 (A) PCA window with applied density map and a selected green area. (B) Result from the PCA selection. For the color version of this figure go to http://www.currentprotocols.com.

The PCA window comes with a set of options for customizing and controlling the results. For instance, one may want to apply a density map in order to easily see where in the plot the data are most dense (see Fig. 7.3.8). In Figures 7.3.8 and 7.3.9, the density map is used together with a density threshold so that data points in less dense areas are visualized only as spots. This makes it very easy to identify and group outlier genes, which in this case are genes that correlate well with the selected principal components.

Figure 7.3.9 (A) PCA window with over 6000 points (genes). (B) The same points with a density threshold applied to find outliers.

In J-Express, a principal component analysis can be started from the Methods pull-down menu or by clicking a button marked with a coordinate system.

Self-organizing map analysis
Self-organizing maps (SOMs), as originally proposed by Kohonen (1997), have been used to study gene-expression data (Tamayo et al., 1999; Törönen et al., 1999).
An attractive feature of SOMs is that they provide a clustering of the data points and simultaneously organize the clusters themselves, so that clusters with similar expression profiles are close to each other on the map. The SOMs are trained to adapt to the expression profiles under study, a training procedure that is affected by the choice of a large number of parameters. For example, there are parameters controlling the “stiffness” of the map. To help the user understand the effects of changing parameter values, J-Express visualizes the training of the SOM by projecting both the data points (expression profiles) and the neurons in the map into a two- or three-dimensional plot. The projection is done using the most significant principal components. Since the user can see the adaptation of the map during the training phase, he or she can get an impression of the effect of altering the parameter values. Of course, the user should be aware that the two- or three-dimensional plots do not display the complete information in the data set. The program displays the proportion of the variance explained by the utilized principal components. See Figure 7.3.10 for an example.

Figure 7.3.10 (A) SOM training control window. (B) SOM visualized in the PCA window.

After the training of the SOM, the data points are distributed between the neurons in a so-called sweep phase. In this phase, the user chooses whether the object groups collected by the neurons should be disjoint or whether they should be allowed to overlap. The user also sets the maximum distance between a neuron and a data point for the data point to be associated with the neuron.
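The sweep phase with a distance threshold can be sketched as follows (a toy illustration; the neuron positions are hypothetical):

```python
import math

def sweep(points, neurons, max_dist):
    """Assign each data point to its nearest neuron, or to None if the
    nearest neuron is farther than max_dist (exclusive sweep sketch)."""
    assignment = []
    for p in points:
        dists = [math.dist(p, n) for n in neurons]
        best = min(range(len(neurons)), key=lambda k: dists[k])
        assignment.append(best if dists[best] <= max_dist else None)
    return assignment

neurons = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.5), (9.0, 9.5), (5.0, 5.0)]
labels = sweep(points, neurons, max_dist=2.0)
```

The point midway between the neurons exceeds the threshold and stays unassigned, which is exactly the trade-off discussed next.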
If this threshold is set to a low value, one will get “dense” clusters (low within-cluster variance), but at the same time run the risk that some data points are not associated with any of the neurons. The visualization provided by J-Express facilitates the understanding of such effects. A simpler interface to this method has also been added. This interface only asks the user to select the number of neurons the algorithm should use. When finished, the data points are allocated to the neurons using an exclusive sweep. The SOM is started from the Methods pull-down menu in J-Express.

Significance analysis of microarrays
Significance analysis of microarrays (SAM) is a method used to find genes that are differentially expressed between paired or unpaired groups of samples. The genes are scored on the basis of their change in expression between the states, combined with the standard deviation of repeated measurements. The method was developed by Tusher et al. (2001) and uses as its score a regularized version of the well-known Student's t test. Since the distribution of the SAM scores is unknown, significance is estimated by randomly assigning the samples to the sample groups (permutations) and repeating the calculation of SAM scores. This process is repeated a number of times to estimate the distribution of random scores. The original SAM scores are compared to the distribution of SAM scores obtained for the permuted data sets and used to calculate false discovery rates (FDR). The FDR is calculated for a list of the genes with the highest SAM scores, and reflects how many false positives are expected to be among them. The SAM procedure is simple to use. The only parameters the user has to provide are information about which groups the different samples belong to, which groups the analysis should be performed on, and the number of permutations. SAM is available from the Methods menu.
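The regularized t-like score can be sketched as follows (simplified; the published SAM procedure also estimates the constant s0 from the data and assesses significance by permutation):

```python
import math

def sam_score(group1, group2, s0=0.1):
    """t-like statistic with a small constant s0 added to the denominator
    so that genes with tiny variance do not get inflated scores."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    var1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return (m2 - m1) / (se + s0)

# A gene clearly up-regulated in the second group of samples.
score = sam_score([1.0, 1.2, 0.9], [2.0, 2.3, 1.9])
```

Without s0, a gene whose replicates happen to agree almost perfectly would receive an enormous score regardless of effect size; the offset damps that artifact.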
Rank product
The rank product is a very simple and straightforward method that can be used to find differentially expressed genes in a data set. It was developed by Breitling et al. (2004). Rank Products is based on the ranks that a gene obtains, after calculating fold changes between pairs of samples, and then scoring the ranks the gene obtains across the different comparisons. A gene that is ranked high in all of the comparisons will get a good score. The simplicity of the method also makes it a suitable option for meta-analysis (e.g., analysis of a set of gene rankings, each based on analysis of a separate data set). One drawback of the Rank Products method is that it may score a gene as both significantly up- and down-regulated. Rank Products can be started from the Methods menu.

Gene ontology analysis
The Gene Ontology (GO) is a set of well-defined terms used to describe genes, where the terms are organized in hierarchies (actually directed acyclic graphs, or DAGs) reflecting the relationships between the terms. The GO component in J-Express can be used to learn more about the gene lists resulting from the other analysis steps performed. In addition to browsing the GO tree to see which processes the genes in the lists are involved in, its real power becomes apparent when the GO tree of a list of interesting genes (containing the number of genes mapped to each term in the tree) is compared to the GO tree of a reference list (e.g., a list containing all the genes expressed on the array).
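The expected-value comparison this enables can be sketched with hypothetical counts (a simplification; the exact calculation in J-Express may differ):

```python
def go_enrichment(term_counts_ref, term_counts_interest, n_ref, n_interest):
    """For each GO term, compare the observed count in the interesting
    list to the count expected by chance given the reference list.
    Returns term -> (observed, expected, fold enrichment)."""
    fraction = n_interest / n_ref
    result = {}
    for term, ref_count in term_counts_ref.items():
        expected = ref_count * fraction
        observed = term_counts_interest.get(term, 0)
        fold = observed / expected if expected else float("inf")
        result[term] = (observed, expected, fold)
    return result

ref = {"apoptosis": 100, "metabolism": 400}       # counts on the whole array
interest = {"apoptosis": 20, "metabolism": 15}    # counts in the gene list
report = go_enrichment(ref, interest, n_ref=5000, n_interest=250)
```

With a gene list 5% the size of the reference, 5 apoptosis genes would be expected by chance; observing 20 suggests a four-fold enrichment of that term.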
When comparing the two trees, the reference list (together with the relative size of the list of interesting genes compared to the reference list) is used to calculate expected values for the different entries in the GO tree, making it possible to see whether the list of interesting genes is enriched for a particular entry, that is, whether it has more genes belonging to that entry than would be expected by chance. A file containing the DAGs describing the relations between the different GO terms can be downloaded from http://www.geneontology.org and saved to the JExpress/resources/go folder. The association between the data set and the GO terms must be described in an association file. Association files for many organisms can also be downloaded from the Gene Ontology Web resource and placed in the directory resources/go/goassociations under the J-Express directory in your local J-Express installation. The user must then make sure that the identifier used in the association file also exists as a column in the data set. The J-Express Annotation Manager can be used to add annotation. See the J-Express Help for further details. The GO tree method can be started from the Methods menu.

Gene set enrichment analysis
Gene set enrichment analysis (GSEA; Subramanian et al., 2005) uses external information in a search for groups of genes that follow the same trends in a data set. It represents an attempt to overcome some of the shortcomings of other, more traditional methods by avoiding preselection of the number of clusters and cut-offs. It can be performed on either categorical or continuous data. Categorical data is of the type “before treatment” versus “after treatment,” while continuous data can be, for example, time-series data. The external information used by GSEA is gene sets. Gene sets can be defined to capture the biological relationships that the investigator is interested in.
For example, a gene set can be a list of genes sharing some terms in their descriptions, such as “apoptosis” or “receptor activity,” and can be created from almost any source; Gene Ontology is a commonly used one. GSEA starts by ranking the genes using a method selected by the user. For categorical data, this can be the Golub score or SAM, which are methods used to calculate differential expression between two groups of samples. For continuous data, the genes are ranked according to their correlation with a particular search profile. Next, the predefined gene sets are scored by calculating an enrichment score (ES). The ES for a particular gene set is calculated by starting at the top of the ranked gene list and adding a score to the ES every time a gene that is a member of the gene set is encountered, and subtracting a penalty from the ES every time a gene that is not a member of the gene set is encountered. This creates what is referred to as a running sum, as can be seen in Figure 7.3.11. The maximum score reached during the calculation of the running sum is used as the ES for that gene set. The significance of the gene-set scores is estimated by data permutation. A few other methods that use gene sets are also available. GSEA is well known in the community and has been implemented in J-Express to provide tight interactivity with the gene-expression data itself (through synchronization of the different viewers in J-Express) when interpreting the gene set enrichment analysis results. GSEA can be started from the Methods menu.

Scripting interface
The scripting interface adds considerable functionality to the software, and is available both at the preprocessing and at the high-level analysis steps. It allows the user to automate the data analysis and can thus save time when performing repetitive operations on data sets.
The user gets access to the data objects through the script interface and may manipulate the data matrices by using some of the methods built into J-Express, or by writing his or her own scripts. The user can also connect data from J-Express with his or her own Java classes. Both JavaScript and Jython scripting are available in J-Express as of version 2.7. Jython scripting enables full support for Python, but also enables use of Java objects directly in the scripts. Some example scripts come with the installation, and more scripts are available at the J-Express forum: http://www.molmine.com/forum. A script is executed on a particular data set by selecting the data set in the project tree and pressing the Execute button in the script window. Both script types can be started from the Data Set menu, and Jython scripts can also be added to the process list in SpotPix Suite and started from a button with a “play” icon on the J-Express tools panel.

Figure 7.3.11 Results from a GSEA analysis (window bottom right). The figure in the middle of the GSEA window shows the path of the running sum used to find the enrichment score (ES) for a gene set. The ES is determined by the highest (or, in this case, the lowest) point along this walk. The genes that appear at or before this point are the ones contributing to the score and are referred to as the “Leading Edge.” These may be important genes. The data set used shows the life cycle of Plasmodium falciparum (Bozdech et al., 2003). The genes in the data set were ranked according to their correlation to the search profile shown in the top right-hand corner, and Gene Ontology was used to create gene sets (window top middle). When browsing the GSEA result window, the corresponding GO terms will be selected in the GO tree and the gene profiles will be selected in the Gene Graph window (window bottom left).
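The running-sum enrichment score shown in Figure 7.3.11 can be sketched in an unweighted form (the published GSEA statistic weights hits by correlation, so treat this as illustrative):

```python
def enrichment_score(ranked_genes, gene_set):
    """Walk the ranked list: step up on gene-set members, down otherwise,
    with step sizes chosen so the sum ends at zero; ES = maximum of the walk.
    Simplified, unweighted version of the GSEA running sum."""
    n_hit = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hit
    up, down = 1.0 / n_hit, 1.0 / n_miss
    running, es = 0.0, 0.0
    for g in ranked_genes:
        running += up if g in gene_set else -down
        es = max(es, running)
    return es

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})      # members at the top
es_spread = enrichment_score(ranked, {"g1", "g6"})   # members spread out
```

A gene set concentrated at the top of the ranking drives the running sum to a high peak, while the same-sized set scattered through the list peaks much lower.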
GUIDELINES FOR UNDERSTANDING RESULTS
The Basic Protocols of this unit describe how J-Express can be used to filter and normalize the results from a set of microarray scans to obtain a gene-expression matrix, and how the different analysis methods in J-Express can be used to explore such a matrix. J-Express facilitates the interpretation of the results by allowing the user to visually explore the results within J-Express and to export textual representations of the results that can then be imported into external programs. In the protocols above, the different methods have been illustrated using an artificial data set. It has been shown that different dissimilarity measures can give quite different viewpoints on the data. It is important to choose a measure that is appropriate for a particular analysis, and to view filtering and normalization methods in conjunction with the choice of dissimilarity measure. The authors have also tried to illustrate the differences between some of the most popular clustering and projection methods. It is important that one have at least a basic understanding of the methods before drawing conclusions from the results. In general, microarray experiments can provide an overview of the phenomena under study and form the basis for hypotheses that can then be tested, potentially, using other types of (often low-throughput) technology. In order to maximize the benefits from the experiments, a set of powerful analysis methods should be applied and their results compared and assessed. The J-Express package provides some of the most useful and popular analysis methods and allows for comparison between the results.

COMMENTARY
Background Information

Central concepts
An important concept in J-Express is that of a data set. This is the central object that the user provides as input for analyses. It may also be queried and stored.
The relationships between different data sets are automatically recorded and maintained as part of the project-management system. The system keeps track of the data sets loaded into the system and of the sets later generated by the user through operations on the data and through analyses (Fig. 7.3.12). A data set can be one of two types. The most important is the gene-expression data matrix. This can be input to a selection of clustering and visualization methods (see Basic Protocol 2). The other type is spot-intensity data. This can be input to a filtering and normalization procedure giving, as a result, a gene-expression data matrix (see Basic Protocol 1). Another important concept is that of metadata. For each data set stored in the project-management system, J-Express generates metadata that document the steps the user has taken to produce the data set. These data can, for example, include information regarding the file(s) from which the data were loaded, the filtering and normalization procedures followed, and the clustering and selection operations performed. The principle is that, given the metadata, the user should be able to repeat the steps needed to reproduce the result.

Figure 7.3.12 Data flow. Data are loaded from a data medium (typically a hard disk) through a loader/saver module and maintained within the J-Express system as a data set. The project-management system holds the different data sets loaded, as well as derived data sets produced by the user through analysis and processing (e.g., normalization/filtering) steps. The system also stores information on relationships between data sets.

The gene-expression data matrix and object sets
A gene-expression data matrix is a rectangular matrix containing one row per gene and one column per array. Entry (i, j) contains a number quantifying the expression value of gene i in array j.
If the data matrix has been obtained through J-Express’ own normalization procedure applied to a set of two-channel microarrays (see Basic Protocol 1), the value is the log (base 2) ratio of channel 1 divided by channel 2 intensity values for spot i on array j. If applied to a set of one-channel microarrays, the value typically is a normalized form of the log-intensity of the spot (or spots) corresponding to one gene. The analysis routines in J-Express treat the data as numerical values, and their semantics (or scales) are not explicitly used in any of the analyses. For this reason, the program can also be used to analyze types of data other than gene-expression data. In addition to the numerical values, the gene-expression data matrix can also contain textual information about each row (gene) and each column (array)—collectively referred to as objects. Each object normally has an identifier, and, optionally, a set of information fields in the form of character strings. For genes, the identifier could be a GenBank identifier and the information fields could, for example, contain characterization of the gene’s function or its chromosomal location. The identifiers are also the primary keys used in the J-Express annotation manager (see Basic Protocol 2). Associated with a gene-expression data matrix, one can also have a number of object sets, each containing a subset of the genes or the columns. These can be used to specify a set of genes (or columns) sharing annotation information or grouped by the user, for example, on the basis of clustering analysis results (see Basic Protocol 2). The gene sets can be used to color graphical entities (e.g., expression profiles drawn as line graphs or dots in a projection visualization) representing the objects in visual displays. For example, the user can specify that all genes whose annotation matches “heat shock” be colored red while all genes belonging to a certain cluster be colored blue. 
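For concreteness, here is a minimal sketch, in Python (the language of J-Express's Jython scripting module), of how such a matrix of log2 ratios is formed from two-channel intensities. The intensity values are invented for illustration.

```python
from math import log2

# Invented channel intensities for 3 genes (rows) on 2 arrays (columns).
# channel1[i][j] and channel2[i][j]: intensity of spot i on array j.
channel1 = [[1200.0, 1500.0],
            [ 400.0,  390.0],
            [ 100.0, 6400.0]]
channel2 = [[ 600.0,  750.0],
            [ 400.0,  780.0],
            [ 800.0,  800.0]]

# Entry (i, j) of the gene-expression matrix: log2 of channel1/channel2.
expression_matrix = [[log2(c1 / c2) for c1, c2 in zip(row1, row2)]
                     for row1, row2 in zip(channel1, channel2)]

print(expression_matrix[0])  # [1.0, 1.0]: gene 0 is 2-fold up in channel 1
print(expression_matrix[2])  # [-3.0, 3.0]: 8-fold down, then 8-fold up
```

A value of 0 thus means equal intensity in both channels, and each unit corresponds to a twofold change.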
Supervised and unsupervised analysis

Unsupervised analysis of gene-expression data has the goal of identifying groups of genes (or arrays) that are similar to each other, effectively reducing the dimensionality of the data set. For example, a possible goal might be to obtain groups of genes that show similar expression values over all, or over a subset, of the arrays. It can then be hypothesized that such gene sets are biologically related and, depending on the availability of data, this can be analyzed automatically. Hypotheses about a gene's function can also be based on the functional properties of other genes found in the same cluster.

In supervised analysis, a set of objects (either genes or columns, e.g., expression profiles from different patients) is given labels. When the samples are divided into groups, a primary goal is to identify genes and (predefined) gene sets that show differential expression between the sample groups. For this purpose, methods such as SAM (Tusher et al., 2001), rank products (Breitling et al., 2004), and gene set enrichment analysis (Subramanian et al., 2005) are applied. The key point in all of these methods is that they report, for each gene (or gene set), a statistic reflecting differential expression, together with a p value: the probability of finding this or a more extreme value of the reported statistic (e.g., a t score), assuming that the gene is expressed at the same level in both groups. Methods for taking multiple testing into account (the analysis is done not for one gene but typically for thousands of genes) include Bonferroni correction, the False Discovery Rate (FDR), and Q-values (Cui and Churchill, 2003). In J-Express, p values are reported together with FDR values or Q-values. Another goal in supervised analysis is to develop a classifier that is able to predict the labels of as yet unlabeled examples.
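To make the multiple-testing step concrete, the following sketch applies the Benjamini-Hochberg false discovery rate adjustment to a list of per-gene p values. The p values are invented; in practice they would come from a test such as SAM or a t test.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg FDR-adjusted values (q-values),
    one per input p value, in the original order."""
    n = len(p_values)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q[i] = running_min
    return q

p = [0.001, 0.008, 0.039, 0.041, 0.20]
q = benjamini_hochberg(p)
print([round(v, 5) for v in q])  # [0.005, 0.02, 0.05125, 0.05125, 0.2]
```

A q-value of 0.05 for a gene means that, if every gene with an equal or smaller q-value is called differentially expressed, about 5% of those calls are expected to be false positives.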
For example, one may wish to develop a method to predict functional properties of genes (e.g., Brown et al., 2000) or the cancer subtype of a patient (e.g., Golub et al., 1999). Techniques applied here include support vector machines, k-nearest-neighbor classifiers, and artificial neural networks. For a fuller discussion of supervised versus unsupervised analysis, see, for instance, Brazma and Vilo (2000).

Expression-profile dissimilarity measures

An expression profile describes the (relative) expression levels of a gene across a set of arrays (i.e., a row in the gene-expression matrix) or the expression levels of a set of genes in one array (i.e., a column in the matrix). In cluster analysis (see Basic Protocol 2), one seeks to find sets of objects (genes or arrays) with similar expression profiles, and for this one needs to quantify the degree to which two expression profiles are similar (or dissimilar). Clustering is more easily explained using dissimilarity (or distance) measures, and this terminology will be used in this unit.

Figure 7.3.13 Illustration of distance measures for pairs of points in a two-dimensional space. (A) Euclidean distance; (B) Manhattan (city block) distance.

One can measure expression dissimilarities in a number of different ways. A very simple measure is Euclidean distance, which is the length of the straight line connecting the two points in multidimensional space (where each element in the expression profile gives the coordinate along one of the axes). Another simple measure is often referred to as city-block or Manhattan distance; it sums the absolute differences in expression values, with the sum taken over all the dimensions. Other measures quantify the similarity in expression-profile shape (e.g., whether the genes go up and down in a coordinated fashion across the arrays), and are based on measures of correlation.
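These three kinds of measures can be sketched in a few lines of Python; the two profiles below are invented to show how a correlation-based measure ignores a constant offset that Euclidean and Manhattan distances both see.

```python
from math import sqrt

def euclidean(x, y):
    """Length of the straight line between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute per-dimension differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """1 - Pearson correlation: near 0 for profiles with identical shape,
    near 2 for perfectly anti-correlated ones; ignores offset and scale."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

a = [0.0, 1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0, 5.0]           # same shape as a, offset by +2

print(euclidean(a, b))             # 4.0
print(manhattan(a, b))             # 8.0
print(correlation_distance(a, b))  # ~0.0: the profiles rise in step
```

Profiles a and b are "far apart" under the first two measures but have zero correlation distance, which is exactly the behavior one wants when looking for coordinated up/down regulation rather than matching absolute levels.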
Figure 7.3.13 illustrates two representative distance concepts in two dimensions. In J-Express, the user can decide, for each clustering method (see Basic Protocol 2), which dissimilarity measure should be used. It is a good idea for the user to explore the alternative measures separately in the expression-profile similarity search engine to become familiar with the properties of each measure.

Critical Parameters

Experimental design: Intra- and interarray normalization

For normalization of data from two-channel platforms using a common reference design, it is assumed that the experiment is designed so that the reference sample (shared between the arrays) is hybridized to channel 2 on each array; the normalization is then carried out for each array by normalizing channel 1 with respect to channel 2 (see Fig. 7.3.14). If the reference is not hybridized to channel 2 on all arrays, the user can swap the data columns to move the reference channel to the second position for each array.

J-Express is also designed to handle one-channel data. In this case, only channel 1 is used and the arrays are normalized using quantile normalization (Bolstad et al., 2003). Quantile normalization finds the average distribution over all of the arrays in the experiment and then transforms each array to have that average distribution. Another option is to use one array as a reference and normalize the other arrays with respect to it.

Selecting a clustering method

J-Express and other gene-expression analysis systems provide a choice of different clustering methods (Basic Protocol 2). It is difficult to provide definite advice on which method should be used in any one concrete situation. The history of clustering theory in general, and of clustering of gene-expression data in particular, shows that there is no one method that outperforms all others on all data sets (Jain and Dubes, 1988).
Different investigators find different methods and output representations more useful and intuitive. There are, however, some points to keep in mind when considering alternative methods. For example, a hierarchical clustering method as presented here assumes that it is possible to find a binary (bifurcating) tree that fits the structure of the data well. This may not always be the case; for example, there may be more complex similarity relationships between different clusters than such a tree can naturally describe. Other methods also have their shortcomings. In K-means clustering, the user needs to select the number of clusters beforehand, and the method does not give any information about the relationships between the identified clusters. With a self-organizing map (SOM), the choice of underlying topology affects the result: one may choose, for example, a two-dimensional grid (as above) or a three- or four-dimensional one, and the different choices may produce quite different results. All in all, it is probably a good policy to try out more than one method, using alternative parameter values, in order to get the most out of a concrete data set.

Figure 7.3.14 Different experimental designs using a two-channel system. In a two-channel system, one typically uses either a common control hybridized to each array (in either one of the two channels), or one performs competitive hybridizations between all (or a subset of) the pairs of samples under analysis. Presently, J-Express supports the first experimental design (left). Note that, on the left, all samples are hybridized together with a common control (referred to as A in the example), while, if one uses the all-pairs approach, every possible pair of samples is hybridized together.
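To illustrate the role of the preselected cluster count k, here is a bare-bones k-means sketch in plain Python. The initialization from the first k profiles is a deterministic toy choice made for reproducibility; real implementations use more careful, usually randomized, initialization (cf. Peña et al., 1999).

```python
def kmeans(points, k, iters=20):
    """Bare-bones k-means. Initializes centroids from the first k points
    (a deterministic toy choice) and alternates assignment/update steps."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            dists = [sum((a - c) ** 2 for a, c in zip(p, cen))
                     for cen in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

profiles = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
print(kmeans(profiles, k=2))  # [0, 0, 1, 1]: the two tight pairs separate
```

Note that k must be supplied up front, and the returned labels say nothing about how the two clusters relate to each other; both limitations are exactly those discussed above.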
J-Express permits the user to do this, and in the future the program will be extended with an even wider selection of clustering methods complementing those currently included (see Suggestions for Further Analysis).

Incorporation of information on gene function

J-Express has been extended with multiple modules that allow the incorporation of functional data. In particular, the user can utilize the KEGG database (Kanehisa et al., 2002) to find the pathways that significantly overlap with user-defined gene sets (the user is warned that KEGG has license requirements for some user groups). J-Express can also import Gene Ontology (GO) files and offers functions to identify significant overlaps between gene groups and GO terms. While these functions are primarily used to help interpret the results of expression data analyses, gene set enrichment analysis allows the user to include biological information to guide the analysis itself (see Basic Protocol 2).

Suggestions for Further Analysis

The results obtained in an analysis of a data set using J-Express can be stored, and further analysis can be performed externally. It may be desirable, for example, to perform a more in-depth analysis of the genes placed together in a cluster by J-Express. One may wish to investigate whether genes with similar expression profiles share statistically significant patterns in their regulatory regions, giving hints of a common regulatory mechanism (see, for instance, Brazma et al., 1998), or to analyze gene expression together with protein expression or interaction data. The J-Express tool will be extended with more functionality in this direction in the future. In some cases one may wish to design new experiments (e.g., knockout or RT-PCR experiments).

Adapting and extending the J-Express system

The plug-in framework. Through a comprehensible plug-in interface, it is possible to connect any Java class to the J-Express framework.
This interface provides the opportunity to create bridges between J-Express and existing systems, as well as new ways to manipulate or analyze the data. In short, the plug-in model consists of a main plug-in Java class with a few abstract methods that must be implemented (sub-classed) by the programmer. Some plug-ins, including high-level normalization, filtering, search, and sorting, are already available with full source code, and can be downloaded from the same Web pages as J-Express. Simple examples, together with an Application Programming Interface (API) and model description, are installed together with the main program package. Below, we briefly describe two of the plug-ins available from the J-Express Web pages (in the latest versions these are also integrated into the main J-Express framework).

Search tools. The search plug-in allows the user to use regular expressions to search the information fields in a gene-expression matrix. For example, the user can search for all genes whose annotation matches "enzyme or kinase," or for all genes whose upstream sequences (if included in the gene-expression matrix) match the pattern [AT]AAAT exactly.

High-level filtering and normalization. It is sometimes appropriate to apply separate filtering and normalization routines to the gene-expression matrices. For example, one may choose to remove the genes that show little variation in expression measurements. In J-Express, this can be done using the available filtering plug-in, for example, to remove the genes whose standard deviation is below some threshold value (for an example, see Fig. 7.3.15).

Figure 7.3.15 The result of applying filters to the (original) synthetic data set: (A) requiring at least 5 values with absolute values above 2; (B) a lower limit on standard deviation only.
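In the spirit of these two plug-ins, a regular-expression search and a variation filter can be sketched in a few lines of Python. The gene identifiers, annotations, profiles, and the 0.5 threshold below are all invented for illustration.

```python
import re
from math import sqrt

# Invented annotation fields and expression profiles.
annotations = {
    "gene1": "protein kinase, cell cycle",
    "gene2": "heat shock protein",
    "gene3": "metabolic enzyme",
    "gene4": "unknown function",
}
profiles = {
    "gene1": [1.0, 2.0, 3.0],    # varying profile
    "gene2": [-2.0, 0.0, 2.0],   # varying profile
    "gene3": [0.5, 0.5, 0.5],    # flat profile
    "gene4": [0.1, 0.0, -0.1],   # nearly flat profile
}

# Search tool: regular-expression match against the annotation fields.
pattern = re.compile(r"enzyme|kinase")
matches = sorted(g for g, ann in annotations.items() if pattern.search(ann))
print(matches)  # ['gene1', 'gene3']

# Variation filter: keep genes whose standard deviation exceeds a threshold.
def std(values):
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

kept = sorted(g for g, prof in profiles.items() if std(prof) > 0.5)
print(kept)  # ['gene1', 'gene2']
```

The flat and nearly flat genes are dropped by the filter, which mirrors the typical use of such a plug-in: removing uninformative rows before clustering.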
Furthermore, one may want to focus on the shape of the expression profiles and not so much on the amplitude of the change or on the offset of all values. In such cases, one can use mean normalization or mean-and-variance normalization (see Fig. 7.3.16 for an illustration). Both normalization procedures operate on the expression profile of each gene separately. The first subtracts the mean from each profile (so that each profile has a mean of zero), while the second also divides the resulting values by the standard deviation of the profile (so that the expression profile's mean becomes zero and its variance becomes one). The second is well suited if one seeks to find genes behaving in a correlated manner (e.g., increasing and decreasing in expression level in a coordinated fashion), and allows simple (e.g., Euclidean) dissimilarity measures to be used also for this kind of analysis.

Figure 7.3.16 J-Express allows the user to normalize the expression profiles of genes (rows in the gene-expression matrix). The example shows the results of normalizing the synthetic data set by (A) mean normalization and (B) mean-and-variance normalization.

Scripting language. J-Express has a separate module supporting scripting in Jython (a Java implementation of Python). This allows users to describe their standard analysis operations as a program and also to add, for example, simple data transformation and analysis functions to J-Express.

Future plans for J-Express

The J-Express system provides a powerful and integrated platform for the analysis of microarray gene-expression data. It is platform-independent in that it requires only the availability of a Java virtual machine on the system.
The system includes a range of analysis tools and, importantly, a project-management system supporting the organization and documentation of an analysis project. J-Express is under continuing development and extension, and future versions will include new functionality as well as improved visualization and management capabilities. J-Express was one of the first tools to include functionality for importing and exporting MAGE-ML files (Spellman et al., 2002). This functionality could be extended to take advantage of, for example, the description of the experimental design in a MAGE-ML file to automatically suggest or execute analysis pipelines that take this information into account. We would also like to develop functionality that allows users to consult other data sets when analyzing their own data and to perform meta-analysis. The scripting functionality of J-Express allows flexible addition of analysis modules, and future work will include developing and making available a larger set of scripts that J-Express users can utilize and adapt to their own needs.

Additional sources of information

To help users get started with J-Express, tutorials are available at the http://www.molmine.com Web site. In addition, the J-Express analysis guide MAGMA shows the user step by step how to do different types of analysis. MAGMA is available from http://www.microarray.no/magma.

Literature Cited

Beißbarth, T., Fellenberg, K., Brors, B., Arribas-Prat, R., Boer, J.M., Hauser, N.C., Scheideler, M., Hoheisel, J.D., Schütz, G., Poustka, A., and Vingron, M. 2000. Processing and quality control of DNA array hybridization data. Bioinformatics 16:1014-1022.

Bø, T.H., Dysvik, B., and Jonassen, I. 2004. LSimpute: Accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res. 32:e34.

Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. 2003.
A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19:185-193.

Bozdech, Z., Llinás, M., Pulliam, B.L., Wong, E.D., Zhu, J., and DeRisi, J.L. 2003. The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biol. 1:001016.

Brazma, A. and Vilo, J. 2000. Gene expression data analysis. FEBS Lett. 480:17-24.

Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. 1998. Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 8:1202-1215.

Breitling, R., Armengaud, P., Amtmann, A., and Herzyk, P. 2004. Rank products: A simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 573:83-92.

Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares, M. Jr., and Haussler, D. 2000. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. U.S.A. 97:262-267.

Cleveland, W.S. 1979. Robust locally weighted regression and smoothing scatterplots. J. Am. Stat. Assoc. 74:829-836.

Cui, X. and Churchill, G.A. 2003. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 4:210.

Dysvik, B. and Jonassen, I. 2001. J-Express: Exploring gene expression data using Java. Bioinformatics 17:369-370.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286:531-537.

Holter, N.S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J.R., and Fedoroff, N.V. 2000.
Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc. Natl. Acad. Sci. U.S.A. 97:8409-8414.

Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., and Speed, T.P. 2003. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31:e15.

Jain, A.K. and Dubes, R.C. 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey.

Jolliffe, I.T. 1986. Principal Component Analysis. Springer-Verlag, New York.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30:42-46.

Kohonen, T. 1997. Self-Organizing Maps. Springer-Verlag, New York.

Peña, J.M., Lozano, J.A., and Larrañaga, P. 1999. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recogn. Lett. 20:1027-1040.

Quackenbush, J. 2001. Computational analysis of microarray data. Nat. Rev. Genet. 2:418-427.

Raychaudhuri, S., Stuart, J.M., and Altman, R.B. 2000. Principal components analysis to summarize microarray experiments: Application to sporulation time series. Pacific Symposium on Biocomputing, 455-466. Stanford Medical Informatics, Stanford University, Calif.

Spellman, P.T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., Sherlock, G., Ball, C., Lepage, M., Swiatek, M., Marks, W.L., Goncalves, J., Markel, S., Iordan, D., Shojatalab, M., Pizarro, A., White, J., Hubley, R., Deutsch, E., Senger, M., Aronow, B.J., Robinson, A., Bassett, D., Stoeckert, C.J. Jr., and Brazma, A. 2002. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 3:RESEARCH0046.

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., and Mesirov, J.P. 2005. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc.
Natl. Acad. Sci. U.S.A. 102:15545-15550.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 96:2907-2912.

Törönen, P., Kolehmainen, M., Wong, G., and Castrén, E. 1999. Analysis of gene expression data using self-organizing maps. FEBS Lett. 451:142-146.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R.B. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17:520-525.

Tusher, V.G., Tibshirani, R., and Chu, G. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 98:5116-5121.

Workman, C., Jensen, L.J., Jarmer, H., Berka, R., Gautier, L., Nielsen, H.B., Saxild, H.-H., Nielsen, C., Brunak, S., and Knudsen, S. 2002. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol. 3:research0048.

UNIT 7.4

DRAGON and DRAGON View: Information Annotation and Visualization Tools for Large-Scale Expression Data

The Database Referencing of Array Genes ONline (DRAGON) database system consists of information derived from publicly available databases including UniGene (http://www.ncbi.nlm.nih.gov/UniGene/), SWISS-PROT (http://www.expasy.ch/sprot/), Pfam (http://www.sanger.ac.uk/Software/Pfam/; UNIT 2.5), and the Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.ad.jp/kegg/).
The DRAGON Annotate tool makes use of relational database technology to allow users to rapidly join their input gene list and expression data with a wide range of information gathered from the above-mentioned public-domain databases (Bouton and Pevsner, 2001), supplying information pertaining to a range of biological characteristics of all the genes in any large-scale gene expression data set. The subsequent inclusion of this information during data analysis and visualization allows for deeper insight into gene expression patterns. The Annotate tool makes it easy for any user with Internet access and a browser to annotate large gene lists with information from these multiple public databases simultaneously. Following annotation with the Annotate tool, the DRAGON View visualization tools allow users to analyze their expression data in relation to biological characteristics (Bouton and Pevsner, 2002). The set of DRAGON View tools provides methods for the analysis and visualization of expression patterns in relation to annotated information. Instead of incorporating the standard set of clustering and graphing tools available in many large-scale expression data analysis software packages, DRAGON View has been specifically designed to allow for the analysis of expression data in relation to the biological characteristics of gene sets.

BASIC PROTOCOL: PREPARING DATA FOR USE WITH THE DRAGON DATABASE AND ANALYZING DATA WITH DRAGON VIEW

This protocol describes how to prepare a tab-delimited text file for use in the DRAGON database, how to understand the resulting data set, and then how to use the DRAGON View visualization tools to analyze the data set in relation to the annotated information gained from DRAGON. To demonstrate this process, the freely available data set associated with the Iyer et al. (1999) study examining the response of human fibroblasts to serum starvation and exposure is used.
This is a good example data set because it is freely available, concerns the expression of human genes across a time course, has been well documented, and is sufficiently large to yield some interesting results. For all stages of this demonstration, more information can be found on the Learn page of the DRAGON Web site (http://pevsnerlab.kennedykrieger.org/learn.htm).

Contributed by Christopher M.L.S. Bouton and Jonathan Pevsner. Current Protocols in Bioinformatics (2003) 7.4.1-7.4.22. Copyright © 2003 by John Wiley & Sons, Inc.

Necessary Resources

Hardware

Windows, Linux, Unix, or Macintosh computer with Internet connection (preferably a broadband connection, e.g., T1, T3, cable, or DSL service)

Software

Internet browser: e.g., MS Internet Explorer 5 (or higher) or Netscape 6 (or higher) on Windows or Macintosh systems; Opera, Netscape 6 (or higher), or Mozilla on Linux-based systems. Internet Explorer 5 or higher and Netscape 6 or higher are preferred, because Netscape 4.x is not capable of supporting all of the functionality provided in the DRAGON Paths tool.

Also required: Spreadsheet program: e.g., MS Excel on Windows or Macintosh systems, or the Sun Microsystems StarOffice suite on Linux systems. Text editor: e.g., TextPad (http://www.textpad.com/) or Notepad on Windows systems; XEmacs (http://www.xemacs.org) on Linux systems.

Finally, for advanced users who may want more flexibility in the manipulation of their text files, the Perl programming language is powerful and easy to use, and allows the user to perform automated text-formatting, file-creation, and file-alteration functions that are useful when analyzing large data sets. ActiveState (http://www.activestate.com) has developed a version of Perl for Windows computers (http://www.activestate.com/Products/ActivePerl/). Otherwise, the http://www.perl.com Web site provides downloads of Perl for Linux, Unix, and Macintosh computers.

Files

The Iyer et al.
(1999) example data files were obtained from the Stanford Microarray data Web site (http://genome-www.stanford.edu/serum/data.html). The two files used for demonstration purposes in this unit may be downloaded at the following URLs:

http://genome-www.stanford.edu/serum/fig2data.txt
http://genome-www.stanford.edu/serum/data/fig2clusterdata.txt

Both files are also available at the Current Protocols Web site: http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm

Optional: The DRAGON database is generated through the automated parsing of flat files provided by publicly available databases (see the DRAGON Web site for a list of the database flat files used by DRAGON). The information in these files is then loaded into a back-end MySQL (http://www.mysql.com; UNIT 9.2) relational database for use by DRAGON (Fig. 7.4.1). Although it may be easier and more intuitive for most users to access the information in these files via the DRAGON Web site, some readers may want to use this information in their own relational databases. For these purposes, all of the tables used in the DRAGON database are provided for download on the DRAGON Web site (http://pevsnerlab.kennedykrieger.org/download.htm) or can be ordered on CD if desired (http://pevsnerlab.kennedykrieger.org/order.htm).

1. Acquire and prepare data in a master matrix file (see Background Information).

Preparing data for analysis is a critical first step in large-scale gene expression data experiments. This preparation requires putting the data into a format that allows for the comparison of gene expression values across all of the conditions in a given experiment. Many expression analysis software packages attempt to make this job easier through automated data-importing functions that keep track of the association of expression values with gene identifiers.
However, most researchers will still want to generate what can be called a "master matrix text flat file" or "master matrix file" (see Background Information) at some point during their analysis process. The two example data files downloaded from the Stanford Microarray Web site (associated with the study of Iyer et al., 1999) are both examples of master matrix files. To begin using these files for the DRAGON demonstration, information from the two files must be merged. The expression data in the fig2clusterdata.txt file is the correct type of ratio data for demonstration purposes; however, only the fig2data.txt file has the GenBank accession numbers available for each sequence on the array. Because a common unique gene ID field is present in both files, merging information between the two files is simple. Both files are opened in MS Excel. All columns in both files are sorted in ascending order by the unique gene ID (UNIQID) column. Because the same set of genes is represented in each file, the sorted information between the two files matches perfectly. If the sets of genes provided in the user's files do not perfectly match, merging the data is still possible, but requires a relational database management system such as MS Access, Oracle, or MySQL in order to perform an appropriate SQL "join" of the common data columns in each file (UNIT 9.2). The GenBank accession number column from the fig2data.txt file is then pasted into the fig2clusterdata.txt file. The resulting file, figure2_combined_data.txt, is used for all further steps in the demonstration.

Figure 7.4.1 The DRAGON home page provides links to all available tools and data sources contained in DRAGON and DRAGON View. The page also contains links to all of the public data files that are used by DRAGON to generate its database.

2. Connect to the DRAGON Web site.
The DRAGON and DRAGON View tools can be accessed on the DRAGON Web site at http://pevsnerlab.kennedykrieger.org/dragon.htm. Each of the tools available on the DRAGON site is listed on the front page of the site (Fig. 7.4.1). All of the tools in DRAGON and DRAGON View are based on Common Gateway Interface (CGI) scripts written in the Perl programming language. As a result, they can all be used via an Internet browser. The DRAGON tools include Search, Annotate, and Compare (Compare is still under construction). The DRAGON View tools include Families, Order, and Paths. In addition to DRAGON and DRAGON View, DRAGON Map, developed by George W. Henry, is a set of tools that allows users to inspect the global expression properties of sequences defined in the UniGene database. Also, a powerful set of normalization tools for microarray data called SNOMAD (Colantuoni et al., 2002) accompanies the DRAGON Web site. Following normalization with methods such as those provided by SNOMAD, the tools used most often on the DRAGON Web site are DRAGON Annotate and then the DRAGON View Families visualization tool. The usage of DRAGON Annotate is demonstrated in the following steps of this protocol; the usage of DRAGON Families is demonstrated in the Support Protocol. In Figure 7.4.1, Internet Explorer 6 is pointed to http://pevsnerlab.kennedykrieger.org/dragon.htm, and the home page is displayed.

Annotate the data set

3. Click on the Annotate link on the main page directly under the title, slightly to the left (Fig. 7.4.1).

The Annotate Web page is structured to help the user through the annotation process by guiding the selection of variables through five sections of the Web page. The general flow of usage for most of the tools in DRAGON and DRAGON View is: upload or paste data, choose output variables, choose output format (HTML, text, or E-mail), and submit the analysis.
An overview of all five sections is provided at the top of the page in the Introduction section. Additionally, a synopsis of the Annotate tool is provided in the Blurb section at the top of the page.

4. At the top of the Annotate page (Fig. 7.4.2, panel A), the first task is to import the user’s data into the system. This is accomplished by first telling the system which import method will be used (section 1) and then either uploading a master matrix text file (section 2a) or pasting the contents of an input file into the Web page’s text box (section 2b).

The deciding factor between uploading and pasting should be the size of the data set to be annotated. In general, Internet browsers limit the amount of text that can be pasted into a Web page text box. One therefore needs to be cautious when entering data sets of more than a few hundred rows into the text box on the Annotate Web page, because data sets of this size or greater may be cropped by the text box without warning. The exact point at which the text box crops a data set depends on how much data is provided in each row of the input data set. Because of this cropping problem, it is recommended that users with data sets of more than a few hundred rows use the upload function in section 2a to import data; if the paste function is used, the user should check the very bottom of the pasted data set after it is in the text box in order to confirm that the entire data set has been pasted.

In the demonstration example, because of the size of figure2_combined_data.txt, the upload option is the method of choice for importing data. In section 1, the “I am going to upload my data using the file entry field below (Go to 2a).” radio button option is selected. In section 2a, the Browse button is clicked and the figure2_combined_data.txt file is selected from the appropriate directory.
Supplying an identifier in the text box in section 2a is optional but aids in identifying resulting files if they are E-mailed.

5. Once the data has been imported, the user then needs to choose the types of information with which the data set will be annotated (section 3; see Fig. 7.4.2, panel B).

Currently, information is available from four public database sources (see Background Information): UniGene, SWISS-PROT, Pfam (UNIT 2.5), and the Kyoto Encyclopedia of Genes and Genomes (KEGG). The different types of information derived from these databases represent various biological attributes of the gene, its encoded protein, the protein’s functional domains, the protein’s cellular functions, and the protein’s participation in cellular pathways (see Background Information).

Figure 7.4.2 The DRAGON Annotate page. (A) The user is allowed to input data into a dialog box, or a tab-delimited text file can be uploaded from a local file. (B) The user selects options, then sends a request for annotation to the DRAGON database. Results may be returned as an HTML table, as a tab-delimited text file (suitable for import into a spreadsheet such as Microsoft Excel), or as an E-mail.

The user chooses different types of data by simply selecting the check boxes to the left of each type of information desired. In the example, in section 3 of the annotation page (Fig. 7.4.2, panel B), the UniGene Cluster ID, Cytoband, LocusLink, UniGene Name, and SWISS-PROT Keywords options are checked.

6. Section 4 of the Annotation page (Fig. 7.4.2, panel B) allows the user to define certain criteria related to the format of the imported data set.
The most important criterion defined here is the column in which the GenBank accession numbers are located in the input data set because, as mentioned above, GenBank accession numbers are required for the proper functioning of the Annotate tool. For example, if the GenBank accession numbers are located in the second column of the data set, then a 2 would be entered into the “Column number containing GenBank numbers (with the farthest left column being 1):” text field. In addition, the user can specify what sort of delimiter is used in the input file (the default is a tab). If something other than a tab, such as a comma, were being used, then a “,” (without the quotes) would be entered in the “Text to use as field delimiter (assumed to be a tab “\t” if left blank):” text area; if a pipe character were being used, then a “|” (without quotes) would be entered. Finally, the newline character being used can be specified (the default recognizes either a Windows or a Linux/Unix newline character).

An important point about the choices made in section 4 of the Annotation page is that the different data sources used to annotate the input data set can be more or less limiting. For example, in any given data set, only a small number of genes will be annotated with information from the KEGG database, whereas a much larger set of genes will be annotated with information from the Pfam database. Because of this, it is often best to perform multiple annotations of a given data set, each time with a single type of information. For example, with the fibroblast data set from Iyer et al. (1999), three successive annotations are performed. First, the data set is annotated with all of the UniGene database information. A second copy of the data set is annotated with SWISS-PROT keywords, and a third copy of the data set is annotated with Pfam family names and accession numbers.
Performing successive annotations in this way, instead of one query including all three data types, prevents the loss of one type of information (e.g., Pfam numbers) when another type of information (e.g., SWISS-PROT keywords) is not available for a given gene.

Because the GenBank accession numbers are contained in column 3 of the figure2_combined_data.txt file, a 3 is placed in the “Column number containing GenBank numbers (with the farthest left column being 1):” text area in section 4.

7. Section 5 of the Annotation page (Fig. 7.4.2, panel B) allows the user to select two things: the desired output file format and how the annotated data are to be added to the input data set.

The three possible output file formats are HTML, text, and E-mail. There are distinct benefits to each type of output format. The HTML-based output allows the user to link out to additional information about both the original input data and the additional information provided by the annotation. The drawback of the HTML-based output is that it is useful only for smaller data sets. The text-based output is likewise useful only for smaller data sets, but provides the option of downloading the output file to one’s computer.

The E-mail output format is the best option for most data sets for several reasons. The first is that the user is able to receive data sets of any size using this output format. It is important to note that if a large data set is used with another output format, it is possible that the processing of the data will extend beyond the “time-out” period set by the Internet browser being used. If this occurs, the user will receive an error message from the browser stating something to the effect of “The page cannot be displayed,” and the user may think that there has been an error in the running of the Annotate tool.
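The column-number and delimiter conventions described in section 4 amount to a simple field lookup. The following sketch illustrates the convention; the function name and sample rows are invented for illustration.

```python
def extract_accession(row, column, delimiter="\t"):
    """Return the GenBank accession from a delimited data row,
    counting the farthest left column as 1, as DRAGON does."""
    fields = row.rstrip("\r\n").split(delimiter)
    return fields[column - 1]
```

With the demonstration file, a column value of 3 selects the accession; a pipe-delimited row would instead be parsed with `delimiter="|"`.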
This is not the case; the data processing has simply taken longer than the user’s browser is set to wait for a signal back from the Annotate tool.

A second reason why the E-mail output format is preferable in most cases is that multiple data sets can be entered into the Annotate tool in rapid succession without waiting for each analysis to finish.

In the example, the Output to E-mail Address option is selected in section 5, and an E-mail address (e.g., [email protected]) is entered into the text area.

8. In addition to the output format, the user can choose between the “All values on one line” and “Multiple rows, one value per row” options in section 5.

The user’s choice here determines, to a great extent, what can be done with the output data file, and is integrally related to the user’s intended purpose in using the Annotate tool. There are two primary uses for the Annotate tool. First, a user may simply want to know more about each of the genes on the gene list. For example, one might simply want to peruse the information while further studying one’s gene list, or one might be required to include this information in a publication or patent filing. Either way, one would want each row in the file to remain constant and simply have additional pieces of information added to the end of that gene’s row. If this is the user’s intent, then the “All values on one line” option should be selected. The second use of the Annotate tool is to analyze expression data in relation to the functional characteristics of the genes in the data set. This can be performed using the DRAGON View tools or by employing other gene expression data analysis software packages such as Partek or GeneSpring. To accomplish this, the “Multiple rows, one value per row” option should be selected.

In the example, the “Multiple rows, one value per row” option is selected from the drop-down menu.

9.
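The difference between the two output layouts can be pictured with a short sketch; the gene row and keywords are invented examples, not real annotation output.

```python
# One input gene row plus a multi-valued annotation (invented examples).
gene_row = ["AA000993", "2.0"]
keywords = ["Nuclear protein", "Calcium-binding"]

# "All values on one line": every annotation value is appended to the
# single row for the gene, keeping one row per gene.
one_line = gene_row + keywords

# "Multiple rows, one value per row": the gene row is duplicated once
# per annotation value, which suits sorting by keyword downstream.
multi_row = [gene_row + [kw] for kw in keywords]
```

The first layout preserves the input’s one-row-per-gene shape; the second trades duplication for the ability to sort the file on the annotation column.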
After the entry of each input data set, the user presses “Submit Gene List” and instantly receives a message stating, “Your data is being processed, and will be mailed to [email protected].” As soon as this message is received, the user can go back and enter another data set for processing. All of the output files will be E-mailed to the address supplied by the user as soon as they become available.

10. Open and inspect results (see Guidelines for Understanding Results).

In the example used for this unit, the results of the annotation of the figure2_combined_data.txt file would be received in the E-mail inbox of the address specified. The output data file would be received as an attachment to the E-mail message with the subject line DRAGONOutput. The message must be opened and allowed to fully load. It is important to allow the message to fully load into the E-mail client; otherwise the attached file can sometimes be truncated. Different E-mail clients provide different methods for downloading and saving E-mail attachments to the computer. For this demonstration, the attached output file would be clicked with the right mouse button on a Windows computer. Right-clicking on the file opens a directory-browsing window that allows the user to choose where to save the file on the computer’s hard drive. In subsequent discussion (see Support Protocol), this downloaded file is named figure2_combined_data_KWS.txt, where KWS stands for “Keywords.”

11. Analyze results with DRAGON Families (see Support Protocol).

ANALYZING DATA WITH THE DRAGON Families TOOL

This protocol and accompanying demonstration focus on the use of the DRAGON Families tool. Refer to the Learn page of the DRAGON Web site (http://pevsnerlab.kennedykrieger.org/learn.htm) or the paper describing DRAGON View (Bouton and Pevsner, 2002) for more information on the other DRAGON View tools (also see Background Information).
SUPPORT PROTOCOL

Necessary Resources

Hardware

Windows, Linux, Unix, or Macintosh computer with an Internet connection (preferably a broadband connection, e.g., T1, T3, cable, or DSL service)

Software

Internet browser: e.g., MS Internet Explorer 5 (or higher) or Netscape 6 (or higher) on Windows or Macintosh systems; Opera, Netscape 6 (or higher), or Mozilla on Linux-based systems. Internet Explorer 5 or higher and Netscape 6 or higher are preferred, because Netscape 4.x is not capable of supporting all of the functionality provided in the DRAGON Paths tool. Also required: Spreadsheet program: e.g., MS Excel on Windows or Macintosh systems or Sun Microsystems Star Office suite on Linux systems. Text editor: e.g., TextPad (http://www.textpad.com/) or Notepad on Windows systems; XEmacs (http://www.xemacs.org) on Linux systems.

Files

An annotated master matrix file created by running the DRAGON Annotate tool (figure2_combined_data_KWS.txt; see Basic Protocol)

Prepare and format the data

1. Generate an annotation file by using the DRAGON Annotate tool (see Basic Protocol).

In the example used in this unit, the annotation file is named figure2_combined_data_KWS.txt.

2. Before submitting the figure2_combined_data_KWS.txt file to the DRAGON Families tool, the tab-delimited file must be converted into a comma-delimited file.

a. The tab-delimited file figure2_combined_data_KWS.txt is opened in MS Excel.
b. The File menu option is selected, then Save As... is selected.
c. In the “Save as type:” drop-down menu at the bottom of the Save As window, “CSV (Comma delimited) (*.csv)” is selected.
d. The file figure2_combined_data_KWS.csv is saved to the same directory as figure2_combined_data_KWS.txt.

This process simply replaces the tab delimiters in the file with comma delimiters.

3. It is then necessary to remove any commas that may already exist in the file.
In MS Excel, the entire data set is selected by pressing Ctrl+A. The Replace function is then opened by pressing Ctrl+H (Replace can also be selected from the Edit menu). A comma is typed in the “Find what:” box and nothing is typed in the “Replace with:” box. The Replace All button is then clicked. This deletes all commas from the data in the file, such as those that might be present in gene names or SWISS-PROT keywords.

4. Finally, a new file is created for each time point in the file.

In the example here, each file contains four columns: the GenBank accession numbers, the gene names, the SWISS-PROT keywords, and the expression values for that time point. For each file, these four columns are selected in the figure2_combined_data_KWS.csv file by holding down the Ctrl key and clicking at the top of each column. Then, a new spreadsheet is opened with Ctrl+N (or using the File menu), and the four columns are pasted into the new spreadsheet with Ctrl+V (or using the Edit menu).

Figure 7.4.3 The DRAGON Families page.

5. In each new file, the expression values need to be the first column in the file.

This is accomplished by selecting the expression values column in each file, cutting it by pressing Ctrl+X (or using the Edit menu), and pasting it into the first column by holding down the right mouse button over the first column and selecting Insert Cut Cells.

6. Each of these new files is saved as a .csv file in the same directory as the figure2_combined_data_KWS.csv file.

Multiple .csv files are saved using this method, and each is named for the time point data it contains (in the example here, 15mins.csv, 1hr.csv, 6hrs.csv, 24hrs.csv).

Run DRAGON Families

7. Start DRAGON Families by opening the main DRAGON page and selecting DRAGON Families from the links at the top of the page.
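For users who prefer scripting to Excel, the reformatting in steps 2 through 5 can be approximated with Python’s built-in csv module. The sample row below follows the four-column demonstration layout (accession, gene name, keyword, expression value) but its contents are invented.

```python
import csv
import io

# One tab-delimited row: accession, gene name, keyword, expression.
tab_data = "AA000993\tcyclin B1, G2/M-specific\tCell cycle\t2.0\n"

rows = []
for fields in csv.reader(io.StringIO(tab_data), delimiter="\t"):
    fields = [f.replace(",", "") for f in fields]  # step 3: strip commas
    fields.insert(0, fields.pop(3))                # step 5: ratio first
    rows.append(fields)

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)  # step 2: CSV output
```

Stripping embedded commas before writing mirrors the Excel Replace All step, so the resulting comma-delimited fields cannot be mis-split downstream.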
To get a sense of what the input data should look like, one or both of the example files on the page can be viewed. These files have a format similar to that of figure2_combined_data_KWS.csv as it would appear if opened in a text editor such as TextPad (see “Software” above) instead of MS Excel.

8. The DRAGON Families site (Fig. 7.4.3) is designed in a manner similar to the DRAGON Annotate site.

As described for the Annotate page (see Basic Protocol), the flow of data entry on the site is guided through sections of the page, with a specific task accomplished in each section. For this demonstration, the upload option is selected in section 1.

Figure 7.4.4 As the final step in the analysis of the demonstration data, each time point contained in the Iyer et al. (1999) data set, after having been associated with SWISS-PROT keyword information by DRAGON Annotate, is analyzed using the DRAGON Families tool. The most coordinately up-regulated gene families are shown here for three time points (15 min, 6 hr, and 24 hr). Each gene is represented in its corresponding family as a box that is clickable and hyperlinked to the NCBI LocusLink entry for that gene. Across each row, all the boxes correspond to genes in a given family. Each box is also color-coded on a scale from red (up-regulated) to green (down-regulated). A scale at the top of the analysis page (not shown) gives the association of colors with ratio values. For all the functional families that are annotated, the program returns the families ranked in order according to the average ratio expression value for all of the genes in that group. Note that overall there is less differential regulation occurring at the 15-min time point, since there are no bright red squares present. By 6 hr certain gene families, particularly those associated with inflammatory responses, are coordinately up-regulated.
Finally, by 24 hr, cell cycle and mitotic gene families are coordinately differentially regulated, indicating that the cells are progressing through the cell cycle. This black-and-white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure, go to http://currentprotocols.com/colorfigures.

9. In section 2a (Fig. 7.4.3), the Browse button is used to find and select the first time point .csv file, 15mins.csv. The name 15mins is typed into the text area below the Browse button.

10. In section 3 (see Fig. 7.4.3), various parameters describing the data set can be selected. The Red/Green Output, Human, and SWISS-PROT Keywords options are selected for this file.

11. In section 4, the columns containing the required data are defined as 1 for the expression data, 3 for the GenBank accession numbers, and 4 for the Type values, which in this case are SWISS-PROT keywords.

12. The text areas in section 5 are left blank, thereby accepting the default variables.

If another delimiter, such as a pipe (|), is used, it can be entered in the delimiters text area in order to replace the default comma (,) delimiter. Similarly, if another type of newline character is used, it can be entered in the newline text area.

The Analyze Data button is clicked. A new page appears showing the results of the DRAGON Families analysis.

Save results

13. As the first step in saving the page, especially if a DRAGON Families output page is going to be used often, it is a good idea to save the page as an HTML file on the local computer’s hard drive.

This is accomplished by selecting the File menu option in the Internet browser and choosing Save As. Some browsers will warn that the page may not be displayed properly after being saved. Click OK to dismiss this warning.
The DRAGON Families HTML output pages have been designed so that, even if viewed locally on the user’s computer, they will still display correctly as long as the computer is connected to the Internet at the time.

14. As a second step in saving the page, with the browser window selected (select a window by simply clicking on any portion of it), press Alt+Print Screen to capture a copy of the image. This image can then be pasted, by pressing Ctrl+V, into MS PowerPoint for presentation and publication purposes. Figure 7.4.4 displays such an image.

15. Once the output HTML page has been saved and any desired images of the analysis have been captured and stored in another software system (e.g., MS PowerPoint), click the Back button in the Internet browser to go back to the data input page; all of the user’s settings remain selected. A second file can be input by clicking the Browse button again and selecting the next desired file. This process of input, analysis, and output storage can be repeated as many times as desired.

GUIDELINES FOR UNDERSTANDING RESULTS

DRAGON Annotate Results

The general structure of an annotated text output file is simple. All of the original data provided in the user’s input data set are preserved, and additional information derived from the annotation process is added to the far right-hand side of the data set as additional columns in the file. The delimiter that the user has defined for the data set is used to delimit the newly added information. The only major difference that may occur in the output file is dictated by whether the user selected the “All values on one line” option or the “Multiple rows, one value per row” option (Fig. 7.4.2, panel B). If the former option is selected, then any information provided will be added on the same row as all of the input gene’s information. However, if the latter option is selected, then the user may note a significant difference in the structure of the output file.
Specifically, if any of the types of information with which one has chosen to annotate contains more than one value for a given gene, then the information for that gene in the input data set will be duplicated on as many rows as there are values for that gene, with each value added to the end of one row. For example, multiple SWISS-PROT keywords and/or Pfam families can be associated with a gene and its encoded protein. If the “Multiple rows, one value per row” option is chosen when annotating with either of these types of information, then in the output file the gene’s input data will be duplicated on multiple rows with a new SWISS-PROT keyword or Pfam family number at the end of each row.

This type of output allows the user to view data in reference to the biological characteristics of the genes and the encoded proteins in the data set. For example, once downloaded to the user’s computer, the output file can be opened in a spreadsheet program such as MS Excel, and the column containing the newly annotated, biologically relevant information can be used to sort the entire data set. The result is that the expression data present in the input data set are now categorized according to biological function instead of individual genes. As a result, the expression patterns of sets of genes related by functional properties, such as the SWISS-PROT keywords “Nuclear protein,” “Calcium-binding,” or “Proteoglycan,” can be rapidly identified and analyzed for coordinate regulation or other interesting properties.

This type of analysis is difficult to perform without visualization tools that make use of the annotated information to define categories of genes related by certain properties (e.g., shared keywords, functional protein domain classifications, and chromosomal localization). Some expression data software packages allow for this type of categorical analysis.
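The spreadsheet sort described above is straightforward to reproduce in a script. This minimal sketch sorts invented “one value per row” output on its annotation column so functionally related genes group together.

```python
# (accession, ratio, keyword) rows from a "one value per row" output
# file; all values are invented for illustration.
rows = [
    ("AA000993", 2.0, "Nuclear protein"),
    ("AA001011", 0.4, "Proteoglycan"),
    ("AA001082", 1.8, "Nuclear protein"),
]

# Sorting on the annotation column groups functionally related genes,
# so their expression values can be scanned for coordinate regulation.
by_keyword = sorted(rows, key=lambda r: r[2])
```

After the sort, all “Nuclear protein” rows are adjacent, making coordinate up- or down-regulation within a keyword easy to spot.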
For example, the Partek Pro software package (http://www.partek.com) allows for the color-coding of genes and experiments according to categorical information that is incorporated into clustering and principal component analysis (PCA) views of the data. However, in an effort to make these types of tools more readily accessible to the users of DRAGON, the set of DRAGON View tools has been developed. The design and implementation of the DRAGON View tools is ongoing, and upgrades will be documented in updates to this chapter as well as on the DRAGON and DRAGON View Web sites.

DRAGON Families Results

Each of the time points in the Iyer et al. (1999) data set was analyzed in DRAGON Families as described above. The most up-regulated gene families identified by DRAGON Families for three of these time points are shown in Figure 7.4.4. The lack of any bright red squares in the 15-min data corresponds with the early phase of the experiment. Interestingly, even at this early stage, three families can be noted that are more dramatically up-regulated later in the experiment: the “Inflammatory Response,” “Chemotaxis,” and “Cytokine” families. As would be expected, the predominant gene families coordinately regulated at the 6-hr time point relate to the inflammatory response; these were, again, the “Inflammatory Response,” “Chemotaxis,” and “Cytokine” families. This result agrees well with the findings of Iyer et al. (1999). Additional families identified at the 6-hr time point also agree with what was reported in the original paper (see Figures 4 and 5 in Iyer et al., 1999). These include angiogenesis (“Angiogenesis”) and blood coagulation (“Blood Coagulation”) families. Finally, at the 24-hr time point it is apparent that cell cycle and proliferation mechanisms are at work. Genes in these families include members of the “Cyclins,” “Mitosis,” “DNA Repair,” and “Cell Division” families.
This result also agrees well with what was reported in the original paper. The key point associated with these findings is that results similar to those of the original Iyer et al. (1999) study were obtained even though an alternative, but complementary, approach was used to derive them. Instead of starting with hierarchical clustering methods and then manually searching the clustered genes for similar functionality, all genes in the data set were first annotated with functional attributes and were then analyzed for coordinate function. Had this type of analysis been run early in the original fibroblast study, the investigators would have very rapidly gained a sense of the types of biological processes occurring in their data. This early knowledge could have informed their further, in-depth analysis of the clustering of expression profiles. These two methods are not mutually exclusive; instead, they are complementary, and using both provides a more rapid, comprehensive understanding of the biological patterns in a large-scale expression data set.

COMMENTARY

Background Information

Why use DRAGON and the DRAGON Annotate tool?

Researchers conducting large-scale gene expression research using technologies such as Serial Analysis of Gene Expression (SAGE; Velculescu et al., 1995) and microarrays (Bowtell, 1999; Cheung et al., 1999; Duggan et al., 1999; Lipshutz et al., 1999) often find themselves wanting to rapidly and simultaneously acquire information relating to the accession numbers, biological characteristics, and other attributes of large numbers of gene sequences. Examples of such situations might be:

1. Before starting a microarray experiment, an investigator would like to compare different microarray technologies in order to assess which platform best represents a functional class of genes of interest.

2.
As part of the analysis of a large-scale gene expression experiment, an investigator hypothesizes that genes in a particular chromosomal region or of a given functional class should be differentially regulated.

3. An investigator may want to acquire the most up-to-date information in the public databases regarding the genes on their microarray platform.

In these instances, the ability to “click” through numerous Web sites and copy and paste information for single genes into a spreadsheet is not helpful when one has to do the same thing for thousands of genes. This can take tens to hundreds of hours, and is unnecessary given the availability of computational methods such as those provided by the DRAGON Annotate tool (see Basic Protocol). The DRAGON Annotate tool associates biologically relevant information derived from numerous public databases with gene expression information from microarray experiments. The subsequent analysis process includes the association of relevant information with microarray data and the visualization of microarray data in the context of associated biological characteristics.

To illustrate the use of DRAGON, the authors of this unit have applied it to a microarray data set available via the Web. During the analysis of this data set, visual analysis methods were used to examine the correlation between gene expression patterns and biological characteristics such as membership in protein families and description by keywords. Results obtained for the demonstration data set using the DRAGON and DRAGON View approaches closely matched those reported in the original study that generated the data, and suggest that these methods are complementary to the exploratory statistical approaches typically employed when examining large-scale gene expression data.
By integrating biologically relevant information with the analysis of large-scale expression data, certain types of gene expression phenomena can be discerned more easily and examined in light of the experimental paradigm being tested. A comprehensive definition of the biological data regarding each gene on a microarray list, through the interconnection of as many public databases as possible, is an eventual goal in the development of DRAGON and DRAGON View. DRAGON would then be able to supply a multidimensional network of information related to the expression patterns and biological characteristics of all genes on a microarray. This growth of DRAGON is dependent upon the continued integration of public databases (Frishman et al., 1998). Along with the question of database integration comes the crucial matter of data integrity within and across databases (Macauley et al., 1998).

Utility of DRAGON Families and other DRAGON tools

One of the first questions that might be asked when considering the relationship between functional relatedness and expression patterns is whether genes that are functionally related are also coordinately regulated. Often this type of question is addressed using descriptive or exploratory statistical tools. A variety of methods can be applied to the entire set of gene expression data to describe expression patterns or signatures within the data, including K-means and hierarchical clustering algorithms (Michaels et al., 1998; Wen et al., 1998), principal component analysis, genetic network analysis (Liang et al., 1998; Somogyi et al., 1997; Szallasi, 1999; UNIT 7.3), and self-organizing maps (Tamayo et al., 1999; Toronen et al., 1999; UNIT 7.3). These methods identify similarity in the expression patterns of groups or clusters of genes across time or sample. By grouping genes according to their expression patterns, investigators can then attempt to draw inferences about the functional similarity of coordinately regulated genes.
Previous studies have found that characteristics such as promoter elements, transcription factors, chromosomal loci, and the cellular functions of encoded proteins are associated with the coordinate expression of genes (Chu et al., 1998; Eisen et al., 1998; Gawantka et al., 1998; Heyer et al., 1999; Zhang, 1999; Spellman and Rubin, 2002). Others have used this assumption to test the effectiveness of various clustering methods (Gibbons and Roth, 2002).

Starting with exploratory statistical methods and attempting to identify shared function is a powerful approach with many benefits. However, using large-scale annotation systems such as DRAGON Annotate, expression data can now be explored from the opposite direction. Instead of starting with data and inferring function, the investigator can start with known shared function and test for coordinate expression patterns. Neither method obviates the need for the other; rather, these two methods provide complementary analyses of a data set which, when paired, give deeper insight into the significant biological findings of the experimental system being examined. Rather than starting with expression data and inferring similar biological characteristics through clustering, annotation with DRAGON and subsequent analysis with the DRAGON View tools allow the investigator to start with biological characteristics in order to identify which of those characteristics are associated with coordinate gene expression. This approach to analyzing expression data is not typically used, because the task of understanding the biological characteristics of the thousands of genes typically presented in a microarray data set is usually left to the investigator’s knowledge of the system in question, literature searches, and the tedious process of researching individual genes in public databases via the World Wide Web.
As discussed, the DRAGON Annotate tool (see Basic Protocol) solves this problem, thereby making functional class–based gene expression analysis possible. Currently, the primary tool with which to perform this type of analysis is DRAGON Families (see Support Protocol), found on the DRAGON View Web site. The DRAGON Families tool sorts several hundred functional groups of genes to reveal families that have been coordinately up-regulated or down-regulated. DRAGON Families represents each gene in its corresponding family as a box that is clickable and hyperlinked to the NCBI LocusLink entry for that gene. Across each row, all the boxes correspond to genes in a given family. Each box is also color coded on a scale from red (up-regulated) to green (down-regulated). Furthermore, for all the functional families that are annotated, the program returns the families ranked in order according to the average ratio expression value for all of the genes in that group (Fig. 7.4.5, panel A, values in parentheses). Two other tools are currently available for use in DRAGON View. The DRAGON Order tool is similar to DRAGON Families in that it visualizes a user's gene expression data as sorted into functional groups (Fig. 7.4.5, panel B). However, DRAGON Order automatically presorts data based on ratio expression values. For each functional group, the tool generates a series of bars (vertical lines), each of which represents a protein in that functional group. The position of the vertical bar indicates the extent to which that gene is up- or down-regulated. An equal distribution of vertical lines across the whole row means that there is no significant coexpression of a set of genes in that group. However, clusters of lines at either the far left or the far right of any given row are potentially interesting because they indicate that a set of related genes are all up- or down-regulated.
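The family ranking performed by DRAGON Families can be approximated as follows. The gene names, family labels, and ratio values are invented for illustration, and the real tool's handling of scale, color binning, and ties may differ.

```python
from collections import defaultdict

def rank_families(gene_ratios, gene_families):
    """Average the expression ratio of every annotated family and rank
    families from most up- to most down-regulated, mimicking the
    ordering shown by DRAGON Families (a sketch, not the real tool)."""
    by_family = defaultdict(list)
    for gene, ratio in gene_ratios.items():
        for fam in gene_families.get(gene, []):
            by_family[fam].append(ratio)
    # mean ratio per family, then sort most up-regulated first
    means = {fam: sum(r) / len(r) for fam, r in by_family.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

# invented example gene list with two annotated families
gene_ratios = {"MMP1": 2.1, "MMP3": 1.7, "COL1A1": -1.2, "COL3A1": -0.8}
gene_families = {"MMP1": ["peptidase"], "MMP3": ["peptidase"],
                 "COL1A1": ["collagen"], "COL3A1": ["collagen"]}
ranked = rank_families(gene_ratios, gene_families)
```

The sorted output corresponds to the parenthesized average ratio values shown in Fig. 7.4.5, panel A.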
This kind of information would be difficult to detect by manual inspection of microarray data sets. The DRAGON Paths tool relies on cellular pathway diagrams downloaded by file transfer protocol from the KEGG database (Kanehisa and Goto, 2000; Kanehisa et al., 2002). DRAGON Paths maps gene expression values onto cellular pathway diagrams (Fig. 7.4.5, panel C). By viewing the expression levels derived from microarray data within the context of cellular pathways, the user is able to detect patterns of expression as they relate to networks of genes associated by cellular pathways.

DRAGON and DRAGON View architecture

The general structure of DRAGON and DRAGON View is that of a relational database (UNIT 9.1) accessed via the Internet through Common Gateway Interface (CGI) scripts written in the Perl programming language (http://www.perl.com). The CGI scripts handle user requests and take care of the updating and management of the data contained in the database (Fig. 7.4.6). The database acts as a repository of the information collected from the public databases and provides rapid, flexible access to this information. Rather than using BLAST (http://www.ncbi.nlm.nih.gov/BLAST/; UNITS 3.3 & 3.4), Blat (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start), or other sequence similarity searching methods to associate the user's input gene list with other database information, all annotation occurs using the GenBank accession numbers provided in the user's gene lists. These accession numbers are joined with other accession numbers via association tables provided by the public databases (Fig. 7.4.7). For example, the Pfam database (http://www.sanger.ac.uk/Software/Pfam/; UNIT 2.5) provides a table containing a list of every SWISS-PROT number that is contained in a Pfam family. The SWISS-PROT database (http://www.expasy.ch/sprot/) provides a table of GenBank accession numbers that are associated with a given SWISS-PROT accession number.
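The association-table joins described above can be sketched with in-memory dictionaries. All accession numbers below are placeholders, not real GenBank, SWISS-PROT, or Pfam identifiers.

```python
def annotate(genbank_ids, genbank_to_sprot, sprot_to_pfam):
    """Join a user's GenBank accessions to Pfam families through
    SWISS-PROT accessions, in the spirit of DRAGON's table joins.
    The table contents are illustrative, not real accessions."""
    annotation = {}
    for gb in genbank_ids:
        sp = genbank_to_sprot.get(gb)          # GenBank -> SWISS-PROT
        annotation[gb] = {"swissprot": sp,
                          "pfam": sprot_to_pfam.get(sp, [])}  # SWISS-PROT -> Pfam
    return annotation

# placeholder association tables, as if parsed from database dump files
genbank_to_sprot = {"AF000001": "P00001", "AF000002": "P00002"}
sprot_to_pfam = {"P00001": ["PF00001"], "P00002": []}
result = annotate(["AF000001", "AF000002", "AF000003"],
                  genbank_to_sprot, sprot_to_pfam)
```

As the text notes, this approach is only as good as the association tables: the third accession, absent from the tables, simply comes back unannotated rather than being resolved by sequence similarity.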
Thus, the correct combination of these tables allows for the association of the user's GenBank accession numbers with SWISS-PROT accession numbers and any associated Pfam family numbers and names. While dependent on the completeness and accuracy of the association tables provided by the public databases, this method provides for faster, more efficient annotation. Furthermore, by annotating a gene list with various types of accession numbers, a spectrum of information concerning the biological characteristics of the genes, their associated proteins, and the proteins' participation in cellular pathways can be gained (Fig. 7.4.8).

Figure 7.4.5 Examples of the graphical outputs of the three types of DRAGON View tools. (A) DRAGON Families produces rows of green (down-regulated), red (up-regulated), and gray (unchanged) boxes (see scale for the range of ratio values represented by each color). Each box represents one gene and is hyperlinked to its corresponding UniGene entry. Each row has a type identifier to its right that is hyperlinked to its description. To the far right is the average ratio expression value for all of the genes in that family. All rows are sorted from the most up-regulated family to the most down-regulated family. (B) DRAGON Order produces rows of black lines. Each line represents one gene, and its location in the row represents its position on a gene list sorted by ratio expression values. Lines at the far left represent the most up-regulated genes (+) and lines at the far right represent the most down-regulated (−). Each row's type (e.g., SWISS-PROT keywords) is listed to the right. (C) DRAGON Paths maps the location and ratio expression value of genes from the submitted gene list onto KEGG cellular pathway diagrams. A green (down-regulated), red (up-regulated), or gray (unchanged) circle followed by the ratio expression value is mapped to the upper left corner of each corresponding protein box. Each protein box is hyperlinked to its corresponding LocusLink entry. For the full-color version of this figure, go to http://currentprotocols.com/colorfigures.

Figure 7.4.6 Database architecture for DRAGON. The data contained in the DRAGON database is derived from Web-accessible databases (UniGene, SWISS-PROT, Pfam, and KEGG) that are downloaded by FTP, parsed using automated Perl scripts (parseunigene.pl, parseswissprot.pl, parsepfam.pl, and parsekegg.pl), and stored in tables in the MySQL relational database management system. The DRAGON database is housed on a Dell PowerEdge 6300 dual-processor server running Red Hat Linux 6.2, the Apache Web server, and Perl 5.6. The front end consists of a Web site that is searched using Perl CGI scripts (querytable.cgi, query.cgi) to allow for user-defined queries of the database via the annotate and search pages.

Querying DRAGON

GenBank accession numbers are used as the sole type of input accession number for the DRAGON Annotate tool for several reasons. First, GenBank accession numbers are the most common type of accession number provided with microarray gene lists. Second, GenBank accession numbers are not retired or drastically changed over time like some other types of accession numbers. Finally, although GenBank numbers represent sequence fragments, these fragments are collected with other fragments into clusters by the NCBI's UniGene database (http://www.ncbi.nlm.nih.gov/UniGene/) or the TIGR Gene Index (http://www.tigr.org/tdb/tgi/hgi/) in order to identify their representation of genes. All of these reasons make GenBank accession numbers among the best types of accession numbers to use for input into the DRAGON Annotate tool. Work is currently ongoing that would allow for the use of other types of input accession numbers (i.e., TIGR accession numbers or LocusLink accession numbers).

Figure 7.4.7 Overview of the information in DRAGON. This diagram represents a subset of the tables now available in DRAGON and the possible connections between them. Depending upon what type of information is desired, different sets of tables are joined with the table containing microarray gene expression data, represented in this diagram by, for example, "Incyte Array Data" and "Incyte Numbers". Two "UniGene Human Numbers" tables are used to expand the "GenBank #s" from the "Incyte Numbers" table into all "GenBank #s" associated with each "UniGene ID", thereby providing a bridge between "GenBank #s" from the "Incyte Numbers" table and the "Swissprot Numbers", "TrEMBL Numbers", "Transfac Factors", and "Transfac Sites" tables. Further characterization of the proteins that genes from the microarray encode occurs by joining with tables derived from the SWISS-PROT, Pfam, Interpro, and OMIM databases.

Master matrix file

A number of the more widely used expression analysis software packages, such as GeneSpring (http://www.silicongenetics.com/cgi/SiG.cgi/index.smf), Partek Pro (http://www.partek.com), and Cluster/Treeview (http://rana.lbl.gov/EisenSoftware.htm; UNIT 6.2), use master matrix files as one of their primary formats for data importing.
Because of their ease of use, simple integration with existing expression data analysis software packages, and human-readable nature, master matrix text files are relied on exclusively for data importing and analysis by DRAGON and DRAGON View. An overview of the structure of a master matrix file will aid in understanding how to use DRAGON and DRAGON View more effectively. Specifically, there are a few important attributes of a master matrix file that make it useful during expression data analysis. First, this type of file can be thought of as a "master" file because it usually contains all of the data available for a given experiment. For example, in the fibroblast data set used for demonstration purposes in this unit, the fig2clusterdata.txt file can be considered a master matrix file because it contains data from all of the time points monitored in the Iyer et al. (1999) experiment. Second, these are "matrix" files because they contain data from experimental conditions (i.e., time points, disease versus control, treated versus untreated) as columns in the file, and from gene sequences represented on the microarray as rows in the file. As a side note, each row in a master matrix file should contain a unique id that can be used to identify all of the data in that row as belonging to a given sequence. This unique id is particularly important when multiple elements on a given microarray represent different sequence fragments derived from the same gene. For example, in the fig2clusterdata.txt file, the unique clone id's ("UNIQID") are used to identify each gene sequence. Alternatively, in Affymetrix GeneChip data sets, there are often numerous element sets on a GeneChip that represent different portions of the same gene (see http://www.affymetrix.com for an in-depth discussion of the structure of Affymetrix GeneChips).

Figure 7.4.8 DRAGON uses accession numbers to define biological characteristics of genes and proteins. A microarray is a regular array of thousands of unique cDNAs or oligonucleotides spotted on a solid support. Each spot contains cDNA corresponding to a specific gene that encodes a protein. Accession numbers derived from publicly available databases provide information about the biological characteristics of both the gene and its corresponding protein. At the gene level, "Transfac Site" and "Transfac Factor" numbers indicate the presence of promoter regions on the gene and the factors that bind to those promoter regions, respectively. The "GenBank no." and "UniGene ID" refer to EST sequences corresponding to fragments of the gene and a cluster of those EST sequences, respectively. The "UniGene Cytoband" indicates the chromosomal location of the gene. The "UniGene Name" is the name of the gene. The "OMIM no." indicates whether the gene is known to be involved in any human diseases. At the protein level, "Pfam no." and "Interpro no." indicate which functional domains the protein contains. The "SWISS-PROT no." is a unique identifier for the protein and can be derived from either the SWISS-PROT or TrEMBL databases. "SWISS-PROT Keywords" are derived from a controlled vocabulary of 827 words that are assigned to proteins in the SWISS-PROT database according to their function(s). "SWISS-PROT Sequence" is the amino acid sequence for the protein. "SWISS-PROT Name" is the SWISS-PROT database name for the protein. The "KEGG Pathway no." indicates the protein's involvement in cellular pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG).
Primarily, this redundancy is designed into the chip in order to provide internal controls for the expression data measured by the chip. One would most often expect to see separate element sets representing the same gene displaying similar expression levels and profiles (unless, of course, something like the differential regulation of alternative splice forms is occurring). In order to identify the unique sequences representing a given gene, Affymetrix provides a set of unique identifiers that consist of a GenBank or RefSeq accession number followed by a series of underscores and letters, such as AA123345_i_at. This method of unique sequence identification achieves two important goals: first, the GenBank accession number is provided, allowing for a link to the public databases; second, the underscored tag at the end maintains the uniqueness of each sequence representing a given gene. Finally, master matrix files are "text flat files" because they are normally stored as either comma-delimited files, in which commas are used to indicate column boundaries, or tab-delimited files, in which tabs are used to indicate column boundaries. Such text files that do not have any relational structure are referred to as "flat." In other words, they are not stored as multiple tables in a relational database such as MySQL (http://www.mysql.com; UNIT 9.2), Oracle (http://www.oracle.com), or MS Access (http://www.microsoft.com/office/access/). Instead, all of the data is contained in one simple text file.

Further research and development for DRAGON and DRAGON View

Development of DRAGON and DRAGON View is an ongoing process at many levels. There are numerous "bug fixes" that are constantly being addressed. In addition, novel tools and methods are being developed that will allow for additional methods of annotation and analysis.
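A master matrix flat file of the kind described above can be read with a short script. The assumed layout (unique id in the first column, expression values in the remaining columns) mirrors the fig2clusterdata.txt convention, but real files may interleave other columns, so treat this as a sketch.

```python
import csv
import io

def read_master_matrix(text, id_col=0):
    """Read a tab-delimited master matrix flat file into a dictionary
    mapping each unique id to its list of expression values.
    Assumes the unique id is in column id_col and all remaining
    columns are numeric (an illustrative simplification)."""
    matrix = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        uid = row[id_col]
        if uid in matrix:
            # duplicate ids would silently merge distinct sequences
            raise ValueError(f"duplicate id: {uid}")
        matrix[uid] = [float(v) for v in row[1:]]
    return matrix

# invented two-row, three-time-point matrix
sample = "CLONE_1\t0.5\t1.2\t2.0\nCLONE_2\t-0.3\t-1.1\t-1.8\n"
matrix = read_master_matrix(sample)
```

Keying the dictionary on the unique id is what preserves the distinction between multiple array elements derived from the same gene, as discussed above.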
For example, the integration of additional databases, including the Gene Ontology database (GO; http://www.geneontology.org; UNIT 7.2), Interpro (http://www.ebi.ac.uk/interpro/), the International Protein Index (IPI; http://www.ensembl.org/IPI/), the Ensembl genome databases (http://www.ensembl.org), the TRANSFAC database (http://www.cbi.pku.edu.cn/TRANSFAC/), and the Database of Interacting Proteins (http://dip.doe-mbi.ucla.edu/), along with other databases, is planned for the DRAGON Annotate tool. In addition, a system allowing for the use of other types of input accession numbers (e.g., those of TIGR, LocusLink, and SWISS-PROT) is being developed. Numerous additions and upgrades are also planned for the DRAGON View tools. For example, a batch import mode is required for most of the tools so that users do not have to break their master matrix files up into numerous data input files, as had to be done for this demonstration. Additionally, as a complement to the visual methods already in use, quantitative statistical methods are being developed to allow for the identification of the statistical significance of the coordinate differential regulation of annotated gene families. As discussed further below (see Critical Parameters and Troubleshooting), proteomic technologies—such as antibody microarrays (Lal et al., 2002), high-throughput mass spectrometry methods, fluorescent differential 2-D PAGE methods, and transfected cell microarrays (Bailey et al., 2002)—which allow for the measurement of large-scale protein expression patterns, will eventually obviate the need for the measurement of gene expression patterns. The types of methods and analyses that can be carried out using DRAGON and DRAGON View are just as applicable to large-scale protein expression analysis as they are to large-scale gene expression analysis.
In fact, many of the types of information annotated by DRAGON (e.g., Pfam functional domains, cellular functions, and cellular pathway participation) are more directly related to proteins than they are to their encoding genes. The primary difference, were the DRAGON and DRAGON View tools to be used with proteomics data, would simply be the type of accession numbers used in the input data set. Given this, future developments of DRAGON will allow for the use of protein accession numbers, such as SWISS-PROT accession numbers (http://www.expasy.ch/sprot/) and International Protein Index accession numbers (http://www.ebi.ac.uk/IPI/IPIhelp.html). The DRAGON and DRAGON View tools were originally developed to help answer a simple question for which there was no good method of investigation. In keeping with the intent of their original development, it is the hope of the authors that the methods provided by the DRAGON and DRAGON View tools will continue to facilitate novel types of research and analysis concerning large-scale gene and, eventually, protein expression data.

Critical Parameters and Troubleshooting

DRAGON and DRAGON View

Often, errors with the DRAGON and DRAGON View tools are due to formatting issues in the input data text file. The critical points to remember with formatting are: (1) what type of delimiter is being used (i.e., comma or tab) and (2) where the critical information is in the file (i.e., which columns contain GenBank accession numbers and other data of interest). An important point about delimiters is that the input data file will be read incorrectly if the character being used as the delimiter is found anywhere else in the input file besides the separations between columns. For example, in the demonstration described in the Support Protocol, all of the commas in the figure2_combined_data.csv file were deleted (replaced with nothing).
This was done because commas in gene names and type information (e.g., Pfam family names, SWISS-PROT keywords) would be read as delimiters when importing a .csv file into the DRAGON or DRAGON View tools. In order to check the validity of the information provided by the DRAGON Annotate tool, users can perform one of several quality control measures. First, random genes from the annotated gene set can be selected and searched for on the Web sites that were used in the annotation process. For example, if a gene was annotated with UniGene and SWISS-PROT information, then the user can search for the gene on those two Web sites in order to confirm the information derived from DRAGON Annotate. Alternatively, there are other large-scale annotation tools available that can be used instead of, or in addition to, DRAGON Annotate. These include TIGR Manatee (http://manatee.sourceforge.net/), Resourcerer (http://pga.tigr.org/tigr-scripts/magic/r1.pl), and Affymetrix NetAffx (http://www.affymetrix.com/analysis/index.affx). These tools have strengths and limitations relative to the DRAGON Annotate tool. For example, the Affymetrix NetAffx Web site provides annotation for all of the Affymetrix GeneChips. However, the limitation of the site is that the user is only allowed to annotate 500 genes at a time and can only annotate genes that are on one of the Affymetrix GeneChips. Depending on their ease of use, these tools can be used in addition to DRAGON Annotate in order to compare the output of the systems. If errors are discovered in the information annotated by DRAGON relative to the public database Web sites or other annotation tools, they may be due to a need for updating of the DRAGON data or to the fact that DRAGON is under construction. In either case, feedback concerning this type of matter is greatly appreciated and should be sent via E-mail to the questions address provided on the DRAGON Web site.
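Returning to the delimiter problem noted above, a small preprocessing script can strip commas from within quoted fields and emit a tab-delimited file instead. This is a hypothetical helper written for illustration, not part of DRAGON; the input row contents are invented.

```python
import csv
import io

def strip_field_commas(csv_text):
    """Remove commas inside the fields of a .csv file (the collision
    described above, e.g. commas inside Pfam family names) and
    re-emit the rows tab-delimited, which sidesteps the problem."""
    cleaned_rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        # csv.reader respects quoting, so "ATPase, class II" is one field
        cleaned_rows.append([field.replace(",", "") for field in row])
    return "\n".join("\t".join(row) for row in cleaned_rows)

# invented rows: the quoted gene name contains a comma
raw = 'AB000001,"ATPase, class II",1.8\nAB000002,kinase,-0.4'
cleaned = strip_field_commas(raw)
```

After cleaning, no comma survives anywhere in the file, so even a parser that naively splits on every comma (or, here, on tabs) recovers the intended columns.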
One major assumption that is made in the use of DRAGON and DRAGON View, and, indeed, in the analysis of most large-scale gene expression data, is that the information associated with the expression data relates to the gene whose expression patterns are being directly measured. This, of course, is not the case for many types of information. For example, SWISS-PROT information and Pfam information are associated with the encoded protein, not the gene being measured. Perhaps in the majority of cases it is safe to assume that gene expression levels are indicative of changes in encoded protein expression levels. However, eventual widespread use of proteomics technologies allowing for the large-scale measurement of protein expression levels will make this assumption unnecessary. When this becomes the case, it will be possible to apply the same annotation and analysis methods provided by DRAGON and DRAGON View to large-scale protein expression data just as easily as to large-scale gene expression data.

Suggestions for Further Analysis

There is no fundamental reason why the DRAGON View tools need to be the only tools used for the analysis of expression data in relation to annotated functional classes derived from the DRAGON Annotate tool. Once the user obtains the annotated output data file, any one of a number of analyses can be performed. This flexibility in analysis options is a critical design feature of the DRAGON and DRAGON View systems. For example, instead of using DRAGON Families to search for the coordinate regulation of functionally related gene groups, a K-means clustering strategy can be used to perform the same type of analysis (e.g., UNIT 7.3). An annotated expression data set can be clustered using a K-means clustering algorithm by gene expression values over time.
Following clustering, the user can search for genes that are both clustered into the same group and associated with the same type of annotated functional information. This is just one example of the type of analysis that can be performed once the user has access to the large amounts of functionally relevant information, concerning all of the members of a gene expression data set, that DRAGON Annotate provides.

Literature Cited

Bailey, S.N., Wu, R.Z., and Sabatini, D.M. 2002. Applications of transfected cell microarrays in high-throughput drug discovery. Drug Discov. Today 7:S113-S118.

Bouton, C.M. and Pevsner, J. 2001. DRAGON: Database Referencing of Array Genes Online. Bioinformatics 16:1038-1039.

Bouton, C.M. and Pevsner, J. 2002. DRAGON View: Information visualization for annotated microarray data. Bioinformatics 18:323-324.

Bowtell, D.D.L. 1999. Options available—from start to finish—for obtaining expression data by microarray. Nat. Genet. Suppl. 21:25-32.

Cheung, V.G., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R., and Childs, G. 1999. Making and reading microarrays. Nat. Genet. Suppl. 21:15-19.

Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O., and Herskowitz, I. 1998. The transcriptional program of sporulation in budding yeast. Science 282:699-705.

Colantuoni, C., Henry, G., Zeger, S., and Pevsner, J. 2002. SNOMAD (Standardization and NOrmalization of MicroArray Data): Web-accessible gene expression data analysis. Bioinformatics 18:1540-1541.

Duggan, D.J., Bittner, M., Chen, Y., Meltzer, P., and Trent, J.M. 1999. Expression profiling using cDNA microarrays. Nat. Genet. Suppl. 21:10-14.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.

Frishman, D., Heumann, K., Lesk, A., and Mewes, H.-W. 1998. Comprehensive, comprehensible, distributed and intelligent databases: Current status. Bioinformatics 14:551-561.

Gawantka, V., Pollet, N., Delius, H., Vingron, M., Pfister, R., Nitsch, R., Blumenstock, C., and Niehrs, C. 1998. Gene expression screening in Xenopus identifies molecular pathways, predicts gene function and provides a global view of embryonic gene expression. Mech. Dev. 77:95-141.

Gibbons, F.D. and Roth, F.P. 2002. Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res. 12:1574-1581.

Heyer, L.J., Kruglyak, S., and Yooseph, S. 1999. Exploring expression data: Identification and analysis of coexpressed genes. Genome Res. 9:1106-1115.

Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C., Trent, J.M., Staudt, L.M., Hudson, J. Jr., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P.O. 1999. The transcriptional program in the response of human fibroblasts to serum. Science 283:83-87.

Kanehisa, M. and Goto, S. 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28:27-30.

Kanehisa, M. et al. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30:42-46.

Lal, S.P., Christopherson, R.I., and dos Remedios, C.G. 2002. Antibody arrays: An embryonic but rapidly growing technology. Drug Discov. Today 7:S143-S149.

Liang, S., Fuhrman, S., and Somogyi, R. 1998. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. Biocomput. 3:18-29.

Lipshutz, R.J., Fodor, S.P.A., Gingeras, T.R., and Lockhart, D.J. 1999. High density synthetic oligonucleotide arrays. Nat. Genet. Suppl. 21:20-24.

Macauley, J., Wang, H., and Goodman, N. 1998. A model system for studying the integration of molecular biology databases. Bioinformatics 14:575-582.

Michaels, G.S., Carr, D.B., Askenaki, M., Fuhrman, S., Wen, X., and Somogyi, R. 1998. Cluster analysis and data visualization of large-scale gene expression data. Pac. Symp. Biocomput. 3:42-53.

Somogyi, R., Fuhrman, S., Askenazi, M., and Wuensche, A. 1997. The gene expression matrix: Towards the extraction of genetic network architectures. Proc. Second World Cong. Nonlinear Analysts 1996. 30:1815-1824.

Spellman, P.T. and Rubin, G.M. 2002. Evidence for large domains of similarly expressed genes in the Drosophila genome. J. Biol. 1:5.1-5.8.

Szallasi, Z. 1999. Genetic network analysis in light of massively parallel biological data acquisition. Pac. Symp. Biocomput. 4:5-16.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and applications to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 96:2907-2912.

Toronen, P., Kolehmainen, M., Wong, G., and Castren, E. 1999. Analysis of gene expression data using self-organizing maps. FEBS Lett. 451:142-146.

Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. 1995. Serial analysis of gene expression. Science 270:484-487.

Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L., and Somogyi, R. 1998. Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. U.S.A. 95:334-339.

Zhang, M.Q. 1999. Large-scale gene expression data analysis: A new challenge to computational biologists. Genome Res. 9:681-688.

Key References

Bouton and Pevsner, 2001. See above. Original publication concerning the DRAGON database.

Bouton and Pevsner, 2002. See above. Original publication concerning the DRAGON View visualization tools.

Bouton, C.M., Hossain, M.A., Frelin, L.P., Laterra, J., and Pevsner, J. 2001. Microarray analysis of differential gene expression in lead-exposed astrocytes. Toxicol. Appl. Pharmacol. 176:34-53. Research publication that reports use of DRAGON and DRAGON View in the context of a toxicogenomic microarray study.

Iyer et al., 1999. See above. Reports the microarray study from which the example data sets for this unit were derived.

Contributed by Christopher M.L.S. Bouton
LION Bioscience Research
Cambridge, Massachusetts

Jonathan Pevsner
Kennedy Krieger Institute and Johns Hopkins University School of Medicine
Baltimore, Maryland

Integrating Whole-Genome Expression Results into Metabolic Networks with Pathway Processor UNIT 7.6

Genes never act alone in a biological system, but participate in a cascade of networks. As a result, analyzing microarray data from a pathway perspective leads to a new level of understanding of the system. The authors' group has recently developed Pathway Processor (http://cgr.harvard.edu/cavalieri/pp.html), an automatic statistical method to determine which pathways are most affected by transcriptional changes and to map expression data from multiple whole-genome expression experiments onto metabolic pathways (Grosu et al., 2002). The Pathway Processor package (Fig. 7.6.1) consists of three programs: Data File Checker, Pathway Analyzer (see Basic Protocol), and Expression Mapper (see Support Protocol). The final protocol in the unit presents a method for comparing the results from multiple experiments (see Alternate Protocol). The first program included with the Pathway Processor package, called Data File Checker, examines the input microarray data and checks whether it has the correct format for Pathway Analyzer and Expression Mapper. The output from Data File Checker is a text file called data.txt that constitutes the input of the two other programs.

SCORING BIOCHEMICAL PATHWAYS WITH PATHWAY PROCESSOR

Pathway Analyzer is a new method that uses the Fisher Exact Test to score biochemical pathways according to the probability that, by chance alone, as many or more genes in a pathway would be significantly altered as were observed in a given experiment.
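The scoring idea just described can be sketched as a one-sided Fisher exact computation (a hypergeometric tail probability). The counts below are invented, and the actual Pathway Analyzer program may differ in detail.

```python
from math import comb

def pathway_pvalue(n_total, n_sig, path_size, path_sig):
    """One-sided Fisher exact (hypergeometric tail) probability of
    seeing path_sig or more significantly altered genes in a pathway
    of path_size genes, given n_sig altered genes among n_total
    measured overall. A sketch of the Pathway Analyzer scoring idea."""
    upper = min(path_size, n_sig)
    # sum P(X = k) for k = path_sig .. upper under the hypergeometric
    return sum(comb(n_sig, k) * comb(n_total - n_sig, path_size - k)
               for k in range(path_sig, upper + 1)) / comb(n_total, path_size)

# invented counts: 8 of a 10-gene pathway altered, when only 100 of
# 1000 measured genes are altered overall
p = pathway_pvalue(n_total=1000, n_sig=100, path_size=10, path_sig=8)
```

With only one-tenth of all genes altered, eight hits in a ten-gene pathway is far beyond chance expectation, so the tail probability is vanishingly small and the pathway would rank as highly affected.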
Results from multiple experiments can be compared, reducing the analysis from the full set of individual genes to a limited number of pathways of interest.

BASIC PROTOCOL

This tool is the first to include a statistical test to determine automatically the probability that the genes of any of a large number of pathways are significantly altered in a given experiment. Pathway Processor also provides a user-friendly interface, called Expression Mapper (see Support Protocol), which automatically associates expression changes with genes organized into metabolic maps (Grosu et al., 2002). The Pathway Processor program, initially designed for the analysis of yeast and B. subtilis expression data, can readily be adapted to the metabolic networks of other organisms. The program can also be adapted to metabolic pathways other than those reported in KEGG.

Necessary Resources

Hardware: PC running Microsoft Windows. The authors have found that a 700 MHz Pentium PC with 512 MB of RAM performs very well.

Software: Pathway Processor is written completely in Sun Microsystems Java. It is freely available on the Web page of the Bauer Center for Genomics Research (http://www.cgr.harvard.edu/cavalieri/pp.html), or by contacting either Paul Grosu ([email protected]) or Duccio Cavalieri ([email protected]). The program can be downloaded from the Web together with the detailed User's Instruction Manual.

Contributed by Duccio Cavalieri and Paul Grosu
Current Protocols in Bioinformatics (2004) 7.6.1-7.6.19, Supplement 5
Copyright © 2004 by John Wiley & Sons, Inc.

Figure 7.6.1 Flowchart of the Pathway Processor Project, including a screenshot of the directory structure of Pathway Processor.

Files: The tab-delimited data text file is the file where one's expression data will reside.
This data file must be named data.txt and must reside in the data folder of each program with which it will be used (described in greater detail later; see step 1). This is the file used by Pathway Analyzer and Expression Mapper. The file must contain normalized data in the form of ratios. Data should not be log-transformed, since the programs will take care of that where necessary. The file must not have any headers and has the following format: (1) the first column must contain the yeast ORF names (for B. subtilis, use the SubtiList accession numbers, e.g., BG11037; see note below); (2) the last column must contain the normalized ratios; (3) there can be as many columns in between as desired, but the authors recommend that only locus names be placed in the middle column; this provides quicker identification of the ORF in Expression Mapper. Figure 7.6.2 shows an example. There are some requirements and restrictions on the data file: (a) it must not contain any empty ORFs or ratios; (b) it must not contain any 0 ratios, since these are a problem when taking the log of the ratios; (c) it must not contain duplicate ORFs, since these will skew the statistics; (d) it must not contain any blank rows or columns; (e) it must not contain any header columns or extra lines or spaces beyond the text that is in each cell. Each cell must contain only one line and cannot be spread across multiple lines.

Figure 7.6.2 A valid data.txt file.

NOTE: For Bacillus subtilis it is necessary to use the SubtiList accession numbers in data.txt. For example, instead of using aadK, one needs to use BG11037.
This is a procedure that can easily be performed in Microsoft Access, where one has a table of one's data and another table that associates the gene names (e.g., aadK) with the corresponding SubtiList accession numbers (in this case, BG11037). Such associations can be entered as a table in Microsoft Access from various sources freely available on the Internet. Feel free to contact either Paul Grosu ([email protected]) or Duccio Cavalieri ([email protected]).

Installing Pathway Processor

1. Pathway Processor comes as a compressed file called pathway_processor.zip. Unzip this file and all the proper directories and files will be created. All three programs (Pathway Analyzer, Expression Mapper, and Data File Checker) have the same directory architecture. For each program there is one main directory and three subdirectories (data, library, and results; Fig. 7.6.1). The program and the three subdirectories reside in the main folder. The user places the data.txt file in the data folder. The library folder contains data that the program uses to process the user's data. The results folder will receive all of the user's results. The JRE1.3.1 folder is used by the program to start running.

2. After performing the operations described in step 1, run the data file through the Data File Checker program (steps 3 to 5). This program will remove any ORFs that are not present in the pathway matrix against which the data are compared to perform the statistics (explained in more detail later), as well as any data that contain 0 ratios.

Running the Data File Checker

3. Place the data.txt file in the data subdirectory of the data_file_checker folder.

4. Go to the data_file_checker folder and double-click on the run.bat file. Click the Process Request button in the dialog box that appears. The program will parse the data.txt file and remove any ORFs that have 0 ratios or that are not part of the latest SGD ORF listing.
This SGD ORF listing is used by Pathway Analyzer in matrix form to do the statistical calculations. Updates to the pathway matrix file are made on a weekly basis. The pathway matrix file is called pathway_file.txt and resides in the following subdirectories:

For Data_File_Checker: pathway_processor\data_file_checker\library\pathway_file
For Pathway_Analyzer: pathway_processor\pathway_analyzer\library\pathway_file

Figure 7.6.3 (A) Screenshot of the message window one receives when the Data File Checker application has successfully parsed one's data file. (B) Screenshot of the message window one receives when the Data File Checker application has encountered an error while parsing one's data file. This message will alert the user to the row (line number) at which the error occurred. The user will need to open the file, usually with Microsoft Excel, make the correction, and rerun the Data File Checker application. The data files always need to be saved as tab-delimited text files.

5a. Scenario 1: If the data.txt file was of the correct format, the message shown in Figure 7.6.3A will come up. The new processed data.txt file will be found in the results folder. This can be placed in the data directory of pathway_analyzer or expression_mapper (see Support Protocol).

5b. Scenario 2: If the data.txt file was not of the correct format, the message shown in Figure 7.6.3B will come up. The next step is to correct the data file where the error occurred and then run data_file_checker again on the new data file.

Running Pathway Analyzer

6. Place the data.txt file (from step 5a) in the data subdirectory of the pathway_analyzer folder.

7. Go to the pathway_analyzer folder and double-click on the run.bat file. The screen shown in Figure 7.6.4 will come up.

8.
The next step is to set the appropriate fold-change cutoff. Pathway Analyzer starts with a preset fold-change cutoff for the Fisher Exact Test statistic. The user should choose the fold change based on the number of replicates combined to create the data set, on the confidence he or she has in the data, and on the type of experiment. The Fisher Exact Test is based on the number of genes that pass the cutoff, without considering the variance. In the experiment used as an example, the 1.8-fold change was chosen partly by looking at the Gaussian distribution of the fold changes in the experimental data set. It was observed that, in this particular data set, the number of genes falling between 1.5 and 1.6 was much larger than the number between 1.8 and 1.9, and could be the result of noise or variability in the measurements. It is suggested that the analysis be performed with different cutoffs and that the cutoff giving the best values of the Fisher Exact Test be chosen. In Pathway Analyzer, the user specifies the magnitude of the difference in ORF expression that is to be regarded as above background. The program uses the expression "fold change" to indicate the relative change in gene expression, represented as the multiplier by which the level of expression of a particular ORF is increased or decreased in an experiment.

Figure 7.6.4 Screenshot of the Pathway Analyzer application main window.

9. Click on the Process Request button. The status bar will change from "Waiting for process request" to "Working...Please wait for job to finish...". When the program is finished, the status bar will change to "Job done. Waiting for process request." The program will parse the data file and compare it to pathway_file.txt (the pathway matrix file). From this comparison it will generate the Fisher Exact Test. All ratios are transformed to log base 2 values before any analysis is performed.
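The cutoff step (parse the tab-delimited data.txt, skip zero ratios, log2-transform, and keep genes beyond the symmetric fold-change cutoff) can be sketched as below. This is a hypothetical helper, not the program's own code; the parsing details are assumptions based on the file format described above.

```python
import math

def genes_passing_cutoff(lines, fold_cutoff=1.8):
    """Parse tab-delimited data.txt rows (ORF ... ratio) and return the ORFs
    whose log2-transformed ratio exceeds the symmetric fold-change cutoff,
    mapped to their log2 ratios. Zero ratios are skipped (no valid log)."""
    log_cut = math.log2(fold_cutoff)
    passed = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        orf, ratio = fields[0], float(fields[-1])
        if ratio == 0:          # zero ratios are invalid input (cannot take the log)
            continue
        log_ratio = math.log2(ratio)
        if abs(log_ratio) >= log_cut:
            passed[orf] = log_ratio
    return passed
```

With a 1.8-fold cutoff, a ratio of 1.82 passes (up-regulated), a ratio of 0.5 passes (2-fold down-regulated), and a ratio of 1.1 does not.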
The first files generated are two tab-delimited text files, both saved in the results subdirectory of the pathway_analyzer directory:

gene_expression_pathway_summary_file.txt
pathway_summary_file.txt

The gene_expression_pathway_summary_file.txt lists, per pathway, all the genes that passed the cutoff, with their associated fold changes. Table 7.6.1 is a small sample of what it looks like. The KEGG map number of each pathway is also listed in the header (first) row; this comes in handy for Expression Mapper (see Support Protocol). The second file (Table 7.6.2), pathway_summary_file.txt, contains the signed and unsigned Fisher Exact Test statistics. Table 7.6.3 describes the content of the columns of pathway_summary_file.txt. The Signed Fisher Exact Test values come in handy when doing pathway analysis across multiple experiments, as described in the Alternate Protocol.

Table 7.6.1 Visualization of a Detail of Two Columns of gene_expression_pathway_summary_file.txt, Opened Using Microsoft Excel

Pentose and glucuronate interconversions map 40
YBR204C *** Fold Change: 1.82
YKL140W - TGL1 *** Fold Change: 2.05
YKL035W - UGP1 *** Fold Change: 2.38

Fructose and mannose metabolism map 51
YDL055C - PSA1 *** Fold Change: −2.18
YGL253W - HXK2 *** Fold Change: −1.82
YDR368W - YPR1 *** Fold Change: 1.85
YCL040W - GLK1 *** Fold Change: 1.93
YKR009C - FOX2 *** Fold Change: 2.29
YJR159W - SOR1 *** Fold Change: 2.39
YIL107C - PFK26 *** Fold Change: 2.87
YJL155C - FBP26 *** Fold Change: 3.60
YDL243C - AAD4 *** Fold Change: 3.83
YCR107W - AAD3 *** Fold Change: 5.00
YFL056C - AAD6 *** Fold Change: 7.25
YJR155W - AAD10 *** Fold Change: 10.05

Table 7.6.2 Visualization of a Detail of the Second File Obtained from Pathway Analyzer, pathway_summary_file.txt, Opened Using Microsoft Excel

Pathway                                            Genes in pathway   Genes exceeding          Fisher Exact      Up-regulation/            Signed Fisher Exact
                                                   present in the     fold change cutoff       Test (−1.8, 1.8)  down-regulation of        Test (−1.8, 1.8)
                                                   data file          (−1.8, 1.8)                                pathway (−1.8, 1.8)
Glycolysis/Gluconeogenesis map 10                  39                 18                       0.0084793         0.134658288               0.0084793
Styrene degradation map 11                         3                  0                        1                 0                         1
Citrate cycle (TCA cycle) map 20                   23                 18                       4.80E-07          2.477863857               4.80E-07
Pentose phosphate cycle map 30                     22                 13                       0.0016249         0.909243073               0.0016249
Pentose and glucuronate interconversions map 40    8                  3                        0.3770615         1.78955758                0.3770615
Fructose and mannose metabolism map 51             36                 14                       0.085082          0.539119765               0.085082
Galactose metabolism map 52                        29                 8                        0.5521999         1.257629699               0.5521999
Ascorbate and aldarate metabolism map 53           13                 3                        0.7303117         2.138632217               0.7303117
Fatty acid biosynthesis (path 1) map 61            4                  0                        1                 0                         1

Table 7.6.3 Description of the Content of the Columns of the pathway_summary_file.txt

Genes in pathway present in the data file: Lists the number of genes in the particular pathway (named in the last column of that row) that are also present in the data file.

Genes exceeding fold change cutoff (−1.8, 1.8): Lists the number of genes in the particular pathway that passed the cutoff. In parentheses is the fold-change range that was entered when the program was run.

Fisher Exact Test (−1.8, 1.8): Lists the Fisher Exact Test value. In parentheses is the fold-change range that was entered when the program was run.

Up-regulation/down-regulation of pathway (−1.8, 1.8): Calculates the difference between the mean of the log2 ratios of the genes that passed the cutoff within the pathway and the mean for all genes that passed the cutoff. If the number is greater than zero, the pathway is up-regulated compared to the rest of the genome; if less than zero, it is down-regulated; if zero, it is not significant. In parentheses is the fold-change range that was entered when the program was run.

Signed Fisher Exact Test (−1.8, 1.8): Takes the sign of the up-regulation/down-regulation column (only if it is non-zero) and multiplies it by the Fisher Exact Test column value. If the up-regulation/down-regulation column is 0, the value is automatically set to 0. If the value is greater than or equal to −0.0001 and less than 0, it is automatically set to −0.0001; this is done so that colors can be plotted correctly, since colors in ranges very close to zero on the negative side are treated as 0 by some visualization programs. In parentheses is the fold-change range that was entered when the program was run.

Pathway: Lists the pathway name. The KEGG map number, which can be used in Expression Mapper, is listed alongside it.

10. The extent of the alteration of the genes that show major changes, and their position in the pathways identified as of greatest interest with Pathway Analyzer, can now be visualized. It is sufficient to note the KEGG map number of the pathway of interest, as reported in the Pathway column (Table 7.6.2), and proceed to the analysis with Expression Mapper (see Support Protocol).

DETAILED ANALYSIS WITH EXPRESSION MAPPER   SUPPORT PROTOCOL

Expression Mapper allows a detailed examination of the relationships among genes in the pathways of interest. This program features a unique graphical output, displaying differences in expression on metabolic charts of the biochemical pathways to which the ORFs are assigned. The gene names are visualized on the metabolic chart together with the fold change, next to the biological step with which the gene has been associated. The letters are colored red if the gene is up-regulated and green if the gene is down-regulated; the color intensity is proportional to the extent of the change in expression.
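The red/green shading scheme can be sketched as a simple mapping from signed fold change to an RGB triple. This is an illustrative reconstruction, not Expression Mapper's actual rendering code; the saturation cap (max_fold) and the linear scaling are assumptions.

```python
def fold_change_color(fold, max_fold=10.0):
    """Map a signed fold change to an RGB triple: shades of red for
    up-regulated genes, shades of green for down-regulated genes, with the
    channel level scaled by the magnitude of the change (capped at max_fold)."""
    mag = min(abs(fold), max_fold) / max_fold      # normalized magnitude, 0..1
    level = int(round(255 * mag))
    if fold >= 0:
        return (level, 0, 0)    # red channel for up-regulation
    return (0, level, 0)        # green channel for down-regulation
```

A gene at the cap renders as full red (255, 0, 0) or full green (0, 255, 0), while an unchanged gene renders as black (0, 0, 0); intermediate fold changes produce proportionally dimmer shades.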
Single pathways of interest can then be studied in detail using Expression Mapper.

Figure 7.6.5 Screenshot of the Expression Mapper application's Map Manipulation Area window using the new, checked data.txt file. The figure shows the Glycolysis/Gluconeogenesis pathway (KEGG map 10). The text is colored red if the relative change in gene expression is ≥1 and green if it is ≤1. The intensity of the color is proportional to the magnitude of the differential expression. The presence of a gray box indicates that the corresponding step in the biochemical pathway requires multiple gene products. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

Necessary Resources

Hardware
PC running Microsoft Windows. The authors have found that a 700 MHz Pentium PC with 512 MB of RAM performs very well.

Software
Pathway Processor is written entirely in Sun Microsystems Java. It is freely available on the Web page of the Bauer Center for Genomics Research (http://www.cgr.harvard.edu/cavalieri/pp.html), or by contacting either Paul Grosu ([email protected]) or Duccio Cavalieri ([email protected]). The program can be downloaded from the Web together with the detailed User's Instruction Manual.

Figure 7.6.6 The same screenshot as Figure 7.6.5, except that the user has dragged out the per-gene fold changes. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

Files
data.txt file (see Basic Protocol)

1. After completing the Basic Protocol, place the data.txt file in the data subdirectory of the expression_mapper folder.

2.
Go to the expression_mapper folder and double-click on the run.bat file.

3. Enter the KEGG map number of interest in the dialog box that appears, then click on the Process Request button. Be sure to type in a map that exists. If unsure, check the Pathway column of the pathway_summary_file.txt file (Table 7.6.2) for the KEGG map number of greatest interest.

4. A window will come up that looks similar to Figure 7.6.5 (this is for KEGG map number 10).

Figure 7.6.7 JPEG output file that is saved from Figure 7.6.6 when one closes the Map Manipulation Area window. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

Figure 7.6.8 A portion of the Map Manipulation Area window using the B. subtilis version of the program. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

The user will notice that the fold changes of some of the ORFs are listed. These are locations where only one ORF is present. The ORFs are colored with shades of red if they are up-regulated and shades of green if they are down-regulated. The lighter the shade, the more up-regulated or down-regulated the gene is; the darker the shade, the less up-regulated or down-regulated the gene is. The ratios are transformed into fold changes and written on the graph. The user will also notice that some locations contain gray boxes. These are locations where more than one gene is located. To view these, click on the gray box and drag it out (Fig. 7.6.6).

5. Finally, once satisfied with the way the pathway layout looks, one can save the image by closing the window.
By closing the window, an output file for the corresponding map number is created in the results folder under the expression_mapper directory. The file name is created from the template [map number] output.jpg. For instance, if KEGG map number 10 is used, the output file will be 10 output.jpg. The output for KEGG map 10 is shown in Figure 7.6.7. For B. subtilis, there will be boxes that are green; these boxes are prerendered green by KEGG to indicate that they contain B. subtilis genes. When they are seen on screen, it means that the data do not contain those genes, since they are left green and not overwritten with either a gray box or a specific B. subtilis gene and its associated fold change. Figure 7.6.8 shows an example: EC number 6.3.2.4 is left green and not overwritten. Remember that the green of 6.3.2.4 does not necessarily mean that the gene is down-regulated.

COMPARATIVE VISUALIZATION OF PATHWAY ANALYSIS FROM MULTIPLE EXPERIMENTS   ALTERNATE PROTOCOL

It is possible to use Pathway Analyzer to perform pathway analysis across multiple experiments. To do this, first run the Pathway Analyzer program on each experiment of interest with the same cutoffs (see Basic Protocol). Next, take the Signed Fisher Exact Test column of each experiment and place the columns together in one Excel spreadsheet. Everything can then be sorted by the most interesting experiment, such that the most up-regulated pathways are at the top and the most down-regulated pathways are at the bottom. From this, one can make a contour plot. Figure 7.6.9 shows an example of such a plot, performed on data from the paper on time-course expression during the yeast diauxic shift (DeRisi et al., 1997), showing only the top 10 most up-regulated and down-regulated Signed Fisher Exact Test pathways. The data were downloaded from the Pat Brown Laboratory Web site, http://cmgm.stanford.edu/pbrown/explore/array.txt.
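The spreadsheet assembly and sorting described above can be sketched as follows. This is a hypothetical helper, not part of the package; the ranking scheme (reciprocal of the signed p value, so that significant up-regulation sorts first and significant down-regulation last) is an illustrative assumption consistent with the sort order described in this protocol.

```python
def combine_experiments(experiments):
    """Join the Signed Fisher Exact Test columns of several experiments into
    one table keyed by pathway, then sort by the first experiment so the most
    up-regulated pathways come first and the most down-regulated come last.
    `experiments` is a list of {pathway: signed_p} dicts, one per
    pathway_summary_file.txt; missing pathways default to 1.0 (no effect)."""
    pathways = set()
    for exp in experiments:
        pathways.update(exp)
    table = {p: [exp.get(p, 1.0) for exp in experiments] for p in pathways}

    def rank(row):
        p = row[1][0]           # signed p value in the first experiment
        if p == 0:
            return 0.0          # no regulation sorts into the middle
        # +p near 0 -> very negative rank (top); -p near 0 -> very positive (bottom)
        return -1.0 / p

    return sorted(table.items(), key=rank)
```

For a single experiment with signed values {A: 0.001, B: −0.001, C: 1.0}, the sorted order is A (significantly up), C (not significant), B (significantly down).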
Figure 7.6.9 Surface graph obtained using Microsoft Excel to plot all the Signed Fisher Exact Test column values of the different pathway_summary_file.txt files.

It is also possible to preserve the sign and subtract the absolute value from 1, then plot line plots in Microsoft Excel and obtain the result shown in Figure 7.6.10 for the top 10 up-regulated and down-regulated pathways. Figure 7.6.11 shows the figure modified from the paper (DeRisi et al., 1997) itself. There is a very good correlation between the two figures, indicating that Pathway Processor has automatically identified the more relevant features of the process. According to the researcher's preferences, the results can be visualized with different visualization programs. The starting point is always an Excel file containing the values from the Signed Fisher Exact Test column of the different experiments.

To generate similar heat maps and view them with Eisen's clustering programs (Eisen et al., 1998), one would go through the following steps:

1. Take the sorted file and convert every value that is greater than −0.01 and less than 0.01 to 0.01 with the appropriate sign.

Figure 7.6.10 Time course of the experiment described in DeRisi et al. (1997). The figure reports an XY (Scatter) graph using Microsoft Excel to plot all the Signed Fisher Exact Test column values of the different pathway_summary_file.txt files. The p values have been adjusted so that low p values plot as large numbers and vice versa, to show that one can get the same result as the original DeRisi figure. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

Figure 7.6.11 Time course of the experiment described in DeRisi et al.
(1997). The hours on the horizontal bar indicate the time during the diauxic shift at which the mRNA was extracted. The experiment compares differential expression at the indicated time with respect to a common reference. This is a redrawing of the original figure appearing in DeRisi et al. (1997). This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

2. Take the reciprocal of every value.

3. Rename the Pathway column to Name. The file will look similar in format to Table 7.6.4.

4. Save the file as a tab-delimited text file.

5. Open Mike Eisen's TreeView program.

6. Go to the File menu and select Load.

7. Select the file type text (*.TXT) and select the file. The result will look similar to Figure 7.6.12.

Table 7.6.4 Detail of the Visualization of the Results of the Comparison of the Seven Experiments in the Time Course of the DeRisi et al. (1997) Experiment. Rows (Name column): Ribosome map 3010; Purine metabolism map 230; Pyrimidine metabolism map 240; RNA polymerase map 3020; Aminoacyl-tRNA biosynthesis map 970; Methionine metabolism map 271; Selenoamino acid metabolism map 450. Columns: 9 hr, 11 hr, 13 hr, 15 hr, 17 hr, 19 hr, 21 hr. Early time points are mostly 1 (no significant effect); later time points include reciprocally adjusted values such as 100, −1.005134, −1.220797, 10.70821, −31.96548, −1.881824, −5.985468, −8.275396, −29.94031, −43.05598, −75.23784, and −100. Hours in the top row indicate the time during the diauxic shift at which the mRNA was extracted; the experiment compares differential expression at the indicated time with respect to a common reference.

Figure 7.6.12 A screenshot of Mike Eisen's TreeView program using the reciprocally adjusted Signed Fisher Exact Test values, showing how one can quickly visualize the results of multi-experiment pathway analysis.
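The value transformation in steps 1 and 2 above (clamp near-zero signed p values to ±0.01, then take the reciprocal so that highly significant pathways get large magnitudes) can be sketched as a one-value helper; the function name is illustrative, and the treatment of an exact 0 is an assumption.

```python
def treeview_value(signed_p):
    """Convert a Signed Fisher Exact Test value into a TreeView-friendly
    score: clamp values in (-0.01, 0.01) to +/-0.01 (step 1), then take the
    reciprocal (step 2), so significance maps to magnitude."""
    if signed_p == 0:
        return 0.0              # assumption: a 0 (no regulation) stays 0
    if -0.01 < signed_p < 0.01:
        signed_p = 0.01 if signed_p > 0 else -0.01
    return 1.0 / signed_p
```

This reproduces the values visible in Table 7.6.4: a highly significant up-regulated pathway (e.g., signed p = 4.80E-07) maps to 100, a highly significant down-regulated one to −100, and a non-significant signed p of 1 stays 1.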
Data from different experiments analyzed using Pathway Analyzer can also be visualized with the open-source visualization software OpenDX (http://www.opendx.org). This visualization program allows an elegant and detailed examination of the expression levels observed in the experiment, organized by pathway. The advantage of OpenDX is that it visualizes data in three dimensions. An example is shown in Figure 7.6.13. The input to the program consists of three files: one with the pathway names, another with the Signed Fisher Exact Test values, and a third with the header row. The program represents each value graphically as a cube. The color of the cube indicates the extent of the variation, based on the magnitude of the p value and the sign, with red being up-regulated, green down-regulated, and yellow no change.

Figure 7.6.13 Picture representing the down-regulated pathways in the diauxic shift (DeRisi et al., 1997), with the Fisher Exact Test results visualized using OpenDX. The values of the Signed Fisher Exact Test of the 21-hr data set have been sorted according to the value of the Fisher Exact Test; the results of the other data sets for the affected pathways are also reported. The color of the cube indicates the extent of the variation, according to the p values, with red being up-regulated, green down-regulated, and yellow unchanged. The opacity visually represents the statistical significance of the variation: the greater the opacity, the greater the significance of the p value. The color of the cube depends on the p value in the following way: from 1 to 0.15 the color remains yellow; from 0.15 to 0 with over-expression (+) it goes from yellow to red; and from 0.15 to 0 with under-expression (−) it goes from yellow to green. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

The correspondence between the color of the cube and the p value can be modulated according to the user's preferences; the authors suggest that the visualization be tuned in the following way: from 1 to 0.15, the color remains yellow; from 0.15 to 0 with over-expression (+), it goes from yellow to red; and from 0.15 to 0 with under-expression (−), it goes from yellow to green. To allow the eye to focus on the most significant results, it is also suggested that the opacity be changed so that the greater the significance of the variation, the greater the opacity (Fig. 7.6.13). The use of the program is not intuitive, and its application to visualization of microarray data classified using Pathway Processor needs some fine tuning. A detailed description of OpenDX itself is beyond the scope of this manual; a description of the program and manuals on how to use OpenDX can be found at http://www.opendx.org/support.html and http://www.opendx.org/index2.php. A book with a more extensive tutorial can be found at http://www.vizsolutions.com/paths.html.

GUIDELINES FOR UNDERSTANDING RESULTS

Two tab-delimited text files are generated from the comparison files in Pathway Analyzer. One, called gene_expression_pathway_summary_file.txt (Table 7.6.1), contains all the genes that pass the cutoff, organized by pathway, and can be used to retrieve lists of the genes with their fold changes, subdivided according to the KEGG pathway organization.
The other, comb_pathway_summary_file.txt (Table 7.6.2), contains the summary of the statistics for each pathway. It can be imported into Microsoft Excel, enabling the user to sort the results by the various columns, to determine the effect on the different pathways, and to use the file as input for different visualization tools. The Signed Fisher Exact Test column of comb_pathway_summary_file.txt (Table 7.6.2) allows the sorting of up-regulated and down-regulated pathways. The value in this column is composed of two distinct parts. The first part carries the sign, + or −, indicating whether the particular pathway contains genes that are up- or down-regulated. The second part of each entry is a positive real number (between 0 and 1), corresponding to the p value of the Fisher Exact Test for the pathway. The sign is calculated from the difference between the mean relative expression of the genes that pass the cutoff and are in the pathway and the mean relative expression of the genes that pass the cutoff and are not within the pathway (up-regulation/down-regulation column, Table 7.6.2). If there are no genes above the cutoff in a pathway, the sign is arbitrarily set to +. This is done only for convenience, as the p values for such pathways will always be non-significant. Sorting by the Signed Fisher Exact Test is done so that the most significant values are at the top of the column (Table 7.6.2) for the up-regulated pathways and at the bottom for the down-regulated pathways; in the middle are the least significant pathways. The values of the Fisher Exact Test vector can be used to compare different experiments using Microsoft Excel (Table 7.6.2), and the comparison among the different experiments can be represented graphically.
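The two-part construction of the Signed Fisher Exact Test value described above can be sketched as follows. This is an illustrative reconstruction from the column descriptions, not the program's code; the handling of an empty pathway list follows the stated "+ by convention" rule.

```python
def signed_fisher(p_value, pathway_log_ratios, other_log_ratios):
    """Attach a sign to the Fisher Exact Test p value: + if the mean log2
    ratio of cutoff-passing genes inside the pathway exceeds the mean for
    cutoff-passing genes outside it, - otherwise. Values in [-0.0001, 0) are
    clamped to -0.0001 so near-zero negatives plot correctly."""
    if not pathway_log_ratios:
        return p_value          # no genes above cutoff: sign arbitrarily +
    mean_in = sum(pathway_log_ratios) / len(pathway_log_ratios)
    mean_out = sum(other_log_ratios) / len(other_log_ratios)
    diff = mean_in - mean_out
    if diff == 0:
        return 0.0              # no net direction maps to 0, per the program
    signed = p_value if diff > 0 else -p_value
    if -0.0001 <= signed < 0:
        signed = -0.0001        # clamp for visualization programs
    return signed
```

For example, a pathway whose cutoff-passing genes average above the genome-wide mean keeps its p value positive, while one averaging below it gets the same p value negated.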
Programs for the graphical representation range from Excel to more sophisticated tools; one interesting option is OpenDX (http://www.opendx.org), an open-source visualization software package (Fig. 7.6.13). The resulting set of p values for all pathways is finally used to rank the pathways according to the magnitude and direction of the effects. The Pathway Processor results from multiple experiments can be compared, reducing the analysis from the full set of individual genes to a limited number of pathways of interest. The probability that a given pathway is affected is needed to weigh the relative contribution of the biological process at work to the phenotype studied.

COMMENTARY

Background Information

DNA microarrays provide a powerful technology for genomics research. The multistep, data-intensive nature of this approach has created unprecedented challenges for the development of proper statistics and new bioinformatic tools. It is of the greatest importance to integrate information on the genomic scale with the biological information accumulated through years of research on the molecular genetics, biochemistry, and physiology of the organisms that researchers investigate. A genomic approach to understanding fundamental biological processes enables the simultaneous study of the expression patterns of all genes for branch-point enzymes. Similarly, one can look for patterns of expression variation in particular classes of genes, such as those involved in metabolism, the cytoskeleton, cell-division control, apoptosis, membrane transport, sexual reproduction, and so forth. Interpreting such a huge amount of data requires a deep knowledge of metabolism and cellular signaling pathways. The availability of properly annotated pathway databases is one of the requirements for analyzing microarray data in the context of biological pathways.
Efforts to establish proper gene ontology (UNIT 7.2; Ashburner et al., 2000) and pathway databases are continuing, and several resources are available publicly and commercially. Efforts have also been made to integrate functional genomic information into databases, such as ArrayDB (Ermolaeva et al., 1998), SGD (Ball et al., 2000; Ball et al., 2001), YPD, Worm PD, PombePD and callPd (Costanzo et al., 2000), and KEGG (Nakao et al., 1999). The ability to display information on pathway maps is also extremely important. The Kyoto Ency- 7.6.16 Supplement 5 Current Protocols in Bioinformatics clopedia of Genes and Genomes (KEGG; Kanehisa et al., 2002), the Alliance for Cellular Signalling, BioCarta, EcoCyc (Karp et al., 2002a), MetaCyC (Karp et al., 2002b), PathDB, and MIPS all organize existing metabolic information in easily accessible pathway maps. Pathway databases will become more useful as a unique and detailed annotation for all the genes in the sequenced genomes becomes available. In this respect, the situation for yeast contrasts with that for human, mouse, and rat, for which the systematic and detailed annotation and description of open reading frame (ORF) function is still in progress. The visualization of expression data on cellular process charts is also important. Many authors have manually mapped transcriptional changes to metabolic charts, and others have developed automatic methods to assign genes showing expression variation to functional categories, focusing on single pathways. KEGG, MetaCyC, and EcoCyC display expression data from some experiments on their maps. Some commercial microarray analysis packages, such as Rosetta Resolver (Rosetta Biosoftware) have also integrated a feature enabling the display of expression of a given gene in the context of a metabolic map. MAPPFinder and GenMAPP (http://www. 
genmapp.org/; UNIT 7.5) are recently developed tools that allow the display of expression results on metabolic or cellular charts (Doniger et al., 2003). The program represents pathways in a special file format called MAPPs, independent of the gene expression data, and enables the grouping of genes by any organizing principle. The novel feature is that it both allows visualization of networks according to the structure reported in the current pathway databases and provides the ability to modify pathways ad hoc; it also makes it possible to design custom maps and exchange pathway-related data among investigators. The map contains hyperlinks to the expression information and to the information on every gene available in public databases. Still, the program is limited because the information provided on the map is not quantitative but only qualitative and based on color coding, with a repressed or an activated gene reported, respectively, as green or red. This does not automatically indicate the pathways of greatest interest. Microarray papers tend to discuss the fact that a given pathway is activated or repressed based on the number of genes activated or repressed, or just on intuition, and too often the researcher finds only what she or he expected, or already knew. A recent paper from Castillo-Davis and Hartl (2003) describes a method to assess the significance of alteration in expression of diverse cellular pathways, taken from GO, MIPS, or other sources. The program is extremely useful but does not include a visualization tool enabling one to map the results on pathway charts, and does not address the problem arising from applying hypergeometric distributions to the analysis of highly interconnected pathways with a high level of redundancy as defined according to the GO terms.
Ideally, methods analyzing expression data according to a pathway-based logic should give an indication of the statistical significance of the conclusions, provide a user-friendly interface, and encompass the largest possible number of interconnections between genes. At the same time, the ability to separate larger pathways into smaller independent subpathways is important when developing methods that assess the statistical significance of up- or down-regulation of a pathway.

The Pathway Analyzer algorithm

Pathway Analyzer, written in Java, implements a statistical method that automatically identifies which metabolic pathways are most affected by the differences in gene expression observed in a particular experiment. The method associates an ORF with a given biochemical step according to the information contained in 92 pathway files from KEGG (http://www.genome.ad.jp/kegg/). Pathway Analyzer scores KEGG biochemical pathways, measuring the probability that the genes of a pathway are significantly altered in a given experiment. KEGG was chosen for the concise and clear way in which the genes are interconnected, and for its curators' great effort in keeping the information up to date. In deriving scores for pathways, Pathway Analyzer takes into account the following factors: (1) the number of ORFs whose expression is altered in each pathway; (2) the total number of ORFs contained in the pathway; and (3) the proportion of the ORFs in the genome contained in a given pathway. In the first step of the analysis, the user specifies the magnitude of the difference in ORF expression that should be regarded as above background. The relative change in gene expression is the multiplier by which the level of expression of a particular ORF is increased or decreased in an experiment.
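The scoring scheme just described can be sketched in Python. This is a hedged illustration only: Pathway Processor itself is written in Java, and the function names, cutoff convention, and toy counts below are invented for this sketch, not taken from its code. The one-sided Fisher Exact p value is the upper tail of the hypergeometric distribution:

```python
from math import comb

def passes_cutoff(ratio, cutoff=1.8):
    """An ORF counts as 'affected' if its expression ratio exceeds the
    cutoff in either direction (e.g. >= 1.8-fold up or <= 1/1.8 down)."""
    return ratio >= cutoff or ratio <= 1.0 / cutoff

def fisher_one_sided(n_pathway, k_pathway, n_total, k_total):
    """Probability of observing >= k_pathway affected ORFs in a pathway
    of n_pathway ORFs, given k_total affected ORFs among n_total in the
    data file: the upper tail of the hypergeometric distribution."""
    denom = comb(n_total, n_pathway)
    p = 0.0
    for i in range(k_pathway, min(n_pathway, k_total) + 1):
        p += comb(k_total, i) * comb(n_total - k_total, n_pathway - i) / denom
    return p

# Toy data: 10 ORFs in the data file, 5 affected overall; a 5-ORF
# pathway in which all 5 members pass the cutoff.
p = fisher_one_sided(n_pathway=5, k_pathway=5, n_total=10, k_total=5)
print(round(p, 6))  # 1/252, i.e. 0.003968
```

Pathway Analyzer applies a test of this form to each of the 92 KEGG pathways and ranks them by the resulting p values; a pathway containing no affected ORFs gets p = 1.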
Hence, Pathway Analyzer allows the study of differences smaller than a given fold change, but which affect a statistically significant number of ORFs in a particular metabolic pathway. Consistent differential expression of a number of ORFs in the same pathway can have important biological implications; for example, it may indicate the existence of a set of coordinately regulated ORFs. The program uses the Fisher Exact Test to calculate the probability that a difference in ORF expression in each of the 92 pathways could be due to chance. A statistically significant probability means that a particular pathway contains more affected ORFs than would be expected by chance. The program allows the user to run the Fisher Exact Test with different cutoff values.

Fisher Exact Test

The analysis performed with the Fisher Exact Test provides a quick and user-friendly way of determining which pathways are the most affected. The one-sided Fisher Exact Test calculates a p value based on the number of genes whose expression exceeds a user-specified cutoff in a given pathway. This p value is the probability that by chance the pathway would contain as many affected genes as, or more affected genes than, actually observed, the null hypothesis being that the relative changes in gene expression of the genes in the pathway are a random subset of those observed in the experiment as a whole.

Literature Cited

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., and Sherlock, G. 2000. Gene Ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25-29.
Ball, C.A., Dolinski, K., Dwight, S.S., Harris, M.A., Issel-Tarver, L., Kasarskis, A., Scafe, C.R., Sherlock, G., Binkley, G., Jin, H., Kaloper, M., Orr, S.D., Schroeder, M., Weng, S., Zhu, Y., Botstein, D., and Cherry, J.M. 2000. Integrating functional genomic information into the Saccharomyces genome database. Nucleic Acids Res. 28:77-80.

Ball, C.A., Jin, H., Sherlock, G., Weng, S., Matese, J.C., Andrada, R., Binkley, G., Dolinski, K., Dwight, S.S., Harris, M.A., Issel-Tarver, L., Schroeder, M., Botstein, D., and Cherry, J.M. 2001. Saccharomyces genome database provides tools to survey gene expression and functional analysis data. Nucleic Acids Res. 29:80-81.

Castillo-Davis, C.I. and Hartl, D.L. 2003. GeneMerge: Post-genomic analysis, data mining, and hypothesis testing. Bioinformatics 19:891-892.

Costanzo, M.C., Hogan, J.D., Cusick, M.E., Davis, B.P., Fancher, A.M., Hodges, P.E., Kondu, P., Lengieza, C., Lew-Smith, J.E., Lingner, C., Roberg-Perez, K.J., Tillberg, M., Brooks, J.E., and Garrels, J.I. 2000. The yeast proteome database (YPD) and Caenorhabditis elegans proteome database (WormPD): Comprehensive resources for the organization and comparison of model organism protein information. Nucleic Acids Res. 28:73-76.

DeRisi, J.L., Iyer, V.R., and Brown, P.O. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278:680-686.

Doniger, S.W., Salomonis, N., Dahlquist, K.D., Vranizan, K., Lawlor, S.C., and Conklin, B.R. 2003. MAPPFinder: Using gene ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol. 4:R7.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.

Ermolaeva, O., Rastogi, M., Pruitt, K.D., Schuler, G.D., Bittner, M.L., Chen, Y., Simon, R., Meltzer, P., Trent, J.M., and Boguski, M.S.
1998. Data management and analysis for gene expression arrays. Nat. Genet. 20:19-23.

Grosu, P., Townsend, J.P., Hartl, D.L., and Cavalieri, D. 2002. Pathway Processor: A tool for integrating whole-genome expression results into metabolic networks. Genome Res. 12:1121-1126.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30:42-46.

Karp, P.D., Riley, M., Saier, M., Paulsen, I.T., Collado-Vides, J., Paley, S.M., Pellegrini-Toole, A., Bonavides, C., and Gama-Castro, S. 2002a. The EcoCyc database. Nucleic Acids Res. 30:56-58.

Karp, P.D., Riley, M., Paley, S.M., and Pellegrini-Toole, A. 2002b. The MetaCyc database. Nucleic Acids Res. 30:59-61.

Nakao, M., Bono, H., Kawashima, S., Kamiya, T., Sato, K., Goto, S., and Kanehisa, M. 1999. Genome-scale gene expression analysis and pathway reconstruction in KEGG. Genome Inform. Ser. Workshop Genome Inform. 10:94-103.

Internet Resources

http://www.cgr.harvard.edu/cavalieri/pp.html
Duccio Cavalieri CGR Web site.

http://www.genome.ad.jp/kegg/
The Kyoto Encyclopedia of Genes and Genomes (KEGG) home page.

http://www.proteome.com/databases/YPD/YPDsearch-quick.html
The yeast proteome database (YPD) home page.

http://www.ncgr.org/genex/
GeneX, gene expression home page at the National Center for Genome Resources.

http://www.opendx.org
http://www.opendx.org/index2.php
The open-source visualization software OpenDX, along with manuals and tutorials.

http://cmgm.stanford.edu/pbrown/
The Pat Brown Laboratory Web site.

http://genome-www.stanford.edu/Saccharomyces/
The Saccharomyces Genome Database (SGD) home page.
Contributed by Duccio Cavalieri and Paul Grosu
Bauer Center for Genomics Research
Harvard University
Cambridge, Massachusetts

Integrating Whole-Genome Expression Results into Metabolic Networks with Pathway Processor (UNIT 7.6)

Genes never act alone in a biological system, but participate in a cascade of networks. As a result, analyzing microarray data from a pathway perspective leads to a new level of understanding of the system. The authors' group has recently developed Pathway Processor (http://cgr.harvard.edu/cavalieri/pp.html), an automatic statistical method to determine which pathways are most affected by transcriptional changes and to map expression data from multiple whole-genome expression experiments onto metabolic pathways (Grosu et al., 2002). The Pathway Processor package (Fig. 7.6.1) consists of three programs: Data File Checker, Pathway Analyzer (see Basic Protocol), and Expression Mapper (see Support Protocol). The final protocol in the unit presents a method for comparing the results from multiple experiments (see Alternate Protocol). The first program included with the Pathway Processor package, Data File Checker, examines the input microarray data and checks whether it has the correct format for Pathway Analyzer and Expression Mapper. The output from Data File Checker is a text file called data.txt that constitutes the input of the two other programs.

SCORING BIOCHEMICAL PATHWAYS WITH PATHWAY PROCESSOR

Pathway Analyzer is a new method that uses the Fisher Exact Test to score biochemical pathways according to the probability that as many or more genes in a pathway would be significantly altered in a given experiment as would be altered by chance alone. Results from multiple experiments can be compared, reducing the analysis from the full set of individual genes to a limited number of pathways of interest.
BASIC PROTOCOL

This tool is the first to include a statistical test to determine automatically the probability that the genes of any of a large number of pathways are significantly altered in a given experiment. Pathway Processor also provides a user-friendly interface, called Expression Mapper (see Support Protocol), which automatically associates expression changes with genes organized into metabolic maps (Grosu et al., 2002). The Pathway Processor program, initially designed for the analysis of yeast and B. subtilis expression data, can readily be adapted to the metabolic networks of other organisms. The program can also be adapted to metabolic pathways other than those reported in KEGG.

Necessary Resources

Hardware
PC running Microsoft Windows. The authors have found that a 700 MHz Pentium PC with 512 MB of RAM performs very well.

Software
Pathway Processor is written completely in Sun Microsystems Java. It is freely available on the Web page of the Bauer Center for Genomics Research (http://www.cgr.harvard.edu/cavalieri/pp.html), or by contacting either Paul Grosu ([email protected]) or Duccio Cavalieri ([email protected]). The program can be downloaded from the Web together with the detailed User's Instruction Manual.

Contributed by Duccio Cavalieri and Paul Grosu
Current Protocols in Bioinformatics (2004) 7.6.1-7.6.19
Copyright © 2004 by John Wiley & Sons, Inc.

Figure 7.6.1 Flowchart of the Pathway Processor project, including a screenshot of the directory structure of Pathway Processor.

Files
The tab-delimited data text file is the file where one's expression data will reside. This data file must have the name data.txt, and will need to reside in the data folder of the programs for which it will be used (this will be described in greater detail later on; see step 1). This is the file used by Pathway Analyzer and Expression Mapper.
The file must contain normalized data in the format of ratios. Data should not be log-transformed, since the programs will take care of that where necessary. The file must not have any headers and is of the following format: (1) the first column must contain the yeast ORF names (for B. subtilis, use the SubtiList accession numbers, e.g., BG11037; see note below); (2) the last column must contain the normalized ratios; (3) there can be as many columns in between as desired, but the authors recommend that only locus names be placed in the middle column; this provides a quicker identification of the ORF in Expression Mapper. Figure 7.6.2 shows an example. There are some requirements and restrictions on the data file, i.e.: (a) the data file must not contain any empty ORFs or ratios; (b) the data file must not contain any 0 ratios, since this will be a problem when taking the log of these ratios; (c) the data file must not contain duplicate ORFs, since the statistics will be skewed; (d) the data file must not contain any blank rows or columns; (e) the data file must not contain any header columns or extra lines or spaces beyond the text that is in each cell. Each cell must contain only one line and cannot be spread across multiple lines.

Figure 7.6.2 A valid data.txt file.

NOTE: For Bacillus subtilis it is necessary to use the SubtiList accession numbers in data.txt. For example, instead of using aadK, one needs to use BG11037. This conversion can easily be performed in Microsoft Access, where one has a table of one's data and another table that associates the gene names (e.g., aadK) with the corresponding SubtiList accession number (in this case, BG11037). Such associations can be entered as a table in Microsoft Access from sources freely available on the Internet. Feel free to contact either Paul Grosu ([email protected]) or Duccio Cavalieri ([email protected]).
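Data File Checker enforces these rules automatically. Purely as an illustration of the checks listed above (the function name and messages below are hypothetical, not part of Pathway Processor), a validation pass over parsed rows of a data.txt-style file might look like:

```python
def check_data_rows(rows):
    """Validate parsed data.txt rows: each row is a list whose first
    element is an ORF name and whose last element is a normalized ratio.
    Returns a list of (row_number, message) problems, mirroring the
    restrictions (a)-(c) above."""
    problems = []
    seen = set()
    for lineno, row in enumerate(rows, start=1):
        if not row or not row[0].strip():
            problems.append((lineno, "empty ORF name"))
            continue
        orf = row[0].strip()
        if orf in seen:
            problems.append((lineno, f"duplicate ORF {orf}"))
        seen.add(orf)
        try:
            ratio = float(row[-1])
        except ValueError:
            problems.append((lineno, "ratio is not a number"))
            continue
        if ratio == 0:
            problems.append((lineno, "zero ratio (cannot be log-transformed)"))
    return problems

# Toy rows: one valid, one with a 0 ratio, one duplicate ORF.
rows = [["YBR204C", "1.82"], ["YKL140W", "TGL1", "0"], ["YBR204C", "2.4"]]
for lineno, msg in check_data_rows(rows):
    print(lineno, msg)
```

A real run would instead load the tab-delimited file with Python's csv module and report the offending row number, much as Data File Checker's error dialog does (Fig. 7.6.3B).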
Installing Pathway Processor

1. Pathway Processor comes as a compressed file called pathway_processor.zip. The user will need to unzip this file, and all the proper directories and files will be created. All three programs (Pathway Analyzer, Expression Mapper, and Data File Checker) have the same directory architecture. For each program there exists one main directory and three subdirectories (data, library, results; Fig. 7.6.1). The program and the three subdirectories reside in the main folder. The user will put the data.txt file in the data folder. The library folder contains data that the program will use to process the user's data. All of the user's results will be written to the results folder. The JRE1.3.1 folder is used by the program to start running.

2. After performing the operations described in step 1, run the data file through the Data File Checker program (steps 3 to 5). This program will remove any ORFs that are not present in the pathway matrix against which the data are compared to perform the statistics (this will be explained in more detail later), as well as any data that contain 0 ratios.

Running the Data File Checker

3. Place the data.txt file in the data subdirectory of the data_file_checker folder.

4. Go to the data_file_checker folder and double-click on the run.bat file. Click the Process Request button in the dialog box that appears. The program will parse the data.txt file and remove any ORFs that have 0 ratios or that are not part of the latest SGD ORF listing. This SGD ORF listing is used by Pathway Analyzer in matrix form to do the statistical calculations. Updates to the pathway matrix file will be done on a weekly basis.
The pathway matrix file is called pathway_file.txt and resides in the following subdirectories:

For Data_File_Checker: pathway_processor\\data_file_checker\\library\\pathway_file
For Pathway_Analyzer: pathway_processor\\pathway_analyzer\\library\\pathway_file

Figure 7.6.3 (A) Screenshot of the message window one receives when the Data File Checker application has successfully parsed one's data file. (B) Screenshot of the message window one receives when the Data File Checker application has encountered an error while parsing one's data file. This message will alert the user to the row (line number) at which the error occurred. The user will need to open the file, usually with Microsoft Excel, make the correction, and rerun the Data File Checker application. The data files always need to be saved as tab-delimited text files.

5a. Scenario 1: If the data.txt file was of the correct format, the message shown in Figure 7.6.3A will come up. The new processed data.txt file will be found in the results folder. This can be placed in the data directory of pathway_analyzer or expression_mapper (see Support Protocol).

5b. Scenario 2: If the data.txt file was not of the correct format, the message shown in Figure 7.6.3B will come up. The next step is to correct the data file where the error occurred and then run data_file_checker again on the new data file.

Running Pathway Analyzer

6. Place the data.txt file (from step 5a) in the data subdirectory of the pathway_analyzer folder.

7. Go to the pathway_analyzer folder and double-click on the run.bat file. The screen shown in Figure 7.6.4 will come up.

8. The next step is to set the appropriate fold change cutoff. Pathway Analyzer will start with a preset fold change cutoff for the Fisher Exact Test statistic.
The user should choose the fold change based on the number of replicates that are combined to create the data set, on the confidence that he or she has in the data, and on the type of experiment. The Fisher Exact Test is based on the number of genes that pass the cutoff, without considering the variance. In the experiment used as an example, the 1.8-fold change was chosen also by looking at the Gaussian distribution of the fold changes in the experimental data set. It was observed that, in this particular data set, the number of genes with fold changes between 1.5 and 1.6 was much larger than the number between 1.8 and 1.9, and could be the result of noise or variability in the measurements. It is suggested that the analysis be performed with different cutoffs and that the one that gives the best values of the Fisher Exact Test be chosen.

Figure 7.6.4 Screenshot of the Pathway Analyzer application main window.

In Pathway Analyzer, the user specifies the magnitude of the difference in ORF expression that is to be regarded as above background. The program uses the expression "fold change" to indicate the relative change in gene expression, represented as the multiplier by which the level of expression of a particular ORF is increased or decreased in an experiment.

9. Click on the Process Request button. The status bar will then change from "Waiting for process request" to "Working...Please wait for job to finish...". When the program is finished, the status bar will change to "Job done. Waiting for process request." The program will parse the data file and compare it to pathway_file.txt (the pathway matrix file). From this comparison it will compute the Fisher Exact Test. All ratios are transformed to log base 2 values before performing any kind of analysis.
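The cutoff-selection heuristic described in step 8 (comparing how many genes fall in successive fold-change bins) can be sketched as follows. This is an illustrative helper under the assumption that down-regulation is expressed as a ratio below 1; it is not part of Pathway Analyzer:

```python
def fold_change_histogram(ratios, edges):
    """Count ORFs whose fold change falls in each [edges[i], edges[i+1])
    bin. Down-regulation (ratio < 1) is folded over to a positive fold
    change, so a ratio of 0.5 counts as 2-fold. A sharp drop between
    adjacent bins can suggest a cutoff that sits above measurement noise."""
    counts = [0] * (len(edges) - 1)
    for r in ratios:
        fc = r if r >= 1 else 1.0 / r
        for i in range(len(counts)):
            if edges[i] <= fc < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Toy ratios: several genes near 1.5- to 1.6-fold, fewer near 1.8- to 2.0-fold.
ratios = [1.55, 1.58, 0.64, 1.52, 1.85, 0.52, 1.91]
print(fold_change_histogram(ratios, [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]))  # [4, 0, 0, 1, 2]
```

Here the pile-up in the 1.5 to 1.6 bin relative to the 1.8 to 1.9 bin mirrors the observation that motivated the 1.8-fold cutoff in the example experiment.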
The first set of tab-delimited text files generated, both saved in the results subdirectory of the pathway_analyzer directory, are:

gene_expression_pathway_summary_file.txt
pathway_summary_file.txt

The gene_expression_pathway_summary_file.txt lists, per pathway, all the genes that passed the cutoff, with the associated fold change. Table 7.6.1 is a small sample of what it will look like. The KEGG map number of each pathway is also listed in the header (first) row; this will come in handy for Expression Mapper (see Support Protocol).

The second file (Table 7.6.2), pathway_summary_file.txt, contains the signed and unsigned Fisher Exact Test information. Table 7.6.3 describes the content of its columns. The Signed Fisher Exact Test values come in handy when doing pathway analysis across multiple experiments, as described in the Alternate Protocol.

Table 7.6.1 Detail of Two Columns of gene_expression_pathway_summary_file.txt, Opened Using Microsoft Excel

Pentose and glucuronate interconversions map40
YBR204C *** Fold Change: 1.82
YKL140W - TGL1 *** Fold Change: 2.05
YKL035W - UGP1 *** Fold Change: 2.38

Fructose and mannose metabolism map51
YDL055C - PSA1 *** Fold Change: −2.18
YGL253W - HXK2 *** Fold Change: −1.82
YDR368W - YPR1 *** Fold Change: 1.85
YCL040W - GLK1 *** Fold Change: 1.93
YKR009C - FOX2 *** Fold Change: 2.29
YJR159W - SOR1 *** Fold Change: 2.39
YIL107C - PFK26 *** Fold Change: 2.87
YJL155C - FBP26 *** Fold Change: 3.60
YDL243C - AAD4 *** Fold Change: 3.83
YCR107W - AAD3 *** Fold Change: 5.00
YFL056C - AAD6 *** Fold Change: 7.25
YJR155W - AAD10 *** Fold Change: 10.05

Table 7.6.2 Detail of pathway_summary_file.txt, Opened Using Microsoft Excel. Each row gives, in order: genes in pathway present in the data file; genes exceeding fold change cutoff (−1.8, 1.8); Fisher Exact Test (−1.8, 1.8); up-regulation/down-regulation of pathway (−1.8, 1.8); Signed Fisher Exact Test (−1.8, 1.8)

Glycolysis/Gluconeogenesis (map10): 39, 18, 0.0084793, 0.134658288, 0.0084793
Styrene degradation (map11): 3, 0, 1, 0, 1
Citrate cycle (TCA cycle) (map20): 23, 18, 4.80E-07, 2.477863857, 4.80E-07
Pentose phosphate cycle (map30): 22, 13, 0.0016249, 0.909243073, 0.0016249
Pentose and glucuronate interconversions (map40): 8, 3, 0.3770615, 1.78955758, 0.3770615
Fructose and mannose metabolism (map51): 36, 14, 0.085082, 0.539119765, 0.085082
Galactose metabolism (map52): 29, 8, 0.5521999, 1.257629699, 0.5521999
Ascorbate and aldarate metabolism (map53): 13, 3, 0.7303117, 2.138632217, 0.7303117
Fatty acid biosynthesis (path 1) (map61): 4, 0, 1, 0, 1

Table 7.6.3 Description of the Content of the Columns of pathway_summary_file.txt

Genes in pathway present in the data file: Lists the number of genes that are in the particular pathway (named in the Pathway column of that row) and also present in the data file.

Genes exceeding fold change cutoff (−1.8, 1.8): Lists the number of genes in that pathway that passed the cutoff. In parentheses is the fold change range that was entered when the program was run.

Fisher Exact Test (−1.8, 1.8): Lists the Fisher Exact Test value. In parentheses is the fold change range that was entered when the program was run.

Up-regulation/down-regulation of pathway (−1.8, 1.8): Calculates the difference between the mean of the log2 ratios of the genes that passed the cutoff within the pathway and the mean for all the genes that passed the cutoff. If the number is greater than zero, the pathway is up-regulated compared to the rest of the genome; if less than zero, it is down-regulated; if zero, it is not significant. In parentheses is the fold change range that was entered when the program was run.

Signed Fisher Exact Test (−1.8, 1.8): Takes the sign of the up-regulation/down-regulation column (only if it is non-zero) and multiplies it by the Fisher Exact Test value. If the up-regulation/down-regulation column is 0, the value is automatically set to 0. If the value is greater than or equal to −0.0001 and less than 0, it is automatically set to −0.0001. This is done so that colors can be plotted correctly, since values in ranges very close to zero in the negative region are considered 0 by some visualization programs. In parentheses is the fold change range that was entered when the program was run.

Pathway: Lists the pathway name. The KEGG map number, which can be used in Expression Mapper, is listed in parentheses.

10. The extent of the alteration of the genes that show major changes, and their position in the pathways identified as of greatest interest with Pathway Analyzer, can now be visualized. Simply note the KEGG map number of the pathway of interest, as reported in the Pathway column (Table 7.6.2), and proceed to the analysis with Expression Mapper (see Support Protocol).

DETAILED ANALYSIS WITH EXPRESSION MAPPER

SUPPORT PROTOCOL

Expression Mapper allows a detailed examination of the relationships among genes in the pathways of interest. This program features a unique graphical output, displaying differences in expression on metabolic charts of the biochemical pathways to which the ORFs are assigned. The gene names are visualized on the metabolic chart together with the fold change, next to the biochemical step with which the gene has been associated. The text is colored red if the gene is up-regulated and green if the gene is down-regulated; the color intensity is proportional to the extent of the change in expression.
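The two derived columns of pathway_summary_file.txt described in Table 7.6.3, the up-/down-regulation score and the Signed Fisher Exact Test with its −0.0001 clamp, can be sketched as follows. This is a hedged Python illustration of the stated logic, with invented function names; Pathway Processor's own Java implementation may differ in detail:

```python
def updown_score(pathway_log2, all_log2):
    """Difference between the mean log2 ratio of cutoff-passing genes in
    the pathway and that of all cutoff-passing genes: > 0 means the
    pathway is up-regulated relative to the rest of the genome."""
    return sum(pathway_log2) / len(pathway_log2) - sum(all_log2) / len(all_log2)

def signed_fisher(p_value, updown):
    """Attach the sign of the up-/down-regulation score to the Fisher
    Exact Test value; clamp results in (-0.0001, 0) to -0.0001 so that
    visualization programs do not round them to zero."""
    if updown == 0:
        return 0.0
    signed = p_value if updown > 0 else -p_value
    if -0.0001 <= signed < 0:
        signed = -0.0001
    return signed

# First row of Table 7.6.2: p = 0.0084793, up-regulated (0.1346... > 0).
print(signed_fisher(0.0084793, 0.134658288))  # 0.0084793
# A strongly significant down-regulated pathway gets its sign preserved;
# a barely-negative value is clamped.
print(signed_fisher(0.00005, -1.0))           # -0.0001
```

The clamped signed values are exactly what the Alternate Protocol later collects across experiments for plotting.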
Single pathways of interest can then be studied in detail using Expression Mapper.

Figure 7.6.5 Screenshot of the Expression Mapper application's Map Manipulation Area window using the new, checked data.txt file. The figure reports the Glycolysis/Gluconeogenesis pathway (KEGG map 10). The text is colored red if the relative change in gene expression is ≥1 or green if it is ≤−1. The intensity of the color is proportional to the magnitude of the differential expression. The presence of a gray box indicates that the corresponding step in the biochemical pathway requires multiple gene products. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

Necessary Resources

Hardware
PC running Microsoft Windows. The authors have found that a 700 MHz Pentium PC with 512 MB of RAM performs very well.

Software
Pathway Processor is written completely in Sun Microsystems Java. It is freely available on the Web page of the Bauer Center for Genomics Research (http://www.cgr.harvard.edu/cavalieri/pp.html), or by contacting either Paul Grosu ([email protected]) or Duccio Cavalieri ([email protected]). The program can be downloaded from the Web together with the detailed User's Instruction Manual.

Figure 7.6.6 The same screenshot as Figure 7.6.5, with the exception that the user has dragged out the per-gene fold changes. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

Files
data.txt file (see Basic Protocol)

1. After completing the Basic Protocol, place the data.txt file in the data subdirectory of the expression_mapper folder.

2.
Go to the expression_mapper folder and double-click on the run.bat file.

3. Enter the KEGG map number of interest in the dialog box that appears, then click on the Process Request button. Be sure to type in a map that exists. If unsure, check the Pathway column of the pathway_summary_file.txt file (Table 7.6.2) for the KEGG map number of greatest interest.

4. A window will come up that will look similar to Figure 7.6.5 (this is for KEGG map number 10).

Figure 7.6.7 JPEG output file that is saved from Figure 7.6.6 when one closes the Map Manipulation Area window. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

Figure 7.6.8 A portion of the Map Manipulation Area window using the B. subtilis version of the program. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.

The user will notice that the fold changes of some of the ORFs are listed. These are locations where only one ORF is present. The ORFs are colored with shades of red if they are up-regulated and shades of green if they are down-regulated. The lighter the shade, the more up-regulated or down-regulated the gene is; the darker the shade, the less up-regulated or down-regulated the gene is. The ratios are transformed into fold changes and written on the graph. The user will notice that some locations contain gray boxes. These are locations where more than one gene is located. In order to view these, one will need to click on the gray box and drag it out (Fig. 7.6.6).

5. Finally, once satisfied with the way the pathway layout looks, one can save the image by closing the window.
By closing the window, an output file for the corresponding map number will be created in the results folder under the expression_mapper directory. The file name is created from the template [map number] output.jpg; for instance, if KEGG map number 10 is used, the output file will be 10 output.jpg. The output of KEGG map 10 is shown in Figure 7.6.7.

For B. subtilis, there will be boxes that are green; these boxes are prerendered green by KEGG to indicate that they contain B. subtilis genes. When they are seen on screen, it means that the data do not contain those genes, since they are left green and not overwritten with either a gray box or a specific B. subtilis gene and its associated fold change. Figure 7.6.8 shows an example: in that figure, EC number 6.3.2.4 is left green and not overwritten. Remember that the green in 6.3.2.4 does not necessarily mean that the gene is down-regulated.

COMPARATIVE VISUALIZATION OF PATHWAY ANALYSIS FROM MULTIPLE EXPERIMENTS

ALTERNATE PROTOCOL

It is possible to use Pathway Analyzer to perform pathway analysis across multiple experiments. To do this, first run the Pathway Analyzer program on each experiment of interest with the same cutoffs (see Basic Protocol). Next, take the Signed Fisher Exact Test column of each experiment and place these columns into one Excel spreadsheet. Everything can then be sorted by the most interesting experiment, such that the most up-regulated pathways are at the top and the most down-regulated pathways are at the bottom. From this, one can make a contour plot. Figure 7.6.9 shows an example of such a plot, performed on data from the paper on time-course expression during the yeast diauxic shift (DeRisi et al., 1997), showing only the top 10 most up-regulated and down-regulated Signed Fisher Exact Test pathways. The data were downloaded from the Pat Brown Laboratory Web site (http://cmgm.stanford.edu/pbrown/explore/array.txt).
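The spreadsheet assembly just described (one Signed Fisher Exact Test column per experiment, sorted so up-regulated pathways rise to the top) can be sketched in Python. The data layout and function name here are illustrative assumptions, not Pathway Processor output handling; the sort key uses the sign-preserving 1 − |p| transformation mentioned below for Figure 7.6.10:

```python
def merge_signed_columns(experiments):
    """experiments: dict mapping experiment name -> {pathway: signed
    Fisher Exact Test value}. Returns (header, rows) with one row per
    pathway, sorted by a sign-preserving 1 - |p| score in the first
    experiment, so the most up-regulated pathways come first and the
    most down-regulated last."""
    names = list(experiments)
    pathways = sorted({p for scores in experiments.values() for p in scores})
    header = ["Pathway"] + names
    rows = [[p] + [experiments[n].get(p, 0.0) for n in names] for p in pathways]

    def rank(value):
        # Small signed p values map near +1 (up) or -1 (down);
        # insignificant values (|p| near 1) map near 0.
        if value == 0:
            return 0.0
        score = 1 - abs(value)
        return score if value > 0 else -score

    rows.sort(key=lambda row: rank(row[1]), reverse=True)
    return header, rows

# Toy values in the style of Table 7.6.2's Signed Fisher Exact Test column.
experiments = {"t1": {"Citrate cycle (map20)": 4.8e-07,
                      "Styrene degradation (map11)": 1.0,
                      "Pentose phosphate cycle (map30)": -0.0016249}}
header, rows = merge_signed_columns(experiments)
print([row[0] for row in rows])
```

With several time points supplied, the resulting table is what one would paste into Excel to draw the surface or line plots of Figures 7.6.9 and 7.6.10.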
Figure 7.6.9 Surface graph obtained using Microsoft Excel to plot all the Signed Fisher Exact Test column values of the different pathway_summary_file.txt files.

It is possible to preserve the sign and subtract the absolute value from 1, then plot the values as line plots in Microsoft Excel to get the result shown in Figure 7.6.10 for the top 10 up-regulated and down-regulated pathways. Figure 7.6.11 shows the figure modified from the paper itself (DeRisi et al., 1997). There is a very good correlation between the two figures, indicating that Pathway Processor has automatically identified the most relevant features of the process. According to the researcher's preferences, the results can be visualized with different visualization programs. The starting point is always an Excel file containing the values from the Signed Fisher Exact Test column from the different experiments.

Analyzing Expression Results with Pathway Processor

To generate similar heatmaps and view them with Eisen's clustering programs (Eisen et al., 1998), one would go through the following steps:

1. Take the sorted file and convert every value that is greater than −0.01 and less than 0.01 to ±0.01 with the appropriate sign.

Figure 7.6.10 Time course of the experiment described in DeRisi et al. (1997). The figure reports an XY (Scatter) graph using Microsoft Excel to plot all the Signed Fisher Exact Test column values of the different pathway_summary_file.txt files. The p values have been adjusted so that low p values plot as large numbers and vice versa, to show that one can get the same result as the original DeRisi figure.

Figure 7.6.11 Time course of the experiment described in DeRisi et al.
(1997). The hours on the horizontal bar indicate the time, during the diauxic shift, at which the mRNA was extracted. The experiment compares differential expression at the indicated time with respect to a common reference. This is a redrawing of the original figure appearing in DeRisi et al. (1997).

2. Take the reciprocal of every value.

3. Rename the Pathway column to Name. The file should look similar in format to Table 7.6.4.

4. Save the file as a tab-delimited text file.

5. Open Mike Eisen's TreeView program.

6. Go to the File menu and select Load.

7. Select the type of file to be text (*.TXT) and select the file. The result will look similar to Figure 7.6.12.

Table 7.6.4 Detail of the Visualization of the Results of the Comparison of the Seven Experiments in the Time Course of the DeRisi et al. (1997) Experiment(a)

Name 9 hr 11 hr 13 hr 15 hr Ribosome map 3010 Purine metabolism map 230 1 1 1 1 1 100 Pyrimidine metabolism map 240 RNA polymerase map 3020 Aminoacyl-tRNA biosynthesis map 970 Methionine metabolism map 271 Selenoamino acid metabolism map 450 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 17 hr 19 hr 21 hr 1 −1.005134 −1.220797 10.70821 −100 −31.96548 −100 −100 1 1 1 −1.881824 1 1 −5.985468 −8.275396 −100 −100 −100 −100 1 1 −100 −100 −29.94031 −43.05598 −100 −75.23784

(a) Hours in the top row indicate the time, during the diauxic shift, at which the mRNA was extracted; the experiment compares differential expression at the indicated time with respect to a common reference.

Figure 7.6.12 A screenshot of Mike Eisen's TreeView program using the reciprocally adjusted Signed Fisher Exact Test values, showing how one can quickly visualize the results of multi-experiment pathway analysis.
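The two numeric transformations described above (the sign-preserving adjustment used for the line plots, and steps 1 to 4 for TreeView) can be sketched in Python; the helper names are ours, not part of Pathway Processor:

```python
import csv

def adjust_signed_p(signed_p):
    """Keep the sign and subtract the absolute value from 1, so that
    highly significant p values (near 0) plot far from zero and
    non-significant ones (near 1) plot near zero."""
    sign = 1.0 if signed_p >= 0 else -1.0
    return sign * (1.0 - abs(signed_p))

def prepare_for_treeview(rows):
    """Steps 1 to 3: clamp values inside (-0.01, 0.01) to +/-0.01
    with the appropriate sign, take the reciprocal of every value,
    and rename the Pathway column to Name."""
    out = []
    for row in rows:
        new = {"Name": row["Pathway"]}
        for col, value in row.items():
            if col == "Pathway":
                continue
            if abs(value) < 0.01:       # step 1: clamp tiny values
                value = 0.01 if value >= 0 else -0.01
            new[col] = 1.0 / value      # step 2: reciprocal
        out.append(new)
    return out

def save_tab_delimited(rows, path):
    """Step 4: save as a tab-delimited text file for TreeView."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]),
                                delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

For example, a Signed Fisher Exact Test value of −0.002 becomes roughly −0.998 after adjust_signed_p, and −100 after clamping and the reciprocal, matching the −100 entries in Table 7.6.4.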
Data from different experiments analyzed using Pathway Analyzer can be visualized with the open-source visualization software OpenDX (http://www.opendx.org). This visualization program allows an elegant and detailed examination of the expression levels observed in the experiment, according to pathways. The advantage of OpenDX is that it visualizes data in three dimensions. An example is shown in Figure 7.6.13. The input to the program consists of three files: one with the pathway names, another with the Signed Fisher Exact Test values, and a third with the header row. The program represents each value graphically as a cube. The color of the cube indicates the extent of the variation, based on the magnitude of the p value and the sign, with red being up-regulated, green down-regulated, and yellow no change.

Figure 7.6.13 Picture representing the down-regulated pathways in the diauxic shift (DeRisi et al., 1997), with the Fisher Exact Test results visualized using OpenDX. The values of the Signed Fisher Exact Test of the 21-hr data set have been sorted according to the value of the Fisher Exact Test; the results of the other data sets for the affected pathways are also reported. The color of the cube indicates the extent of the variation, according to the p values, with red being up-regulated, green down-regulated, and yellow unchanged. The opacity visually represents the statistical significance of the variation: the greater the opacity, the greater the significance of the p value. The color of the cube depends on the p value in the following way: from 1 to 0.15, the color remains yellow; from 0.15 to 0 with over-expression (+), it goes from yellow to red; and from 0.15 to 0 with under-expression (−), it goes from yellow to green.

The correspondence between the color of the cube and the p value can be modulated according to the user's preferences; the authors suggest that the visualization be tuned in the following way: from 1 to 0.15, the color remains yellow; from 0.15 to 0 with over-expression (+), it goes from yellow to red; from 0.15 to 0 with under-expression (−), it goes from yellow to green. To allow the eye to focus on the most significant results, it is also suggested that the opacity be changed so that the greater the significance of the variation, the greater the opacity (Fig. 7.6.13). The use of the program is not intuitive, and its application to the visualization of microarray data classified using Pathway Processor needs some fine tuning. A detailed description of OpenDX itself is beyond the scope of this manual; a description of the program and manuals on how to use OpenDX can be found at http://www.opendx.org/support.html and http://www.opendx.org/index2.php. A book with a more extensive tutorial can be found at http://www.vizsolutions.com/paths.html.

GUIDELINES FOR UNDERSTANDING RESULTS

Two tab-delimited text files are generated from the comparison files in Pathway Analyzer. One, called gene_expression_pathway_summary_file.txt (Table 7.6.1), contains all the genes that pass the cutoff, organized by pathway, and can be used to retrieve lists of the genes with their fold changes, subdivided according to the KEGG pathway organization.
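The suggested tuning can be expressed as a mapping from a Signed Fisher Exact Test value to an RGBA color (a sketch under our own ramp assumptions; in practice the mapping is configured interactively inside the OpenDX network):

```python
def signed_p_to_rgba(signed_p, threshold=0.15):
    """Map a Signed Fisher Exact Test value to (R, G, B, alpha).

    Follows the tuning suggested above: p values from 1 down to
    `threshold` stay yellow; below the threshold the color ramps from
    yellow to red (+, up-regulated) or yellow to green (-,
    down-regulated), and opacity grows with significance.  The linear
    ramps and the baseline alpha of 0.1 are our assumptions.
    """
    p = abs(signed_p)
    if p >= threshold:
        return (1.0, 1.0, 0.0, 0.1)      # yellow, nearly transparent
    t = (threshold - p) / threshold      # 0 at the threshold, 1 at p = 0
    if signed_p >= 0:
        return (1.0, 1.0 - t, 0.0, t)    # yellow -> red
    return (1.0 - t, 1.0, 0.0, t)        # yellow -> green
```

For instance, a non-significant value of +0.5 stays yellow, while a value of −0.075 (halfway to the threshold) is rendered half-way between yellow and green at 50% opacity.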
The other, comb_pathway_summary_file.txt (Table 7.6.2), contains the summary of the statistics for each pathway. This file can be imported into Microsoft Excel to let the user sort the results by the various columns, determine the effect on the different pathways, and serve as input for different visualization tools. The Signed Fisher Exact Test column of comb_pathway_summary_file.txt (Table 7.6.2) allows the sorting of up-regulated or down-regulated pathways. The value in this column is composed of two distinct parts. The first part carries the sign + or −, indicating whether the particular pathway contains genes that are up- or down-regulated. The second part of each entry is a positive real number (between 0 and 1), corresponding to the p value of the Fisher Exact Test for the pathway. The sign is calculated by subtracting the mean relative expression of the genes that pass the cutoff and are not within the pathway from the mean relative expression of all genes that pass the cutoff and are in the pathway (up-regulation/down-regulation column, Table 7.6.2). If there are no genes above the cutoff in a pathway, the sign is arbitrarily set to +; this is done only for convenience, as the p values for such pathways will always be non-significant. Sorting on the Signed Fisher Exact Test is done so that the most significant values are at the top of the column (Table 7.6.2) for the up-regulated pathways and at the bottom for the down-regulated pathways; in the middle are the least significant pathways. The values of the Fisher Exact Test vector can be used to compare different experiments using Microsoft Excel (Table 7.6.2), and the comparison among the different experiments can be represented graphically.
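This sign-and-sort scheme can be sketched as follows (helper names are ours; the sign orientation follows the convention just described):

```python
from math import copysign

def signed_fisher(p_value, expr_in_pathway, expr_outside):
    """Attach a direction sign to a Fisher Exact Test p value.

    expr_in_pathway / expr_outside are the relative expression values
    of the genes passing the cutoff, inside and outside the pathway.
    With no in-pathway genes above the cutoff, the sign is arbitrarily
    '+' (such p values are never significant anyway).
    """
    if not expr_in_pathway:
        return p_value
    mean_in = sum(expr_in_pathway) / len(expr_in_pathway)
    mean_out = sum(expr_outside) / len(expr_outside)
    return p_value if mean_in >= mean_out else -p_value

def sort_key(signed_p):
    """Sort descending on this key to place the most significant
    up-regulated pathways at the top and the most significant
    down-regulated pathways at the bottom."""
    return copysign(1.0 - abs(signed_p), signed_p)
```

For example, `sorted(values, key=sort_key, reverse=True)` puts a value of +0.0001 first and −0.001 last, with non-significant entries in the middle.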
Programs for the graphical representation can vary from Excel to more sophisticated packages; one interesting option is OpenDX (http://www.opendx.org), an open-source visualization software package (Fig. 7.6.13). The resulting set of p values for all pathways is finally used to rank the pathways according to the magnitude and direction of the effects. The Pathway Processor results from multiple experiments can be compared, reducing the analysis from the full set of individual genes to a limited number of pathways of interest. The probability that a given pathway is affected is needed to weigh the relative contribution of the biological process at work to the phenotype studied.

COMMENTARY

Background Information

DNA microarrays provide a powerful technology for genomics research. The multistep, data-intensive nature of this approach has created unprecedented challenges for the development of proper statistics and new bioinformatic tools. It is of the greatest importance to integrate information on the genomic scale with the biological information accumulated through years of research on the molecular genetics, biochemistry, and physiology of the organisms that researchers investigate. A genomic approach to the understanding of fundamental biological processes enables the simultaneous study of the expression patterns of all genes for branch-point enzymes. Similarly, one can look for patterns of expression variation in particular classes of genes, such as those involved in metabolism, the cytoskeleton, cell-division control, apoptosis, membrane transport, sexual reproduction, and so forth. Interpreting such a huge amount of data requires a deep knowledge of metabolism and cellular signaling pathways. The availability of properly annotated pathway databases is one of the requirements for analyzing microarray data in the context of biological pathways.
Efforts to establish proper gene ontology (UNIT 7.2; Ashburner et al., 2000) and pathway databases are continuing, and several resources are available publicly and commercially. Efforts have also been made to integrate functional genomic information into databases, such as ArrayDB (Ermolaeva et al., 1998), SGD (Ball et al., 2000; Ball et al., 2001), YPD, WormPD, PombePD, and CalPD (Costanzo et al., 2000), and KEGG (Nakao et al., 1999). The ability to display information on pathway maps is also extremely important. The Kyoto Encyclopedia of Genes and Genomes (KEGG; Kanehisa et al., 2002), the Alliance for Cellular Signaling, BioCarta, EcoCyc (Karp et al., 2002a), MetaCyc (Karp et al., 2002b), PathDB, and MIPS all organize existing metabolic information in easily accessible pathway maps. Pathway databases will become more useful as a unique and detailed annotation for all the genes in the sequenced genomes becomes available. In this respect, the situation for yeast contrasts with that for human, mouse, and rat, for which the systematic and detailed annotation and description of open reading frame (ORF) function is still in progress. The visualization of expression data on cellular process charts is also important. Many authors have manually mapped transcriptional changes to metabolic charts, and others have developed automatic methods to assign genes showing expression variation to functional categories, focusing on single pathways. KEGG, MetaCyc, and EcoCyc display expression data from some experiments on their maps. Some commercial microarray analysis packages, such as Rosetta Resolver (Rosetta Biosoftware), have also integrated a feature enabling the display of the expression of a given gene in the context of a metabolic map. MAPPFinder and GenMAPP (http://www.
genmapp.org/; UNIT 7.5) are recently developed tools allowing the display of expression results on metabolic or cellular charts (Doniger et al., 2003). The program represents the pathways in a special file format called MAPPs, independent of the gene expression data, and enables the grouping of genes by any organizing principle. The novel feature is that it allows visualization of networks according to the structure reported in the current pathway databases while also providing the ability to modify pathways ad hoc; it also makes it possible to design custom maps and to exchange pathway-related data among investigators. The map contains hyperlinks to the expression information and to the information on every gene available in public databases. Still, the program is limited because the information provided on the map is not quantitative but only qualitative and based on color coding, with a repressed gene or an activated gene reported, respectively, as green or red. This does not automatically indicate the pathways of greatest interest. Microarray papers tend to discuss the fact that a given pathway is activated or repressed based on the number of genes activated or repressed, or just on intuition, and too often the researcher finds only what she or he expected, or already knew. A recent paper from Castillo-Davis and Hartl (2003) describes a method to assess the significance of alteration in expression of diverse cellular pathways, taken from GO, MIPS, or other sources. The program is extremely useful, but it does not include a visualization tool enabling one to map the results on pathway charts, and it does not address the problem arising from applying hypergeometric distributions to the analysis of highly interconnected pathways with a high level of redundancy as defined according to the GO terms.
Ideally, methods analyzing expression data according to a pathway-based logic should give an indication of the statistical significance of the conclusions, provide a user-friendly interface, and be able to encompass the largest possible number of interconnections between genes. The ability to separate larger pathways into smaller independent subpathways is also important when developing methods for assessing the statistical significance of up- or down-regulation of a pathway.

The Pathway Analyzer algorithm

Pathway Analyzer implements a statistical method in Java that automatically identifies which metabolic pathways are most affected by the differences in gene expression observed in a particular experiment. The method associates an ORF with a given biochemical step according to the information contained in 92 pathway files from KEGG (http://www.genome.ad.jp/kegg/). Pathway Analyzer scores KEGG biochemical pathways, measuring the probability that the genes of a pathway are significantly altered in a given experiment. KEGG has been chosen for the concise and clear way in which the genes are interconnected, and for its curators' great effort in keeping the information up to date. In deriving scores for pathways, Pathway Analyzer takes into account the following factors: (1) the number of ORFs whose expression is altered in each pathway; (2) the total number of ORFs contained in the pathway; and (3) the proportion of the ORFs in the genome contained in a given pathway. In the first step of the analysis, the user specifies the magnitude of the difference in ORF expression that should be regarded as above background. The relative change in gene expression is the multiplier by which the level of expression of a particular ORF is increased or decreased in an experiment.
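Factors (1) to (3), together with the total number of ORFs whose expression is altered genome-wide, are exactly the inputs of a one-sided Fisher Exact Test. A minimal sketch of the tail probability using the hypergeometric distribution (pure Python; Pathway Analyzer itself is a Java program and may differ in detail):

```python
from math import comb

def pathway_p_value(n_genome, n_pathway, n_changed, n_changed_in_pathway):
    """One-sided Fisher Exact Test p value for a pathway: the
    probability of observing at least `n_changed_in_pathway` affected
    ORFs in a pathway of `n_pathway` ORFs, given `n_changed` affected
    ORFs among `n_genome` ORFs in the genome."""
    total = comb(n_genome, n_changed)
    p = 0.0
    for k in range(n_changed_in_pathway, min(n_pathway, n_changed) + 1):
        p += comb(n_pathway, k) * comb(n_genome - n_pathway, n_changed - k) / total
    return p
```

For example, with a 10-ORF genome, a 5-ORF pathway, and 5 affected ORFs all falling inside the pathway, the p value is 1/252, or about 0.004.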
Hence, Pathway Analyzer allows the study of differences smaller than a given fold change, but which affect a statistically significant number of ORFs in a particular metabolic pathway. Consistent differential expression of a number of ORFs in the same pathway can have important biological implications; for example, it may indicate the existence of a set of coordinately regulated ORFs. The program uses the Fisher Exact Test to calculate the probability that a difference in ORF expression in each of the 92 pathways could be due to chance. A statistically significant probability means that a particular pathway contains more affected ORFs than would be expected by chance. The program allows the user to choose different significance cutoffs for the Fisher Exact Test.

Fisher Exact Test

The analysis performed with the Fisher Exact Test provides a quick and user-friendly way of determining which pathways are the most affected. The one-sided Fisher Exact Test calculates a p value based on the number of genes whose expression exceeds a user-specified cutoff in a given pathway. This p value is the probability that, by chance, the pathway would contain as many or more affected genes than actually observed, the null hypothesis being that the relative changes in gene expression of the genes in the pathway are a random subset of those observed in the experiment as a whole.

Literature Cited

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., and Sherlock, G. 2000. Gene Ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25-29.
Ball, C.A., Dolinski, K., Dwight, S.S., Harris, M.A., Issel-Tarver, L., Kasarskis, A., Scafe, C.R., Sherlock, G., Binkley, G., Jin, H., Kaloper, M., Orr, S.D., Schroeder, M., Weng, S., Zhu, Y., Botstein, D., and Cherry, J.M. 2000. Integrating functional genomic information into the Saccharomyces genome database. Nucleic Acids Res. 28:77-80.

Ball, C.A., Jin, H., Sherlock, G., Weng, S., Matese, J.C., Andrada, R., Binkley, G., Dolinski, K., Dwight, S.S., Harris, M.A., Issel-Tarver, L., Schroeder, M., Botstein, D., and Cherry, J.M. 2001. Saccharomyces Genome Database provides tools to survey gene expression and functional analysis data. Nucleic Acids Res. 29:80-81.

Castillo-Davis, C.I. and Hartl, D.L. 2003. GeneMerge: Post-genomic analysis, data mining, and hypothesis testing. Bioinformatics 19:891-892.

Costanzo, M.C., Hogan, J.D., Cusick, M.E., Davis, B.P., Fancher, A.M., Hodges, P.E., Kondu, P., Lengieza, C., Lew-Smith, J.E., Lingner, C., Roberg-Perez, K.J., Tillberg, M., Brooks, J.E., and Garrels, J.I. 2000. The yeast proteome database (YPD) and Caenorhabditis elegans proteome database (WormPD): Comprehensive resources for the organization and comparison of model organism protein information. Nucleic Acids Res. 28:73-76.

DeRisi, J.L., Iyer, V.R., and Brown, P.O. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278:680-686.

Doniger, S.W., Salomonis, N., Dahlquist, K.D., Vranizan, K., Lawlor, S.C., and Conklin, B.R. 2003. MAPPFinder: Using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol. 4:R7.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.

Ermolaeva, O., Rastogi, M., Pruitt, K.D., Schuler, G.D., Bittner, M.L., Chen, Y., Simon, R., Meltzer, P., Trent, J.M., and Boguski, M.S.
1998. Data management and analysis for gene expression arrays. Nat. Genet. 20:19-23.

Grosu, P., Townsend, J.P., Hartl, D.L., and Cavalieri, D. 2002. Pathway Processor: A tool for integrating whole-genome expression results into metabolic networks. Genome Res. 12:1121-1126.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30:42-46.

Karp, P.D., Riley, M., Saier, M., Paulsen, I.T., Collado-Vides, J., Paley, S.M., Pellegrini-Toole, A., Bonavides, C., and Gama-Castro, S. 2002a. The EcoCyc database. Nucleic Acids Res. 30:56-58.

Karp, P.D., Riley, M., Paley, S.M., and Pellegrini-Toole, A. 2002b. The MetaCyc database. Nucleic Acids Res. 30:59-61.

Nakao, M., Bono, H., Kawashima, S., Kamiya, T., Sato, K., Goto, S., and Kanehisa, M. 1999. Genome-scale gene expression analysis and pathway reconstruction in KEGG. Genome Inform. Ser. Workshop Genome Inform. 10:94-103.

Internet Resources

http://www.cgr.harvard.edu/cavalieri/pp.html
Duccio Cavalieri CGR Web site.

http://www.genome.ad.jp/kegg/
The Kyoto Encyclopedia of Genes and Genomes (KEGG) home page.

http://www.proteome.com/databases/YPD/YPDsearch-quick.html
The yeast proteome database (YPD) home page.

http://www.ncgr.org/genex/
GeneX, gene expression home page at the National Center for Genome Resources.

http://cmgm.stanford.edu/pbrown/
The Pat Brown Laboratory Web site.

http://www.opendx.org
http://www.opendx.org/index2.php
The open-source visualization software OpenDX, along with manuals and tutorials.

http://genome-www.stanford.edu/Saccharomyces/
The Saccharomyces Genome Database (SGD) home page.
Contributed by Duccio Cavalieri and Paul Grosu
Bauer Center for Genomics Research
Harvard University
Cambridge, Massachusetts

An Overview of Spotfire for Gene-Expression Studies

Spotfire DecisionSite (Spotfire, Inc.; http://www.spotfire.com) is a powerful data-mining and visualization application with uses in many disciplines. Modules are available for use in various areas of research, e.g., support of gene expression analysis, proteomics, general statistical analysis, chemical lead discovery analysis, and geology. Here the focus is on Spotfire's utility in analyzing gene expression data obtained from DNA microarray experiments (Spotfire DecisionSite for Functional Genomics). Since its advent in the middle of the last decade (Schena et al., 1995; Kozal et al., 1996), DNA microarray technology has revolutionized the way gene expression is measured (Schena et al., 1998). The ability to quantitatively measure the levels of expression of thousands of genes in a single experiment is allowing investigators to make significant advances in the fields of stress response, transcriptional analysis, disease detection and treatment, gene therapy, and many others (Iyer et al., 1999; Lee et al., 2002; Yeoh et al., 2002; Cheok et al., 2003). Analysis of microarray data can be overwhelming and confusing, due to the enormous amount of data generated (Leung and Cavalieri, 2003). In addition, because of technical considerations or genetic or biological variability within the subjects, these measurements can be noisy and error prone (Smyth et al., 2003). To obtain statistically meaningful results, one needs to replicate the experiments several times, adding to the dimensionality and scope of the problem (Kerr and Churchill, 2001). With rapidly emerging microarray analysis methods, scientists are faced with the challenge of incorporating new information into their analyses quickly and easily.
Mining data of this magnitude requires software-based solutions able to handle and manipulate such data. DecisionSite for Functional Genomics (henceforth referred to as Spotfire) is a solution for accessing, analyzing, and visualizing data. This platform combines state-of-the-art data access, gene expression analysis methods, guided workflows, dynamic visualizations, and extensive computational tools. It is a client-server based system that can be run from a Unix-based server and can service either a single client using a single license or a large group of users with an institution-wide site license. The authors of this unit currently use version 7.2 of this application. Spotfire is designed to allow biologists with little or no programming or statistical skills to transform (UNIT 7.8), process, and analyze (UNIT 7.9) microarray data. Tools and Guides are the components of the Spotfire platform: Tools are the analytical access components added to the interface, while Guides connect Tools together to initiate suggested analysis paths. Spotfire can extract data from a variety of databases and data sources, such as Oracle, SQL Server, Informix, Sybase, and networked or even local drives. This ability to retrieve data from multiple databases allows the application to function as a "virtual data warehouse" and offers tremendous potential for integrating data from various sources for data-mining purposes. Once the data have been extracted, Spotfire can be used to interactively query and filter them using multiple overlapping criteria, without knowledge of a query language such as SQL. Results of data manipulation can be instantly visualized using a number of graphics options, including two- and three-dimensional scatter plots, bar charts, profiles, heat maps, Venn diagrams, and others.
Spotfire stores data internally in a proprietary file format allowing for quick response times, and has a series of built-in heuristics and algorithms to perform most of the basic tasks that a user with microarray data would like to perform. These include importing data from two-color microarray experiments (here the focus is on data produced using the popular GenePix scanners from Axon, Inc.; UNIT 7.8); importing data from Affymetrix one-color microarray experiments (UNIT 7.8); filtering and preprocessing to remove unreliable data (UNIT 7.8); log-transformation of array data (UNIT 7.8); scaling and normalization of array data (UNIT 7.8); identification of differentially expressed genes using statistical tests such as the t test/ANOVA algorithm; calculation of fold change as ratios of given signal values; and other calculations (UNIT 7.9). A variety of clustering algorithms are available in the package, e.g., K-means Clustering, Self-Organizing Maps (SOMs), Principal Components Analysis (PCA), and Hierarchical Clustering (UNIT 7.9). It is also possible to join expression data with gene annotations and Gene Ontology (UNIT 7.2) descriptions using the Web-browser functionality within the DecisionSite Navigator (UNIT 7.9). Finally, an application-programming interface is available for Spotfire that allows one to customize the application. Spotfire DecisionSite Developer offers the ability to add new microarray data sources, normalization methods, and algorithms within a guided workflow that can be rapidly deployed to all users. The platform can be updated and extended to incorporate both the expertise gained by the users and the tremendous advances occurring in the field of genomics on a daily basis.

Contributed by Deepak Kaushal and Clayton W. Naeve
Current Protocols in Bioinformatics (2004) 7.7.1-7.7.21
Copyright © 2004 by John Wiley & Sons, Inc.
The authors have been able to integrate S-Plus algorithms into Spotfire to enhance its capabilities. This interface makes it possible for end-users to write code for their own Guides or Tools and incorporate them into the DecisionSite environment for everyday use.

NECESSARY REQUIREMENTS FOR USING THE FUNCTIONAL GENOMICS MODULE OF SPOTFIRE

Hardware

The recommended minimal hardware requirements are modest. For Windows systems: the software will run on an Intel Pentium or equivalent with a 100 MHz processor, 64 Mb RAM, and 20 Mb disk space, with a VGA or better display at 800 × 600 pixels. However, most microarray experiments yield large output files and most experimental designs require several data files to be analyzed simultaneously, so the user will benefit from both much more RAM and a significantly better processor speed. For Macintosh: requirements include a Macintosh PowerPC with 8 Mb of available memory, 3 Mb of free disk space, and a 256-color (or better) video display. A network interface card (NIC) is required for network connections to MetaFrame servers. MDAC (Microsoft Data Access Components) versions 2.1 SP2 (2.1.2.4202.3) to 2.5 (2.50.4403.12) are required. A Web connection to the Spotfire server (http://home.spotfire.net) or a local customer-specific Spotfire server is required. A Web connection is also required to take advantage of Web links for the purpose of querying databases and Web sites on the Internet using columns of data residing in Spotfire (UNIT 7.9). Microsoft PowerPoint, Word, and Excel are required to take advantage of a number of features available within Spotfire related to the export of text results or visualizations (UNIT 7.9). Spotfire (v. 6.2 or above) is required. An evaluation copy of Spotfire DecisionSite for Functional Genomics can be downloaded by contacting the regional account manager at Spotfire, Inc. (http://www.spotfire.com).
A single license can be purchased and is installed through the Web (the client receives a link where the software can be downloaded using a vendor-supplied key). A site license is required for multiple users within an institution to be able to use the software; these users share a local customer-specific Spotfire server. The site license is typically installed on site by the Spotfire support team. The vendor offers academic pricing, which covers a one-year subscription for use of the software along with Spotfire technical support. It is also possible to download a program called System Checker from the same vendor that checks the system on which Spotfire will be installed to ensure compliance. Spotfire must be installed under a Windows user account with full Administrator privileges. Antivirus software must be disabled during installation, as well as while downloading server-delivered Functional Genomics applications. In order to operate properly, Spotfire requires Microsoft Data Access Components (MDAC) version 2.1 SP1 through 2.7 SP1. For Macintosh systems: MacOS 7.5.3 or later (including OS X) is required. In order to operate Spotfire on Apple's Macintosh system, users need to install and configure the Citrix ICA client (http://www.citrix.com) on the system. Open Transport TCP/IP version 1.1.1 or later is required.

Software

For Windows systems: Windows 98 or higher, Windows NT with Service Pack 4.0 or higher, Windows Millennium, or Windows 2000 is required, along with a standard install of Microsoft Internet Explorer v. 5.0 to 6.0.

Files

Spotfire (Functional Genomics module) can import data in nearly any format, but the focus here is on two popular microarray platforms: the commercial GeneChip microarray data (Affymetrix, Inc.) and two-color spotted microarray data produced using GenePix software (Axon, Inc.).
Spotfire facilitates the seamless import of Affymetrix output files (.met) from Affymetrix MAS v. 4.0 or v. 5.0 software. The .met file is a tab-delimited text file containing information about attributes such as probe set level, gene expression levels (signal), and detection quality controls (p value and Absence/Presence calls). In the illustration below, MAS 5.0 .met files are used as an example. Several types of spotted arrays and corresponding data types exist, including those from commercial vendors (Agilent, Motorola, and Mergen) that supply spotted microarrays for various organisms, and those from facilities that manufacture their own chips. Several different scanners and scanning software packages are available; one of the more commonly used scanners is the Axon GenePix (Axon, Inc.). GenePix data files are in a tab-delimited text format (.gpr), which can be directly imported into a Spotfire session.

OVERVIEW OF SPOTFIRE VISUALIZATION WINDOW

Upon launch, the Spotfire DecisionSite application window appears as illustrated below (Fig. 7.7.1). One can choose which DecisionSite module to open by clicking on the orange Navigate button. Upon selection of the Functional Genomics module, a suite of applications geared to microarray data analysis is made available, including database access, data preprocessing, statistical analysis, and domain-specific tasks. The Spotfire application window contains four functionally different areas (Fig. 7.7.1): the DecisionSite Navigator on the left side, which is a Web browser; a visualization window in the center containing one or more visualizations of the data (the user needs to import data into a Spotfire session for visualizations to appear; UNIT 7.8); query devices on the top right, which allow manipulation of the data (UNIT 7.8); and "Details-on-Demand" on the bottom right, which allows cross-examination of selected data (UNIT 7.8).
DecisionSite NAVIGATOR
The DecisionSite Navigator is a Web browser that is fully integrated into the Spotfire DecisionSite environment. The Navigator is used to connect to the Spotfire DecisionSite server, providing access to several powerful tools, guides, and resources (Fig. 7.7.2). The DecisionSite Navigator is opened by default when the user launches Spotfire. The user can open and close the Navigator by clicking the "bull's-eye" button on the far left of the toolbar. The Navigation toolbar is used to navigate the content of the DecisionSite Navigator, and has the same basic features as any popular Web browser, such as Back, Forward, Stop, Refresh, and Home buttons. The Navigation toolbar can be hidden or revealed by selecting Navigation toolbar from the View menu.

Figure 7.7.1 Various components of Spotfire DecisionSite.
Figure 7.7.2 Different components of the DecisionSite Navigator.

The DecisionSite Navigator contains three different panes, or windows, each with a different collection of hyperlinks:
1. The Guides pane, which contains step-by-step directions to perform selected tasks or workflows, most useful for beginners. Guides are a good way to learn the function of tools that are used repetitively and that are needed to achieve a specific goal.
2. The Tools pane, which contains direct links to tools and functions that directly affect manipulation of microarray data. These include tools for importing, pre-processing, normalizing, clustering, and statistically validating the data. For further discussion of these tools see UNIT 7.8 and UNIT 7.9. These tools are arranged in a hierarchical manner in the Navigator, depending on the type of tool. The header for each level of hierarchy is boldfaced. Clicking on any header expands the tree and reveals the tools underneath. Clicking again will collapse the tree.
The following tools are available in Spotfire DecisionSite for Functional Genomics: Portfolio (allows users to save subsets of data for overlap and Venn-logic comparison); Access (allows users to import multiple files of Affymetrix and GenePix data with tools such as Import Affymetrix data, Import GenePix data, Import GEML file, Celera discovery system, Add columns, and Weblinks); Analysis (contains tools for actual manipulation of array data, such as Normalization, Row Summarization, Pivot and Transpose data, Hierarchical clustering, K-means clustering, Principal Components Analysis, Treatment comparison with ANOVA, Profile search, and Coincidence testing); and Reporting (allows users to export data and visualizations to external software such as PowerPoint or Word).
3. The Resources pane, which contains links to various information sources such as the "Functional Genomics companion," which maintains a collection of articles and Webcasts about Spotfire; the Resources section also contains a detailed user's manual.

VISUALIZATIONS
Visualizations are key to the analytical power of Spotfire (UNIT 7.9). Spotfire can display nine types of visualizations, each one providing a unique view of the data. Different visualizations are linked and updated automatically when query devices are manipulated. Visualizations allow high-dimension data to be readily displayed and enhanced by manipulating values that control visual attributes such as size, color, shape, rotation, and text labels. The nine types are: 2-D scatter plots, 3-D scatter plots, histograms, bar charts, profile charts, line charts, pie charts, heat maps, and tables. By setting properties such as color, shape, and size, each visualization can be tailored to personal taste and the specific task at hand. New visualizations are created by clicking one of the toolbar buttons, or by selecting a visualization from the Visualization menu.

Figure 7.7.3 The Properties dialog box.
Visualization properties are controlled from the Properties dialog (Fig. 7.7.3). Click the toolbar button, or select Properties from the Edit menu, to open this dialog.

Initial Visualization: 2-D Scatter Plot
Data are imported into Spotfire as a table with rows and columns (short/wide format). Each record (spot or probe set in microarray data) is assigned a row in the table. The various measurements made for every record are assigned columns in the table. When working with a single data set, it is possible to look at the expression behavior of a particular gene. It is also possible to analyze several experiments with the same number of records simultaneously. When a data set is loaded into Spotfire, the program produces an initial default visualization in the form of a two-dimensional scatter plot. The columns to be displayed as the x and y axes are initially suggested by the program and may not be particularly helpful. Any data column that has been imported into the Spotfire session can, however, be used in a scatter plot. Users can specify the columns to be used in the scatter plot by selecting the appropriate columns from the X- and Y-axis column selectors (Fig. 7.7.4). New two-dimensional scatter plots can be created by clicking the "2-D" button on the toolbar, by pressing Ctrl-1, or by selecting New 2D Scatter Plot from the Visualization menu (Fig. 7.7.4). The user can zoom in and out of a scatter plot in many ways, e.g., using the zoom bars or the mouse. Dragging the end arrows of the zoom bars (along the edges of the visualization window) zooms in on a portion of the visualization. Dragging the bar itself (by placing the mouse pointer on the yellow bar and dragging) pans across different areas of the entire visualization. The pale yellow area represents the selected range of values, whereas the bright yellow area represents the range of existing values within the selected range. The zoom bar can be adjusted to encompass only the currently selected data.
The user can set the Data Range to the selected records by: moving the drag box of the zoom bar to narrow the selection; right-clicking on the zoom bar; and choosing Select Data Range from Zooming in the pop-up menu. The zoom bar expands to its full width, but with the Data Range set to encompass only the selected records. Three dots are displayed to indicate that the range is not the original full range. To reset the Data Range, right-click the zoom bar and select Reset Data Range.

Figure 7.7.4 Various features of the scatter plot visualization in Spotfire.

Coloring plots
Markers can be colored to reflect the value of a particular attribute. There are three modes for coloring: Fixed, Continuous, and Categorical (Fig. 7.7.3). Fixed coloring means that all markers are the same color (except deselected, empty, and marked markers). Categorical coloring means that each value in the chosen column is given its own color. Categorical coloring makes most sense if there are fewer than ten unique values. To control which color is assigned to each value, click Customize. Continuous coloring means that the maximum and minimum values in the selected column are each assigned a color. Intermediate values are then assigned colors on a scale ranging between the two extreme colors. In scatter plots, any column can be used for continuous coloring. Colors representing minimum and maximum values are set with the Customize dialog. Begin and End categories define the color limits. When one of the categories is selected, one can choose which color will represent that end of the value range. A line with the color scale is displayed below the corresponding query device. By default, deselected records (i.e., records that have been filtered out using the query devices) are invisible. It is possible to keep them visible, but colored differently.
Check the box labeled "Show deselected," and set the color by pressing Customize (Fig. 7.7.5). Regardless of coloring mode, the choice of colors can be controlled by clicking Customize on the Markers tab of the Properties dialog. Depending on the current coloring mode, the top-most list will display the fixed color, the Begin and End colors (Continuous mode), or the color associated with each category (Categorical mode). The other list displays colors associated with deselected, empty, and marked records ("empty" refers to records for which no value is specified in the column used for coloring). To change a color, click the category to modify, then click a color in the palette. To revert to default coloring, click Default Colors. To select a color from the complete palette, click "Other. . .".

Figure 7.7.5 The Customize Colors window for the scatter plot (and other visualizations).
Figure 7.7.6 The Customize Shapes window for the scatter plot.

Marker size and other properties
The size of markers can be made to reflect the value of a particular column. Select a column from the drop-down list under Size. Moving the slider changes the size of all markers, while maintaining the size ratio of different markers. It is possible to customize the size of different markers. The shape of markers can be fixed, or made to reflect the value of a particular column. Click Fixed or By to alternate between these modes. Only columns with fewer than 16 distinct values can be used for controlling shapes. Click Customize to choose appropriate shapes for each value (Fig. 7.7.6). It is possible to customize the size of the markers in a scatter plot; this overrides the usual size slider on the Markers tab of the Properties dialog. To customize the marker size, select a value, then select a shape for that value. Next, check the "Specify size" check box and enter Width and Height. These values are relative to the scale used in the current visualization.
Look at the scale used in the current visualization to determine how large the markers should be. It is also possible to specify the order in which the markers of a scatter plot are drawn, to use rotation of markers to reflect the value of a column, and to tag each marker with a label showing the value of a particular column, using one of the following options: "None": no labels are visible; "Marked records": only records that are marked will have labels next to them (maximum 1000 records); "All records, max": all records (up to a configurable maximum number) will have labels next to them.

3-D Scatter Plots
3-D scatter plots allow even more information to be encoded into visualizations. They are especially useful when analyzing data that are not clustered along any of the axes (columns) of the data set. A new 3-D scatter plot is created in one of the following ways: click the "3D" button on the toolbar, press Ctrl-2, or select New 3D Scatter Plot from the Visualization menu (Fig. 7.7.7). While navigating 3-D visualizations, the zoom bars are used as in 2-D. Additionally, holding down Ctrl and dragging with the right mouse button allows users to rotate the graph, while holding down Shift and dragging with the right mouse button allows zooming. 3-D scatter plots are ideal visualizations for viewing results of Principal Components Analysis (UNIT 7.9).

Figure 7.7.7 The 3D Scatter Plot visualization in Spotfire. (This black and white facsimile of the figure is intended only as a placeholder; for the full-color version, go to http://www.interscience.wiley.com/cp/colorfigures.htm.)

Histograms and Bar Charts
Histograms and bar charts can effectively analyze very large data sets.
New histograms are made in one of the following ways: click the Histogram button on the toolbar, press Ctrl-3, or select New Histogram from the Visualization menu. Bar charts are created in one of the following ways: click the Bar Chart button on the toolbar, press Ctrl-4, or select New Bar Chart from the Visualization menu. In traditional bar charts, the height of a bar is the sum of the values of the records in a certain column. In histogram-type visualizations, the height of a bar is proportional to the number of records in each category of the "X axis" column. The attributes of both histograms and bar charts can be altered from the Bars tab in the Properties dialog.

Profile Charts
A profile chart maps each record as a line, or profile. Each attribute of a record is represented by a point on the line. This makes profile charts similar in appearance to line charts, but the way data are translated into a plot is substantially different. Profile charts are an ideal visualization for t test/ANOVA calculations (UNIT 7.9), as they provide a good (if somewhat simplified) overview of characteristics. To create a profile chart, press Ctrl-7, click the New Profile Chart button on the toolbar, or select New Profile Chart (Fig. 7.7.8) from the Visualization menu. Next, go to the axis selector of the x axis and uncheck solitary columns that are not to be included in the chart, such as identifier columns, or go to the Profile Columns tab of the Properties dialog to change multiple columns. The Properties dialog can be used to adjust the various properties of the chart.

Figure 7.7.8 The Profile Chart visualization in Spotfire.

Heat Map Plots
Heat Map plots are also known as Intensity plots or Matrix plots.
A Heat Map can be likened to a spreadsheet, where the values in the cells are represented by colors instead of numbers. More specifically, a Heat Map is a type of plot in which the pivoted (short/wide) data are presented as a matrix of rows and columns, where the cells are of equal size and the information represented by the color of the cells is the most important property. Heat Maps can be used to identify clusters of records with similar values, as these are displayed as “areas” of similar color. New Heat Maps are created in one of the following ways: click the Heat Map button on the toolbar, press Ctrl-8, or select New Heat Map from the Visualization menu (Fig. 7.7.9). Heat Maps are controlled from two tabs in the Properties dialog: the Heat Map Columns tab, where one selects which columns are to be included in the visualization, and the Colors tab, where one can customize the coloring of the Heat Map. The Heat Map Columns tab of the Properties dialog is used to organize the columns in the heat map. Fig. 7.7.10 shows the Properties dialog with the Heat Map Columns tab selected. The list on the right-hand side shows the columns that are included in the visualization, while the one on the left shows those that are not. Use the Add and Remove buttons to move columns between the two lists, or click Remove All to remove all columns from the “Columns in heat map” list. The list of available columns can be sorted by checking the box labeled “List columns alphabetically,” or by clicking the Column field in the column heading. The Colors tab of the Properties dialog is used to modify the color range of the heat map. The default color range is set to green for minimum values, black for intermediate values, and red for maximum values. To apply a specific color range to one or more columns, select the appropriate column(s) from the list, then choose a range from the “Color range” drop-down list box, and finally click the Apply to column(s) button. 
Use Shift or Ctrl to select several columns at a time. To change the color range of one or more columns, it is necessary to create a new range. Click the New button to open the Create New Color Range dialog. Enter a new range name in the text field at the top, then select Categorical Coloring or Continuous Coloring (Fig. 7.7.11). Depending on which radio button is selected, the lower part of the window will change to show the relevant suboptions. Categorical Coloring means that each unique value in the heat map is represented by its own color. This is most useful when dealing with a small number of distinct values or when looking for identical values in a heat map. To change the color of a particular value, select that value from the list, and then choose a new color for it from the palette. Continuous Coloring means that the color range runs linearly from one specific color to another color, via a third middle color. By default, this is set to show low values in shades of green, intermediate values going toward black, and high values in shades of red.

Figure 7.7.9 The Heat Map visualization in Spotfire.
Figure 7.7.10 Heat Map Properties dialog box.
Figure 7.7.11 The Edit Color Range dialog box allows users to choose the colors for their heat map visualization.
Select new colors to represent the Min, Mid, or Max values by clicking on their corresponding color button and picking a new color from the palette that appears. Continuous Coloring is further divided into three sub-options. "Shared custom range" means that it is possible to specify an exact Min, Mid, and Max value for the color range, instead of these values being automatically determined; "shared" means that all selected columns will be colored according to these values regardless of their own individual Min and Max values. "Shared auto range" means that the Min, Mid, and Max values for the range are automatically set to the lowest, middle, and highest values that exist across all the selected columns; again, all selected columns are colored according to these values regardless of their own individual Min and Max values. "Individual auto range" means that the Min, Mid, and Max values for the range are automatically set to the lowest, middle, and highest values, respectively, that exist in each individual column. This means that all selected columns will be colored according to their own individual Min and Max values. Making a record active or marking several records in a heat map plot differs somewhat from the method used with other plots. In a heat map, one row always equals one record. Consequently, one always selects or marks one or more entire rows, equaling one or many records. When one clicks on a row, a black triangle appears at both ends of the selected row to indicate that it is active. Information about the row is displayed in the Details-on-Demand window. By clicking and holding the mouse button while the mouse pointer is on a row and dragging it to cover several rows, these rows all become marked. This is indicated by a small bar shown at the left and right of the rows in question. Details on these records are shown in the Details-on-Demand window.
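The default green-black-red continuous range described above amounts to two linear interpolations, one below and one above the midpoint. The sketch below illustrates the idea (it is not Spotfire's actual rendering code); with an "individual auto range," vmin and vmax would be each column's own extremes, while a "shared auto range" would take them across all selected columns.

```python
def heat_color(value, vmin, vmax,
               low=(0, 255, 0), mid=(0, 0, 0), high=(255, 0, 0)):
    """Green-black-red heat map mapping: values below the midpoint blend
    low -> mid, values above the midpoint blend mid -> high."""
    vmid = (vmin + vmax) / 2.0
    if value <= vmid:
        a, b, lo, hi = low, mid, vmin, vmid
    else:
        a, b, lo, hi = mid, high, vmid, vmax
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))          # clamp out-of-range values
    return tuple(round(x + t * (y - x)) for x, y in zip(a, b))

# "Individual auto range": one column colored by its own extremes.
column = [-2.0, -1.0, 0.0, 2.0]
colors = [heat_color(v, min(column), max(column)) for v in column]
```

The same function with global extremes passed in for vmin/vmax reproduces the "shared" behavior, since every column is then mapped against the same endpoints.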
Tables
The Table visualization presents the data as a table of rows and columns. The Table can handle the same number of rows and columns as any other visualization in DecisionSite. In the Table, a row represents a record. By clicking on a row, that record is made active, and by holding down the mouse button and dragging the pointer over several rows, it is possible to mark them. One can sort the rows in the table according to different columns by clicking on the column headers, or filter out unwanted records by using the query devices. To create a Table, press Ctrl-9, click the New Table button on the toolbar, or select New Table (Fig. 7.7.12) from the Visualization menu. One can then click on the header of the column by which the rows are to be sorted, or rearrange the order of the columns by dragging and dropping the column headers horizontally. Use the Properties dialog to further adjust the various properties of the chart.

Annotating Visualizations and Changing Visualization Columns
It is possible to give any visualization a title and an annotation. The title will appear as the caption of the window. It can also appear in the heading of printouts. The annotation will appear as a tool tip when the mouse pointer is placed over the paper clip at the bottom-left corner of the visualization. To set the title and annotation: 1. Go to the Annotations tab of the Properties dialog (Fig. 7.7.13); 2. Enter a title and/or an annotation; 3. Check "Append axes name to visualization title" if the current axes are to be appended to the title. One can type a great deal of text into the Annotation field, as well as cut and paste to and from other Windows applications. Which and how many columns are included in a visualization can be controlled through the Properties dialog in most visualizations (Fig. 7.7.14). In the scatter plot, the Columns tab houses this information.
One or more columns can be selected by clicking (or Ctrl-clicking); the selected columns can then be deleted, moved up or down, or renamed, or their scale can be reset.

Figure 7.7.12 The Table visualization allows users to view data in a sortable spreadsheet format. Like other visualizations, the Table is dynamically linked to the query devices and to other visualizations.
Figure 7.7.13 Annotations can be appended to most visualizations (example shown here with the scatter plot) through the Properties dialog box.
Figure 7.7.14 The number and type of columns in a scatter plot can be controlled via the Columns tab in the Properties dialog.

Handling Multiple Visualizations
Users will often need to work simultaneously with multiple visualizations in Spotfire. Bar charts and histograms are powerful tools for analyzing aggregate data, while scatter plots can reveal trends and correlations. Specific tools like hierarchical clustering (UNIT 7.9) will generate specific visualizations such as heat maps. Spotfire DecisionSite is able to show multiple visualizations, each one as a window presenting the same data, but in different ways. The visualizations may have dissimilar coloring or axes, or even be of different types: one a 3-D scatter plot, another a bar chart. Each visualization can fill the entire window, all can be seen simultaneously, or each can reside on its own tab of a workbook. When operating the query devices, all visualizations are simultaneously updated, showing alterations in all visualizations when a factor is changed. When a marker is highlighted, it is highlighted in all visualizations simultaneously. New visualizations are created by selecting the appropriate icon or a command from the Visualization menu, or by using the keyboard shortcuts shown in the same menu.
There are several ways to reposition windows; the commands governing these functions all reside in the Window menu. Auto Hide Axis Selectors: when the visualization is small enough, this option automatically hides the zoom bars and the axis selectors. Hide Window Frame: hides the title bar, giving more space to the visualizations; this option is only available when several visualizations have been tiled. Auto Tile: arranges all the windows on screen according to an internal algorithm; the active visualization is given leftmost, uppermost, and size priority (Fig. 7.7.15). Cascade: arranges the visualization windows so that they partially overlap each other, leaving each window accessible by clicking on its title bar. Tile Horizontal: splits the window area horizontally according to the number of visualizations, giving each visualization equal area. Tile Vertical: splits the window area vertically according to the number of visualizations, giving each visualization equal area.

Figure 7.7.15 The auto-tile feature allows all the visualizations present in a particular Spotfire session to be viewed at once.

QUERY DEVICES
Spotfire DecisionSite automatically creates query devices when a data set is loaded. One device is created for each column of data. This section describes query devices within Spotfire and how they can be used. A Spotfire DecisionSite query device is a visual tool for performing dynamic queries against an underlying data set. Query devices are used to filter microarray data without the need to know Structured Query Language (SQL). Query devices include sliders, check boxes, and other graphic controls used to filter the data shown in the visualization, as described in the following paragraphs. A query device is always associated with a specific data column (Fig. 7.7.16).

Range Slider
A range slider is used to select records with values in a certain range.
The left and right drag boxes can be used to change the lower and upper limits of the range, meaning that only records with values within the chosen range are selected and are therefore visible in the visualization. Labels above the slider indicate the selected span. The range can also be adjusted with the arrow keys when the query device is active: left and right arrows move the lower limit (left drag box), and up and down arrow keys move the upper limit. A minimum or maximum value can be typed into a range slider. The user can double-click on the minimum or maximum number above the drag box and then enter the desired value in the edit field. Alternatively, one can click on the left or right drag box; no edit field will appear, but by simply typing the desired value, the slider will adjust. The currently selected interval of the range slider can be grabbed and moved to pan the selected range; this provides a powerful way of sweeping over different "slices" of a data set. Click and drag the yellow portion of the range slider to do this. Observing the reactions of the other sliders to such a sweep can give some interesting clues to correlations between parameters in the data set. If other query devices impose further restrictions, then the result may be that parts of the interval of the range slider are unpopulated. This area is indicated with a pale yellow color, as opposed to the bright yellow color that indicates the populated interval. An important feature of the range slider is that the values are distributed on a linear scale according to the values of the data. Therefore, if values are unevenly distributed, this will be reflected in the range slider. This is not the case with item sliders, where values are evenly distributed along the range of the slider, regardless of what values appear in the column.

Figure 7.7.16 Various types of query devices are assigned to different data columns.
The range slider can be set to span the current selection of data by double-clicking at the center of the range slider.

Item Slider
An item slider is used to select individual values in a column. In an item slider query device, data items are evenly distributed on a continuous linear scale. However, the item slider selects only a single item at a time. The selected value is displayed as a label above the slider. As a special case, all items are selected when the slider handle is at the extreme left of the scale. A specific value can be typed into an item slider. The user can achieve this either by double-clicking on the number above the slider and then typing the desired value, or by clicking on the drag box itself and then typing the desired value. Note that in the latter case no edit field will appear in which to type the value; simply type the value after clicking, and the item slider will adjust itself to the nearest possible value. The scope of an item slider depends on the settings of other query devices. This means that the item slider range constantly changes as one manipulates the query devices. When the input focus is set on the slider (marked by a dotted line), the arrow keys on the keyboard can be used to adjust the slider to the exact position of an entry: up and right arrows move to the next record, down and left to the previous one. When the item slider drag box is moved to its leftmost position, all values for the slider are selected, as indicated by the label (All) above the slider.

Check Boxes
Check boxes, one for each value appearing in the corresponding column in the data set, are used to select or deselect the values for appearance in the visualization. Check boxes are typically used when the record field holds just a few distinct values.
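Conceptually, each query device reduces to a per-column boolean mask, and a record stays visible only if every device's mask keeps it; this is why manipulating one device re-filters all linked visualizations at once. A minimal sketch of that logic for the two slider types described above (hypothetical data, not Spotfire's API):

```python
def range_filter(values, lo, hi):
    """Range slider: keep records whose value lies within [lo, hi]."""
    return [lo <= v <= hi for v in values]

def item_filter(values, item):
    """Item slider: keep records equal to the single selected item
    (None stands for the leftmost, select-all position)."""
    return [item is None or v == item for v in values]

# Hypothetical columns for four records.
signal = [12.0, 250.5, 87.2, 1523.4]
call = ["A", "P", "A", "P"]

# Each query device contributes one mask; a record is visible only if
# every mask keeps it.
masks = [range_filter(signal, 50, 2000), item_filter(call, "P")]
visible = [all(kept) for kept in zip(*masks)]
```

Here only the second and fourth records survive both devices, which is exactly the intersection behavior described for combined query devices.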
In a check box query device, each unique value is represented by a check box, which is either checked (selected in the local context) or unchecked (deselected). If all records with a certain value are deselected by some other query device, the label of that value becomes red. If coloring is set to categorical, ticking a check box causes all records of that particular color to show (unless they are deselected by another query device). By default, Spotfire DecisionSite assigns check boxes to any column containing ten or fewer values. Initially, all boxes are checked, which makes all records in the data set visible. For quick checking or unchecking of all the values, right-click on the check box query device and select All or None from the pop-up menu. Like radio buttons, check boxes provide options that are either on or off. Check boxes differ from radio buttons in that check boxes are typically used for independent or nonexclusive choices.

Radio Buttons
Radio buttons are similar to check boxes, but enable only one choice among the alternatives. In a radio button query device, a radio button represents each unique value. Radio buttons, also referred to as option buttons, represent a single choice within a limited set of mutually exclusive choices. That is, in any group of option buttons, only one option in the group can be set. However, an All option is always present among the radio buttons, which makes it possible to select all the records in that column. If all records with a certain value are deselected (by this or some other query device), the label of that value becomes red.

Full Text Search
Full-text search query devices permit the search for a specific string of alphanumeric characters with the use of Boolean operators. The full-text search query device allows users to search for (sub)strings within columns. It also allows one to search for a pattern by using regular expressions.
For example, one can enter a pattern that means "a letter followed by two digits." Alternatively, users can search for strings that do not contain regular expressions with a normal-text search. The search can be made as complex as desired by use of the logical operators AND (&) and OR (blank space). Search expressions are evaluated from left to right. Once the search string has been entered, pressing the Enter key executes the search. All records matching the search criteria will be shown in the visualization window. The full-text search query device also supports Cut/Copy/Paste of text strings using the Ctrl-X, Ctrl-C, and Ctrl-V keystroke combinations. Spotfire's default choice of query devices is based on the column content and the number of unique values present in the data set for that attribute. If a column contains ten unique values or fewer, check boxes are assigned as the query device. For columns containing more than ten values, an item slider is chosen for alphanumeric (string) attributes, such as names and descriptions. Range sliders are assigned to numeric columns such as date, time, and decimal or integer values. Users can change the type of query device used for a column, with one restriction: check boxes and radio buttons can only be used for columns having fewer than 500 unique values. The currently selected query device is marked with a bullet. To change the type of query device, right-click the query device to make the pop-up menu appear, then select the appropriate query device option from the pop-up menu, or select the Columns tab of the Properties dialog. This tab contains a list of all the columns in the data set.

Figure 7.7.17 (A) Records can be marked by left-clicking the mouse and dragging the cursor around the desired region. (B) Marking records in an irregular shape (by lasso) can be achieved by pressing Shift while left-clicking the mouse and dragging the cursor around the desired region.
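The search semantics just described, with '&' joining terms as AND and a blank space acting as OR, can be sketched in a few lines. The precedence used here (OR over AND-groups) is an assumption for illustration, and the example pattern is the "letter followed by two digits" case mentioned above:

```python
import re

def full_text_search(strings, expression, regex=True):
    """Return indices of strings matched by a search expression in which
    '&' joins terms with AND and a blank space joins groups with OR.
    Terms are regular expressions (or plain substrings if regex=False)."""
    def matches(term):
        pattern = term if regex else re.escape(term)
        return {i for i, s in enumerate(strings) if re.search(pattern, s)}

    hits = set()
    for group in expression.split():        # blank space = OR
        group_hits = None
        for term in group.split("&"):       # '&' = AND within a group
            found = matches(term)
            group_hits = found if group_hits is None else group_hits & found
        hits |= group_hits
    return sorted(hits)

descriptions = ["kinase A12", "red fox", "red dog", "blue dog"]
letter_two_digits = full_text_search(descriptions, r"[A-Za-z][0-9][0-9]")
red_dog_or_blue = full_text_search(descriptions, "red&dog blue")
```

The first query matches only "kinase A12"; the second keeps records containing both "red" and "dog", plus any record containing "blue".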
the columns in the data set. Mark a column and select the type of query device to use for that column. From the Columns tab of the Properties dialog, one can also make new columns from expressions or by binning, as well as delete columns from the Spotfire DecisionSite internal database. The query devices can be sorted in four ways: by original order, by annotation, by name, or by type. For example, users can group all range sliders together, or sort the query devices in alphabetical order. To sort the query devices, right-click on a query device, select Sort from the pop-up menu, and select Original, by Annotation, by Name, or by Type. Alternatively, users may want to regroup query devices and rearrange their order to avoid having to scroll up and down to keep track of the changes. The initial order of the query devices depends on the structure of the dataset loaded into Spotfire DecisionSite or on the SQL query (UNIT 9.2) that was used to acquire the data. This can be changed as needed by rearranging columns in the originating spreadsheet program or by writing the SQL query in a certain order.
Figure 7.7.18 Details-on-Demand window shows a snapshot of the marked data. Data shown in this window can be exported to Excel or as text/HTML data.
Figure 7.7.19 Details-on-Demand window can also be used to exhibit data for a single highlighted record.
Figure 7.7.20 (A) Details-on-Demand (HTML) format. (B) Selecting the external Web browser option from the View tab allows export of the HTML data to an external browser window (C).
DETAILS-ON-DEMAND
The Details-on-Demand window allows the user to display all data linked to a particular record or a set of records in text or HTML format.
Records of interest can be marked using the left mouse button to create a rectangular box around the records (Fig. 7.7.17A), or by using a lasso (Fig. 7.7.17B), i.e., by pressing the Shift key and using the left mouse button to surround the records with a line drawn in an arbitrary shape. For all marked records, the Details-on-Demand window shows linked data in text format (Fig. 7.7.18). Instead of marking a set of records, the Details-on-Demand window can also be used to display data linked to an individual record by simply clicking on that record to highlight it. This creates a circle around that record, and the Details-on-Demand window displays data corresponding only to that record (Fig. 7.7.19). The Details-on-Demand feature can export marked data to Excel; data can also be exported as an HTML file (Fig. 7.7.20A) within Spotfire or in an external browser window (Fig. 7.7.20B and C).
STRENGTHS AND WEAKNESSES OF SPOTFIRE AS DESKTOP MICROARRAY ANALYSIS SOFTWARE
Spotfire is a powerful tool for data mining and visualization. Its strengths include the ability to import data from a number of databases for visualization in a single session. This feature makes it a “virtual data warehouse” and allows evaluation of, for example, gene expression data and corresponding proteomics data in a single session. Its ability to manipulate visualizations is superb, and it has a substantial collection of the most frequently used statistical analysis algorithms. However, Spotfire does have limitations. It is currently unable to perform LOWESS (intensity-dependent) normalization, which has particular relevance to the two-color microarray system. It is unable to perform a false-discovery rate calculation after obtaining an ANOVA-based list of significantly differential genes, and it lacks powerful statistical tools such as multiple testing corrections and mixed-model ANOVAs.
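For readers who want to see what the missing LOWESS step does, the sketch below illustrates intensity-dependent centering of log ratios on hypothetical two-color data. It substitutes a crude moving-median for the locally weighted regression that true LOWESS performs (e.g., as available through Bioconductor), so it is an illustration of the idea, not a replacement implementation:

```python
import math
import statistics

def ma_values(cy5, cy3):
    """Per-spot M (log2 ratio) and A (average log2 intensity) from two channels."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(cy5, cy3)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(cy5, cy3)]
    return m, a

def intensity_dependent_center(m, a, window=5):
    """Crude stand-in for LOWESS normalization: subtract from each spot's M the
    median M of its nearest neighbors in A. True LOWESS instead fits a locally
    weighted regression of M on A and subtracts the fitted curve."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    position = {i: pos for pos, i in enumerate(order)}
    corrected = []
    for i in range(len(m)):
        lo = max(0, position[i] - window // 2)
        hi = min(len(order), lo + window)
        neighbors = [m[order[j]] for j in range(lo, hi)]
        corrected.append(m[i] - statistics.median(neighbors))
    return corrected
```

With a constant twofold dye bias, every corrected log ratio comes out near zero, which is the intended effect of intensity-dependent normalization.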
It is unable to directly upload Affymetrix .cel files to obtain and analyze data at the probe-set level, and it lacks publication-quality graphics, among other deficits. Fortunately, Spotfire is also easily modified to incorporate new algorithms via its Application Programming Interface or through custom development by Spotfire staff. For example: 1. Spotfire has recently offered a custom-package upgrade to integrate powerful gene-expression data analysis tools from the statistical language R (http://www.bioconductor.org), allowing users to access and deploy R scripts to perform, e.g., LOWESS normalization, mixed-model ANOVA, false-discovery rate calculation, or Bonferroni correction. 2. Spotfire now offers a tool that interacts with the pathway-generating tool from Jubilant Biosciences, thereby allowing users to scan their expression data for pathway-based relationships. 3. Spotfire has entered into a partnership with Rosetta Inpharmatics that allows licensed users to benefit from the error-modeling algorithms of Rosetta Resolver. 4. Spotfire is now offering a custom-package upgrade to analyze several types of proteomics data in a manner similar to gene-expression data. Spotfire thus offers much of the functionality one would typically require in a desktop gene expression analysis application, along with significant flexibility in adapting the application to one’s own environment and needs.
Literature Cited
Cheok, M.H., Yang, W., Pui, C.H., Downing, J.R., Cheng, C., Naeve, C.W., Relling, M.V., and Evans, W.E. 2003. Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nat. Genet. 34:85–90.
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C., Trent, J.M., Staudt, L.M., Hudson, J. Jr., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P.O. 1999. The transcriptional program in the response of human fibroblasts to serum. Science 283:83–87.
Kerr, M.K. and Churchill, G.A. 2001.
Experimental design for gene expression microarrays. Biostatistics 2:183–201.
Kozal, M.J., Shah, N., Shen, N., Yang, R., Fucini, R., Merigan, T.C., Richman, D.D., Morris, D., Hubbell, E., Chee, M., and Gingeras, T.R. 1996. Extensive polymorphisms observed in HIV-1 clade B protease gene using high-density oligonucleotide arrays. Nat. Med. 2:753–759.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., Zeitlinger, J., Jennings, E.G., Murray, H.L., Gordon, D.B., Ren, B., Wyrick, J.J., Tagne, J.B., Volkert, T.L., Fraenkel, E., Gifford, D.K., and Young, R.A. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799–804.
Leung, Y.F. and Cavalieri, D. 2003. Fundamentals of cDNA microarray data analysis. Trends Genet. 19:649–659.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470.
Schena, M., Heller, R.A., Theriault, T.P., Konrad, K., Lachenmeier, E., and Davis, R.W. 1998. Microarrays: Biotechnology’s discovery platform for functional genomics. Trends Biotechnol. 16:301–306.
Smyth, G.K., Yang, Y.H., and Speed, T. 2003. Statistical issues in cDNA microarray data analysis. Methods Mol. Biol. 224:111–136.
Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., Cheng, C., Campana, D., Wilkins, D., Zhou, X., Li, J., Liu, H., Pui, C.H., Evans, W.E., Naeve, C., Wong, L., and Downing, J.R. 2002. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1:133–143.
Contributed by Deepak Kaushal and Clayton W. Naeve
St.
Jude Children’s Research Hospital
Memphis, Tennessee
UNIT 7.8: Loading and Preparing Data for Analysis in Spotfire
Microarray data exist in a variety of formats, which often depend on the particular array technology and detection instruments used. These data can easily be loaded into Spotfire DecisionSite (UNIT 7.7) by a number of methods, including copying/pasting from a spreadsheet, direct loading of text or comma-separated (.csv) files, or direct loading of Microsoft Excel files. Data can also be loaded via preconfigured or ad hoc queries of relational databases, and from proprietary databases and export file formats from microarray manufacturers such as Affymetrix (see Alternate Protocol 1) and Agilent, or from scanner software such as GenePix (see Basic Protocol 1). Once the data are loaded, it is necessary to filter and preprocess the data prior to analysis (see Support Protocol 1). Subsequently, data transformation and normalization are critical steps in correctly performing microarray data mining. These steps extract or enhance meaningful data characteristics and prepare the data for the application of certain analysis methods, such as statistical tests to compute significance and clustering methods (UNIT 7.9), which mostly require data to be normally distributed. A typical example of a transformation method is calculating the logarithm of raw signal values (see Support Protocol 2). Normalization is a type of transformation that accounts for the systematic biases that abound in microarray data. One may wish to normalize the data within an experiment (see Basic Protocol 2) or between multiple experiments (see Basic Protocol 3). During these processes it may be useful to combine data from multiple rows (see Basic Protocol 4). NOTE: UNIT 7.7 provides a general introduction to the Spotfire program and environment.
This unit strictly focuses on data preparation within Spotfire. Readers unfamiliar with Spotfire are encouraged to read UNIT 7.7.
BASIC PROTOCOL 1: UPLOADING GenePix DATA INTO SPOTFIRE
Spotfire allows the user to upload multiple spotted microarray data files in GenePix format (.gpr files) using a script that can retrieve the files from a database or from a network drive. While the original script was set up to retrieve version 3.0 .gpr files, it can be modified to recognize and import data from newer versions of GenePix data files, such as 4.0, 4.1, or 5.0. The script reads a .gpr file and skips the header based on the information provided in the .gpr file header about the number of rows and columns in the data file. It then allows the user to pick and choose the relevant columns of data from a .gpr file to upload to Spotfire.
Necessary Resources
Hardware
The recommended minimal hardware requirements are modest. The software will run on an Intel Pentium or equivalent with a 100-MHz processor, 64 MB RAM, and 20 MB disk space; a VGA or better display and 800 × 600 pixel resolution are needed. However, most microarray experiments yield large output files, and most experimental designs require several data files to be analyzed simultaneously, so the user will benefit from both much more RAM and a significantly faster processor.
Contributed by Deepak Kaushal and Clayton W. Naeve. Current Protocols in Bioinformatics (2004) 7.8.1-7.8.25. Copyright © 2004 by John Wiley & Sons, Inc.
Software
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
A standard install of Microsoft Internet Explorer; v.
5.0 through 6.0 may be used
MDAC (Microsoft Data Access Components); versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12) may be used
A Web connection to the Spotfire server (http://home.spotfire.net; UNIT 7.7) or a local customer-specific Spotfire server. A Web connection is also required to take advantage of Web Links for the purpose of querying databases and Web sites on the Internet using columns of data residing in Spotfire
Microsoft PowerPoint, Word, and Excel are required to take advantage of a number of features available within Spotfire related to export of text results or visualizations (UNIT 7.9)
Spotfire (6.2 or above) is required (see UNIT 7.7)
Files
Spotfire (Functional Genomics module) can import data in nearly any format, but the authors focus here on two-color spotted microarray data produced using GenePix software (Axon, Inc.). Several types of spotted arrays, scanners, scanning software packages, and corresponding data types exist, including those from commercial vendors (Agilent, Motorola, and Mergen) that supply spotted microarrays for various organisms, as well as those from facilities that manufacture their own chips. GenePix data files are in a tab-delimited text (.gpr) format that can be directly imported into a Spotfire session.
1. Run Spotfire (UNIT 7.7) and ensure that access is available to the .gpr files from either a network drive or a database. Depending on the type of setup, it may be necessary to log in to the Spotfire application as well as the data source. Systems and database administrators may be able to provide more information. In this example, a GenePix version 3.0 data file is used.
2. In the Tools pane on the left-hand side of the screen, click on Access, then on Import GenePix Files (Fig. 7.8.1). The Import GenePix Files dialog appears (Fig. 7.8.2A).
Figure 7.8.1 Tools pane with the Import GenePix Files tab highlighted.
Figure 7.8.2 (A) The Import GenePix Files dialog allows users to specify files to be uploaded into a Spotfire session. (B) The Data Import Options allow users to choose all or any columns from the data set.
3. Click Add. Point to the directory where the files to be analyzed are located, and double-click on the desired file. It is possible to load either a single file or multiple files with the help of the Shift key. The user may upload as many as seven files at one time. Uploading more than seven files will require repeating the process. The filename will appear in the center of the dialog box.
4. Specify the file(s) and click on the Columns button (Fig. 7.8.2A) to specify the data columns (Fig. 7.8.2B) to upload. One can choose to upload the entire file (requiring longer upload times). The 43 columns listed in Figure 7.8.2B are generated by the GenePix software and are related to the position (Block, Column, Row, X, and Y), identification (Name, ID), and morphology (Diameter) of the spot and its intensity in either the Cy5 or Cy3 channel (all other columns). B represents Background and F represents Fluorescence. 635 and 532 represent the two wavelengths used during scanning (532 for Cy3 and 635 for Cy5). Suggested columns to upload include F635 Median, B635 Median, F532 Median, B532 Median, Ratio of Medians, F635 Median-B635, F532 Median-B532, Flags, Norm Ratio of Medians, and Norm Flags.
5. Check all columns to import, then click OK. The Import GenePix Files window will appear again. Click OK again. Data will begin loading into Spotfire. This could take several minutes depending on the size and number of the data columns being uploaded and RAM/processor speeds.
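The suggested columns feed simple per-spot arithmetic. The sketch below shows the background correction and Ratio of Medians calculations those columns represent, using hypothetical median intensities; the function names are illustrative, not part of GenePix or Spotfire:

```python
def background_corrected(f_median, b_median):
    """Background-corrected signal for one channel,
    e.g. the F635 Median - B635 column."""
    return f_median - b_median

def ratio_of_medians(f635, b635, f532, b532):
    """Cy5/Cy3 ratio of background-corrected median intensities,
    corresponding to the GenePix Ratio of Medians column."""
    return background_corrected(f635, b635) / background_corrected(f532, b532)

# Hypothetical spot: Cy5 median 1200 over background 200,
# Cy3 median 700 over background 200.
ratio = ratio_of_medians(1200, 200, 700, 200)
```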
At the end of the data-upload process, Spotfire will automatically display an initial visualization where each record is represented by a marker, along with a number of query devices for manipulating the visualization. Alternative visualizations (UNIT 7.7) can be opened by clicking on the appropriate visualization toolbars, choosing Visualization from the File menu, or using the keyboard shortcuts Ctrl-1 through Ctrl-9.
6. Filter and preprocess the data as described in Support Protocols 1 and 2.
ALTERNATE PROTOCOL 1: UPLOADING AFFYMETRIX TEXT DATA INTO SPOTFIRE
Support for standard microarray platforms, such as Affymetrix, is integrated within DecisionSite for Functional Genomics. Spotfire allows the user to upload multiple Affymetrix data files in the metric text format (.met files) using a script that can retrieve these files from a database or from a network drive. A guide is available to upload data from both MAS 4.0 and MAS 5.0 versions. The MAS 5.0 guide also works with the latest Affymetrix software, GCOS 1.1. The script reads a .met file while largely ignoring the information provided in the header. It then allows the user to pivot the relevant columns of data from the .met file(s) to upload.
Necessary Resources
Hardware
The recommended minimal hardware requirements are modest. The software will run on an Intel Pentium or equivalent with a 100-MHz processor, 64 MB RAM, and 20 MB disk space; a VGA or better display and 800 × 600 pixel resolution are needed. However, most microarray experiments yield large output files, and most experimental designs require several data files to be analyzed simultaneously, so the user will benefit from both much more RAM and a significantly faster processor.
Software
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
A standard install of Microsoft Internet Explorer; v. 5.0 through 6.0 may be used
MDAC (Microsoft Data Access Components); versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12) may be used
A Web connection to the Spotfire server (http://home.spotfire.net; UNIT 7.7) or a local customer-specific Spotfire server. A Web connection is also required to take advantage of Web Links for the purpose of querying databases and Web sites on the Internet using columns of data residing in Spotfire
Microsoft PowerPoint, Word, and Excel are required to take advantage of a number of features available within Spotfire related to export of text results or visualizations (UNIT 7.9)
Spotfire (6.2 or above) is required (see UNIT 7.7)
Files
Spotfire (Functional Genomics module) can import data in nearly any format, but the authors focus here on commercial GeneChip microarray data (Affymetrix, Inc.). Spotfire facilitates the seamless import of Affymetrix output files (.met) from Affymetrix MAS v. 4.0 or v. 5.0 software. The .met file is a tab-delimited text file containing probe set-level attributes such as gene expression level (signal) and detection quality controls (p value and Absence/Presence calls). In the illustration below, MAS 5.0 .met files will be used as an example.
1. Run Spotfire (UNIT 7.7) and ensure that access is available to the .met files from either a network drive or a database. Depending on the type of setup, it may be necessary to log in to the Spotfire application as well as the data source. Systems and database administrators may be able to provide more information.
2.
In the Tools pane on the left-hand side, a plus sign (+) in front of the script Access indicates that it can be expanded to explore other items under this directory. Click on Access, then on Import Affymetrix Data. This reveals all options available for downloading Affymetrix data (from version 4.0 or 5.0 MAS files on a network drive, or from a local or remote database). Click on Import Affymetrix V5 Files (Fig. 7.8.3).
Figure 7.8.3 Tools pane with the Import Affymetrix v5 Files tab highlighted.
Figure 7.8.4 (A) The Import Affymetrix Files dialog allows users to specify files to be uploaded into a Spotfire session. (B) The Data Import Options allow users to choose all or any columns from the data set.
3. Clicking on Import Affymetrix V5 Files will open a window for the user to specify the files to upload to Spotfire (Fig. 7.8.4A).
4. Click Add. Point to the directory where the files to be analyzed are located, and double-click on the desired file. It is possible to load either a single file or multiple files with the help of the Shift key. The user may upload as many as seven files at one time. Uploading more than seven files will require that the process be repeated. The filename will appear in the center of the dialog box.
5. Specify the file(s) and click on the Columns button (Fig. 7.8.4A) to specify the data columns (Fig. 7.8.4B) to upload. One can choose to upload the entire file (requiring longer upload times).
6. Check all columns to import, then click OK. The Import Affymetrix Files window will appear again. Click OK again. Data will begin loading into Spotfire. This could take several minutes depending on the size and number of the data columns being uploaded and RAM/processor speeds.
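Conceptually, importing several .met files pivots per-chip rows (keyed by probe set ID) into one combined table with a signal column per chip. A minimal sketch with hypothetical parsed values; the real import script additionally handles .met parsing and column selection:

```python
# Hypothetical stand-ins for two parsed .met files: probe set ID -> signal.
chip1 = {"100_g_at": 512.3, "101_at": 48.7}
chip2 = {"100_g_at": 498.1, "101_at": 52.2}

def pivot(files):
    """Combine per-chip {probe_set: signal} maps into one table:
    {probe_set: [signal_chip1, signal_chip2, ...]}; missing values become None."""
    probe_sets = sorted(set().union(*files))
    return {p: [f.get(p) for f in files] for p in probe_sets}

table = pivot([chip1, chip2])
```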
At the end of the data-upload process, Spotfire will automatically display an initial visualization where each record is represented by a marker, along with a number of query devices for manipulating the visualization. Alternative visualizations can be opened by clicking on the appropriate visualization toolbars, choosing Visualization from the File menu, or using the keyboard shortcuts Ctrl-1 through Ctrl-9.
7. Filter and preprocess the data as described in Support Protocols 1 and 2.
SUPPORT PROTOCOL 1: FILTERING AND PREPROCESSING MICROARRAY DATA
Successfully completing microarray experiments includes assessing the quality of the array design, the experimental design, the experimental execution, the data analysis, and the biological interpretation. At each step, data quality and data integrity should be maintained by minimizing both systematic and random measurement errors. Before embarking on the actual analysis of data, it is important to perform filtering, preprocessing, and other kinds of transformations to remove the systematic biases that are present in microarray data. It is not uncommon for users to overlook the importance of such quality-control measures. Typical filtering operations include removing genes with background levels of expression from the data, as these would likely confound later transformations and cause spurious effects during fold-change calculations and significance analysis. This can be readily achieved by filtering on the basis of absence/presence calls and detection p value. Query devices are assigned to every field of data and allow the user to perform filtering with multiple selection criteria, resulting in updates of all visualizations to display the results of this cumulative filtering. Guides can be used to perform such repetitive tasks quickly or to initiate a series of specific steps in the analysis.
Throughout analysis, filtering using any data-field query device can be used to subset data and limit the number of genes that are included in further calculations and visualizations. Genes can be filtered on the basis of detection p value, Affymetrix signal, GenePix signal, GenePix signal-to-noise ratio, fold change, standard deviation, and modulation (frequency of crossing a threshold). For example, filtering genes on modulation by setting a 0.05 p value threshold will split genes out by the number of times they fall above the 0.05 limit in the selected experiments.
To preprocess Affymetrix text data
1a. Initiate a Spotfire session (UNIT 7.7) and upload Affymetrix text (.met) files as described in Alternate Protocol 1.
2a. Pay careful attention to the query devices as a default visualization is loaded. A query device appears for every column of data that is uploaded and can be used to manipulate the data visualization. In the Guides pane in the top-left corner, click on the link for Data Analysis, then on “Analyze Affymetrix absence/presence calls” (Fig. 7.8.5).
3a. This script allows one to choose Detection columns containing Absent (A), Marginal (M), and Present (P) calls. Click on all the detection columns to be considered from the display in the Guides pane, then click on Continue.
Figure 7.8.5 Guides pane with the Analyze Affymetrix absence/presence calls guide highlighted.
Figure 7.8.6 The data are binned on the basis of the number of times a particular probe set was called Absent, Present, or Marginal, and a histogram displays the results.
4a. The frequency of Absent, Present, and Marginal occurrences is then calculated across the selected experiments for each gene. It is possible to filter data using three new query devices: Absent Count, Present Count, and Marginal Count.
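The counting behind these three query devices can be sketched in a few lines. The detection calls below are hypothetical, standing in for one gene measured on eight chips:

```python
def call_counts(detection_calls):
    """Count Absent (A), Marginal (M), and Present (P) calls for one gene
    across the selected experiments, as in the Absent/Present/Marginal
    Count columns the guide creates."""
    return {c: detection_calls.count(c) for c in ("A", "M", "P")}

# Hypothetical gene measured on eight chips.
counts = call_counts(["P", "P", "A", "P", "M", "P", "P", "P"])
```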
A histogram may be created to view the distribution of Absent, Present, and Marginal counts using the Histogram Guide (Fig. 7.8.6). This display allows users to quickly identify those genes that are repeatedly called Absent. In the above example, there are eight metric text files (Fig. 7.8.6). The histogram displays all genes based on how many times they were binned into the P category in these eight experiments. The distribution ranges from 0-1, which identifies genes that are always or almost always Absent, to 7-8, which identifies genes that are almost always called Present.
Figure 7.8.7 The data generated from the use of the Affymetrix absence/presence guide are added to the Spotfire session as a new column, and a new corresponding query device is generated.
5a. Using the above histogram it is possible to exclude genes in one or more groups. Similar results can be obtained by sending “Absent call” results to different bins. When the histogram is displayed, associated data are linked to parent data in the Spotfire session and a new query device is created for this column of data (Fig. 7.8.7).
6a. By default, the query device is in the range-slider format. Right-click on the center of the query device and choose Check Boxes (Fig. 7.8.8).
7a. Uncheck the check box for category 0-1. Notice how the number of visible records on the activity line changes from 6352 to 4441, reflecting the 1911 genes that were filtered out using this method (Fig. 7.8.9). Records under the histogram category 0-1 pertain to those genes that were called Present either 0 or 1 time out of a total of 8 Affymetrix chips in this particular experiment. This indicates that these genes are not reliably detected under these conditions. Filtering out these genes allows further calculations and transformations to be performed on the rest of the data set without any effect from these genes.
8a.
Alternatively, data may be filtered based on criteria (detection p value or raw signal) other than Absence/Presence calls. To do so, click on Data Preparation in the Guides pane, followed by Filter Genes. Users can filter genes by “Standard deviation,” “Fold change,” or “Modulation.” To filter genes by “Standard deviation,” it is necessary to normalize data based on Z-score calculations (see Basic Protocol 3 and Background Information). Similarly, genes can only be filtered by “Fold change” when the appropriate normalization has been applied to the data (see Basic Protocol 3 and Background Information). Genes can also be filtered by modulation, or frequency of crossing a threshold. In a set of 12 .met files, for example, one can query how many times a certain gene has a detection p value greater than 0.05. This calculation can be carried out for every gene in the dataset, and groups of genes can be removed based on a particular frequency.
Figure 7.8.8 Query device for a particular column of data can be modified from one type to another.
Figure 7.8.9 Clearing the check box corresponding to “Binned Present count 0-1” alters the number of visible records (shown on the Activity Line).
9a. Choose Modulation. Next, choose all the p value columns to be considered from the display in the Guides pane. Hit Continue (Fig. 7.8.10).
10a. Select a modulation threshold. If interested in filtering out genes on the basis of a p value cutoff of 0.05, for example, type 0.05. Click on Filter by Modulation (Fig. 7.8.11).
Figure 7.8.10 The Filter Genes guide helps users to perform data preprocessing in a stepwise fashion.
Figure 7.8.11 The Filter Genes by Modulation guide bins data by the number of times a record (gene) crosses the specified threshold in the given experiments.
11a.
The frequency of p value occurrences above 0.05 across the selected experiments is displayed for each gene. It is possible to filter data using the new query device or from the histogram or trellis display (Fig. 7.8.12). Similar filtering may be performed on raw signal data.
Figure 7.8.12 A new data column and a new query device are added to the Spotfire session, based on the Filter Genes>Modulation>p-value selection.
To preprocess spotted array (GenePix) data
1b. Initiate a Spotfire session (UNIT 7.7) and upload appropriate columns from GenePix (.gpr) files as described in Basic Protocol 1. It is useful to retrieve data from the raw signal columns and background-corrected signal columns. In addition, GenePix data contain indicators of data quality in Signal to Noise Ratio columns for every channel and a Flags column for every slide. It is useful to retrieve these data. In the example below, six cDNA microarray experiments (12 channels of signal data) are uploaded to Spotfire.
2b. In the Guides pane in the top-left corner, click on the link for Data Preparation and then on Filter Genes (Fig. 7.8.13).
3b. Click on Modulation. Filtering can be performed to remove bad data from GenePix files using data contained in the Flags columns and/or the Signal to Noise Ratio column. Choose Flags columns for any number of arrays to be mined, then hit Continue (Fig. 7.8.14). GenePix software provides the ability to flag individual features with quality indicators such as Good, Bad, Absent, or Not Found. In the text data file, these indicators are converted to numeric data. Features with a Bad flag are designated −100, Good features are flagged as +100, Absent features are flagged as −75, and Not Found features as −50. All other genes are designated as 0 in the .gpr file.
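The modulation filter simply counts, for each gene, how many experiments cross a chosen threshold. A minimal sketch in Python, shown with the numeric flag codes described above; the function name and example values are illustrative, not Spotfire's API:

```python
def modulation_count(values, threshold, above=True):
    """Number of experiments in which a gene's value crosses the threshold,
    e.g. p value > 0.05 (above=True) or Flags < 0 (above=False)."""
    if above:
        return sum(1 for v in values if v > threshold)
    return sum(1 for v in values if v < threshold)

# Hypothetical flags for one gene across six arrays: +100 good, -100 bad,
# -75 absent, -50 not found, 0 otherwise.
flags = [100, -100, 0, -75, 100, -50]
bad_count = modulation_count(flags, 0, above=False)  # negative-flag occurrences
```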
By modulating data on the Flags column at a setting of 0, it is possible to identify those genes that are consistently good or bad.
4b. The frequency of the various flagged occurrences is then calculated across the selected experiments for each gene. It is possible to filter the data using query devices for the newly generated columns. A histogram can be created using the Histogram Guide to better view the distribution. When the histogram is created, associated data are linked to parent data in the Spotfire session and a new query device is created for this column of data (Fig. 7.8.15). This display allows users to quickly identify those genes that are repeatedly flagged. In the above example, there are six GenePix files. The histogram displays all genes based on how many times they were binned into the Flag category from six columns of data. The distribution ranges from 0, which identifies genes that are never flagged Bad, Not Found, or Absent (hence the good genes), to 6, which identifies genes that are most frequently flagged and need to be filtered out of the data set.
Figure 7.8.13 Clicking on the Filter Genes guide allows users to perform preprocessing on GenePix data.
Figure 7.8.14 Preprocessing can be performed on GenePix data using the Flags or the SNR columns.
5b. Using the above histogram it is possible to exclude genes in one or more groups. By default, the query device is in the range-slider format. Right-click on the center of the query device and choose Check Boxes. By filtering out the “flagged 6 times” group, 2937 genes are filtered out (Fig. 7.8.16).
6b. Users may also filter GenePix data based on criteria other than Flag, such as Signal to Noise Ratio (SNR), raw signal, or background pixel saturation levels. Click on
Click on Data Preparation in the Guides pane, followed by Filter Genes. Users can filter genes by "Standard deviation," "Fold change," or "Modulation." In order to filter genes by "Standard deviation," it is necessary to normalize the data based on Z-score calculations (see Basic Protocol 3 and Background Information). Similarly, genes can only be filtered by "Fold change" when the appropriate normalization has been applied to the data (see Basic Protocol 3 and Background Information). Genes can also be filtered by modulation, i.e., the frequency of crossing a threshold. In a set of 12 GenePix files, for example, one can ask how many times a certain gene has an SNR value greater than 1.5. This calculation can be carried out for every gene in the dataset, and groups of genes can be removed based on a particular frequency.

Figure 7.8.15 A new data column and a new query device are added to the Spotfire session, based on the Filter Genes>Modulation>Flags selection.

Figure 7.8.16 Clearing the check box corresponding to the Modulation by Flags column (category 6) alters the number of visible records (shown on the Activity Line).

LOG TRANSFORMATION OF MICROARRAY DATA

The logarithmic (henceforth referred to as log) function has been used to preprocess microarray data from the very beginning (Yang et al., 2002). Raw intensity values in microarray experiments span a very large interval, from zero to tens of thousands, yet only a small fraction of genes have values that high. This generates a long tail in the distribution curve, making it asymmetrical and non-normal. Log transformation provides values that are easily interpretable and more meaningful from a biological standpoint.
The log transformation defines directionality and fold change, whereas raw signal numbers only demonstrate relative expression levels. The log transformation also makes the distribution of values symmetrical and almost normal, by removing the skew caused by the long tail of high-intensity values.

SUPPORT PROTOCOL 2

1. Open an instance of Spotfire (UNIT 7.7). Upload (see Basic Protocol 1 or Alternate Protocol 1) and prefilter (see Support Protocol 1) microarray data.

2. In the Guides pane of the DecisionSite Navigator (see UNIT 7.7), click on Data preparation>Transform columns to log scale. A new window is opened within the Guides pane (Fig. 7.8.17).

3. Select the columns on which to perform the log transformation. These would typically be the signal columns in Affymetrix data, and the Cy-3 and Cy-5 signal data in the case of two-color arrays. Hold down the Ctrl key in order to select multiple columns. In order to select all the columns displayed in the guide, select the first column, hold down the Shift key, and then select the last column (Fig. 7.8.17).

Figure 7.8.17 The "Transform columns to log scale" guide allows the user to convert any numeric data column to its logarithm counterpart, allowing the user to choose log to base 2 or 10.

4. Click Continue. The user is now presented with the option of transforming to log base 10 or base 2.

5. Click on "log10" or "log2." Most microarray users have a preference for log2. New data columns are generated and added to the data set. Query devices for these newly generated columns are also added and can be used to manipulate visualizations. Log-transformed values for input values less than or equal to zero are not calculated and are left empty.

6. Load the Guides pane again by clicking on Back to Contents.
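The transformation performed by this guide amounts to taking the log (base 2 or 10) of each positive value and leaving non-positive values empty, as noted in step 5. A minimal Python sketch, with invented signal values:

```python
import math

def log2_transform(values):
    """Log2-transform a signal column; values <= 0 cannot be transformed
    and are left empty (None), as described in step 5 above."""
    return [math.log2(v) if v > 0 else None for v in values]

signals = [1024.0, 16.0, 0.0, -3.5]  # invented raw intensities
print(log2_transform(signals))  # -> [10.0, 4.0, None, None]
```

After this transformation a two-fold change in either direction is a symmetrical step of ±1, which is why most microarray users prefer log2.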
BASIC PROTOCOL 2

NORMALIZATION OF MICROARRAY DATA WITHIN AN EXPERIMENT

Experimental comparisons of expression are only valid if the data are corrected for systematic biases such as the technology used, the protocol used, and the investigator. Since these biases are regularly detected in raw microarray data, it is imperative that some sort of normalization procedure be used to address this issue (Smyth and Speed, 2003). At this time, however, there is no consensus on how to perform normalization. Several methods are available in the normalization module of Spotfire. These can broadly be divided into two categories: those that make experiments comparable (i.e., within experiments) and those that make genes comparable (i.e., between experiments; see Basic Protocol 3). "Normalize by mean," "Normalize by trimmed mean," "Normalize by percentile," "Scale between 0 and 1," and "Subtract the mean or median" are all examples of the former category, which is particularly relevant for spotted arrays but rarely needed for Affymetrix chips (see Background Information).

1. Open an instance of Spotfire (UNIT 7.7) and upload (see Basic Protocol 1) and prefilter (see Support Protocol 1) microarray data.

To normalize by mean (also see Background Information)

2a. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens (Fig. 7.8.18).

3a. Choose the "Normalize by mean" radio button and then click the Next> button. The normalization dialog box 2(2) opens (Fig. 7.8.19).

4a. Select the "Value columns" on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.

5a. Click a radio button to select whether to work with "All records" or "Selected records."

6a. Select a method from the "Replace empty values with" drop-down list.
"Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values with the average for the entire row; and "Row interpolation" sets missing values to the interpolated value between the two neighboring values in the row.

7a. Set one of the columns to be used as a baseline for the normalization by selecting from the "Baseline for rescaling" drop-down list. The control channel in a two-color experiment or the control GeneChip in an Affymetrix experiment are obvious examples. Select None if no baseline is needed.

Figure 7.8.18 The Normalization dialog 1(2) allows the user to choose from several normalization options.

Figure 7.8.19 The Normalization dialog 2(2) allows the user to choose the value columns on which to perform normalization, as well as other variables.

8a. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

9a. Click a radio button to specify whether to calculate the mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate the mean. Click OK.

10a. Click Finish. Normalized columns are computed and added to the data set.

To normalize by trimmed mean (also see Background Information)

2b. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.

3b. Choose the "Normalize by trimmed mean" radio button and then click the Next> button. The normalization dialog box 2(2) opens.

4b. Select the "Value columns" on which to perform the operation.
For multiple selections, hold down the Ctrl key and click on the desired columns.

5b. Click a radio button to select whether to work with "All records" or "Selected records."

6b. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values with the average for the entire row; and "Row interpolation" sets missing values to the interpolated value between the two neighboring values in the row.

7b. Set one of the columns to be used as a baseline for the normalization by selecting from the "Baseline for rescaling" drop-down list. The control channel in a two-color experiment or the control GeneChip in an Affymetrix experiment are obvious examples. Select None if no baseline is needed.

8b. Enter a "Trim value." If a trim value of 10% is entered, the highest 5% and the lowest 5% of the values are excluded when calculating the mean.

9b. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

10b. Click a radio button to specify whether to calculate the mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate the mean. Click OK.

11b. Click Finish. Normalized columns are computed and added to the data set.

To normalize by percentile (also see Background Information)

2c. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.

3c. Choose "Normalize by percentile value" and then click the Next> button. The normalization dialog box 2(2) opens.

4c. Select the "Value columns" on which to perform the operation.
For multiple selections, hold down the Ctrl key and click on the desired columns.

5c. Click a radio button to select whether to work with "All records" or "Selected records."

6c. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values with the average for the entire row; and "Row interpolation" sets missing values to the interpolated value between the two neighboring values in the row.

7c. Select one of the columns to be used as a baseline for the normalization by selecting from the "Baseline for rescaling" drop-down list. The control channel in a two-color experiment or the control GeneChip in an Affymetrix experiment are obvious examples. Select None if no baseline is needed.

8c. Enter a percentile. For example, the "85-percentile" is the value that 85% of all values in the data set are less than or equal to.

9c. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

10c. Click a radio button to specify whether to calculate the mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate the mean. Click OK.

11c. Click Finish. Normalized columns are computed and added to the data set.

To scale between 0 and 1 (also see Background Information)

2d. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.

3d. Choose "Scale between 0 and 1" and then click the Next> button. The normalization dialog box 2(2) opens.

4d. Select the "Value columns" on which to perform the operation.
For multiple selections, hold down the Ctrl key and click on the desired columns.

5d. Click a radio button to select whether to work with "All records" or "Selected records."

6d. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values with the average for the entire row; and "Row interpolation" sets missing values to the interpolated value between the two neighboring values in the row.

7d. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

8d. Click a radio button to specify whether to calculate the mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate the mean. Click OK.

9d. Click Finish. Normalized columns are computed and added to the data set.

To subtract the mean (also see Background Information)

2e. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.

3e. Choose "Subtract the mean" and then click the Next> button. The normalization dialog box 2(2) opens.

4e. Select the "Value columns" on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.

5e. Click a radio button to select whether to work with "All records" or "Selected records."

6e. Select a method from the "Replace empty values with" drop-down list.
"Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values with the average for the entire row; and "Row interpolation" sets missing values to the interpolated value between the two neighboring values in the row.

7e. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

8e. Click a radio button to specify whether to calculate the mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate the mean. Click OK.

9e. Click Finish. Normalized columns are computed and added to the data set.

To subtract the median (also see Background Information)

2f. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.

3f. Choose "Subtract the median" and then click the Next> button. The normalization dialog box 2(2) opens.

4f. Select the "Value columns" on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.

5f. Click a radio button to select whether to work with "All records" or "Selected records."

6f. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values with the average for the entire row; and "Row interpolation" sets missing values to the interpolated value between the two neighboring values in the row.

7f. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

8f.
Click a radio button to specify whether to calculate the mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate the mean. Click OK.

9f. Click Finish. Normalized columns are computed and added to the data set.

BASIC PROTOCOL 3

NORMALIZATION OF MICROARRAY DATA BETWEEN EXPERIMENTS

Experimental comparisons of expression are valid only if the data are corrected for systematic biases such as the technology used, the protocol used, and the investigator. Since these biases are regularly detected in raw microarray data, it is imperative that some sort of normalization procedure be used to address this issue (Smyth and Speed, 2003). At this time, however, there is no consensus on how to perform normalization. Fold change as signed ratio, fold change as log ratio, fold change as log ratio in standard deviation units, and Z-score calculation are all examples of between-experiments normalization that are equally applicable to both spotted and Affymetrix array platforms.

1. Open an instance of Spotfire (UNIT 7.7) and upload (see Basic Protocol 1 or Alternate Protocol 1) and prefilter (see Support Protocol 1) microarray data.

To normalize by calculating fold change (as signed ratio, log ratio, or log ratio in standard deviation units; also see Background Information)

2a. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.

3a. Select a radio button for "Fold change as signed ratio," "Fold change as log ratio," or "Fold change as log ratio in Standard Deviation units." Click Next. The normalization dialog box 2(2) opens.

4a. Select the "Value columns" on which to perform the operation.

5a. Click a radio button to select whether to work with "All records" or "Selected records."

6a.
Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values with the average for the entire row; and "Row interpolation" sets missing values to the interpolated value between the two neighboring values in the row.

7a. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

8a. Click a radio button to specify whether to calculate the mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate the mean. Click OK.

9a. Click Finish. Normalized columns are computed and added to the data set.

For Z-score calculation (also see Background Information)

2b. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.

3b. Click Z-score Normalization and then click the Next> button. The normalization dialog box 2(2) opens.

4b. Select the "Value columns" on which to perform the operation.

5b. Click a radio button to select whether to work with "All records" or "Selected records."

6b. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values with the average for the entire row; and "Row interpolation" sets missing values to the interpolated value between the two neighboring values in the row.

7b. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

8b.
Select the "Add mean column" check box if it is desirable to add a column with the mean for each gene.

9b. Select the "Add standard deviation" check box if it is desirable to add a column with the standard deviation for each gene.

10b. Select the "Add coefficient of variation" check box if it is desirable to add a column with the coefficient of variation for each gene.

11b. Click a radio button to select whether to calculate the Z-scores from "All genes" or "Genes from Portfolio." Selecting the latter option opens a portfolio dialog box where one can choose a number of records or lists from which to calculate the Z-score. Choose a list and go back to the Normalization dialog.

12b. Click Finish. Columns containing normalized data are added to the data set.

BASIC PROTOCOL 4

ROW SUMMARIZATION

The row summarization tool allows users to combine values from multiple columns (experiments) into a single column. Measures such as averages, standard deviations, and coefficients of variation of groups of columns can be calculated. Since microarray experiments are typically performed with multiple replicates, this tool serves to summarize those experiments and determine the extent of variability.

1. Open an instance of Spotfire (UNIT 7.7) and upload (see Basic Protocol 1 or Alternate Protocol 1) and prefilter (see Support Protocol 1) microarray data.

2. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data Preparation>Row summarization (Fig. 7.8.20). The "Row summarization" dialog box (Fig. 7.8.21) is displayed.

3. Create the appropriate number of groups using the New Groups tool. Move the desired value columns to suitable groups in the "Grouped value columns" list. To determine the average per row of n columns, create a new group in the "Grouped value columns" list, and then select it. Click to select all of the n columns in the value columns list and then click the Add button. In this manner, several groups can be summarized simultaneously.
At least two value columns must be present in any group in the "Grouped value columns" list for this tool to work. Clicking on "Delete group" deletes the selected group, and its contents (value columns) are transferred to the bottom of the "Value columns" list (Fig. 7.8.21).

4. Select a group and click on Rename Group to edit the group name. This is important because the default column names are the names of the original columns followed by the chosen comparison measure in parentheses. When dealing with a number of experiments, this sort of nomenclature can be problematic; it is therefore advisable to choose meaningful group names at this stage.

5. Click a radio button to select whether to work with "All records" or "Selected records."

Figure 7.8.20 The Row Summarization tool is displayed.

Figure 7.8.21 The Row Summarization dialog allows the user to choose the value columns on which to perform the summarization, as well as other variables such as which measure (e.g., Average, Standard Deviation) to use.

6. Select a method from the "Replace empty values" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values with the average for the entire row; and "Row interpolation" sets missing values to the interpolated value between the two neighboring values in the row.

7. Select a "Summarization measure" (e.g., average, standard deviation, variance, min, max, median) from the list box and click on OK.

8. Results are added to the dataset and new query devices are created.

COMMENTARY

Background Information

Normalization methods

Normalize by mean.
The mean intensity of one variable (in two-color arrays) is adjusted so that it equals the mean intensity of the control variable (log R − log G = 0, where R and G are the summed intensities of each variable). This can be achieved in two ways: rescaling the experimental intensity to a baseline control intensity that remains constant, or rescaling without designating a baseline so that the intensity levels in both channels are mutually adjusted.

Normalize by trimmed mean. This method works in a manner essentially similar to normalization by mean, except that the trimmed mean for a variable is based on all values except a certain percentage of the lowest and the highest values of that variable. This reduces the influence of outliers during normalization. Setting the trim value to 10%, for example, excludes the top 5% and the bottom 5% of values from the calculation. Once again, the normalization can be performed with or without a baseline.

Normalize by percentile. The X-percentile is the value in a data set that X% of the data are lower than or equal to. One common way to control for systematic bias in microarrays is normalizing to the distribution of all genes, i.e., normalizing by percentile value. The signal strength of all genes in sample X is therefore normalized to a specified percentile of all of the measurements taken in sample X. If the chosen percentile value is very high (∼85-percentile), the corresponding data point lies sufficiently far from the origin that a good line can be drawn through all the points. The slopes of the line for each variable are then used to rescale each variable. One caveat of this sort of normalization is that it assumes that the median signal of the genes on the chip stays relatively constant throughout the experiment.
If the total number of expressed genes in the experiment changes dramatically due to true biological activity (causing the median of one chip to be much higher than another), then the true expression values will be masked by normalizing to the median of each chip. For such an experiment, it may be desirable to normalize to something other than the median, or instead to normalize to positive controls.

Scale between 0 and 1. If the intent of a microarray experiment is to study the data using clustering, the user may need to put different genes on a single scale of variation. Normalizations that accomplish this include scaling between 0 and 1: gene expression values are scaled such that the smallest value for each gene becomes 0 and the largest value becomes 1. This method is also known as min-max normalization.

Subtract the mean. This method is generally used in the context of log-transformed data. It replaces each value by [value − mean(expression values of the gene across hybridizations)]. Mean and median centering are useful transformations because they reduce the effect of highly expressed genes on a dataset, thereby allowing the researcher to detect interesting effects in weakly expressed genes.

Subtract the median. This method is also generally used in the context of log-transformed data and has a similar effect to mean centering, but is more robust and less susceptible to the effect of outliers. It replaces each value by [value − median(expression values of the gene across hybridizations)].

Fold change as signed ratio. This is essentially similar to normalization by mean. A fold change is computed for a gene under two different conditions (or chips). If there are n genes and five variables A, B, C, D, and E, assuming that variable A is considered the baseline, the normalized value ei for the variable E in the ith gene is calculated as Norm ei = ei/ai, where ai is the value of variable A in the ith gene.
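The signed-ratio formula above can be illustrated in a few lines of Python. The data are invented; the negative-reciprocal option (reporting a ratio of 0.5 as −2) is a common reporting convention added here as an assumption, not part of the formula in the text.

```python
def fold_change(values, baseline, signed_reciprocal=False):
    """Norm e_i = e_i / a_i, with variable A as the baseline.

    If signed_reciprocal is True, ratios below 1 are reported as the
    negative reciprocal (an assumed convention, not from the text).
    """
    out = []
    for e, a in zip(values, baseline):
        r = e / a
        if signed_reciprocal and r < 1:
            r = -1.0 / r
        out.append(r)
    return out

e_values = [200.0, 50.0, 100.0]   # variable E for three genes (invented)
a_values = [100.0, 100.0, 100.0]  # baseline variable A
print(fold_change(e_values, a_values))  # -> [2.0, 0.5, 1.0]
```

The log-ratio variants discussed in this section differ only in applying a log (and, optionally, a standard-deviation rescaling) to the same per-gene ratio.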
Fold change as log ratio. If there are n genes and five variables (A, B, C, D, and E), assuming that variable A is considered the baseline, the normalized value ei for the variable E in the ith gene is calculated as Norm ei = log(ei/ai), where ai is the value of variable A in the ith gene.

Fold change as log ratio in standard deviation units. If there are n genes and five variables (A, B, C, D, and E), assuming that variable A is considered the baseline, the normalized value ei for the variable E in the ith gene is calculated as Norm ei = 1/Std(x) · log(ei/ai), where Std(x) is the standard deviation of a matrix of log ratios of all signal values for the corresponding record.

Z-score calculation. The Z-score provides a way of standardizing data across a wide range of experiments and allows the comparison of microarray data independently of the original hybridization intensities. This normalization is also typically performed in log space. Each gene is normalized by subtracting the mean (or the median) of its expression levels across all experiments from each given expression level, and then dividing by the standard deviation. This weights the expression levels in favor of records with lower variance.

Literature Cited

Cheok, M.H., Yang, W., Pui, C.H., Downing, J.R., Cheng, C., Naeve, C.W., Relling, M.V., and Evans, W.E. 2003. Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nat. Genet. 34:85-90.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286:531-537.

Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C., Trent, J.M., Staudt, L.M., Hudson, J. Jr., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P.O. 1999. The transcriptional program in the response of human fibroblasts to serum. Science 283:83-87.

Jolliffe, I.T. 1986. Principal Component Analysis. Springer Series in Statistics. Springer-Verlag, New York.

Kerr, M.K. and Churchill, G.A. 2001. Experimental design for gene expression microarrays. Biostatistics 2:183-201.

Kozal, M.J., Shah, N., Shen, N., Yang, R., Fucini, R., Merigan, T.C., Richman, D.D., Morris, D., Hubbell, E., Chee, M., and Gingeras, T.R. 1996. Extensive polymorphisms observed in HIV-1 clade B protease gene using high-density oligonucleotide arrays. Nat. Med. 2:753-759.

Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., Zeitlinger, J., Jennings, E.G., Murray, H.L., Gordon, D.B., Ren, B., Wyrick, J.J., Tagne, J.B., Volkert, T.L., Fraenkel, E., Gifford, D.K., and Young, R.A. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799-804.

Leung, Y.F. and Cavalieri, D. 2003. Fundamentals of cDNA microarray data analysis. Trends Genet. 19:649-659.

MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics and Probability 1967:281-297.

Sankoff, D. and Kruskal, J.B. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass.

Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467-470.

Schena, M., Heller, R.A., Theriault, T.P., Konrad, K., Lachenmeier, E., and Davis, R.W. 1998. Microarrays: Biotechnology's discovery platform for functional genomics. Trends Biotechnol. 16:301-306.

Smyth, G.K. and Speed, T. 2003. Normalization of cDNA microarray data. Methods 31:265-273.

Smyth, G.K., Yang, Y.H., and Speed, T. 2003. Statistical issues in cDNA microarray data analysis. Methods Mol. Biol. 224:111-136.

Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22:281-285.

Yang, Y., Buckley, M.J., Dudoit, S., and Speed, T.R. 2002. Comparison of methods for image analysis on cDNA microarray data. J. Comp. Stat. 11:108-136.

Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., Cheng, C., Campana, D., Wilkins, D., Zhou, X., Li, J., Liu, H., Pui, C.H., Evans, W.E., Naeve, C., Wong, L., and Downing, J.R. 2002. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1:133-143.

Contributed by Deepak Kaushal and Clayton W. Naeve
Hartwell Center for Bioinformatics and Biotechnology
St. Jude Children's Research Hospital
Memphis, Tennessee

UNIT 7.9

Analyzing and Visualizing Expression Data with Spotfire

Spotfire DecisionSite (http://hc-spotfire.stjude.org/spotfire/support/manuals/manuals.jsp) is a powerful data mining and visualization program with applications in many disciplines. Modules are available in support of gene expression analysis, proteomics, general statistical analysis, chemical lead discovery analysis, geology, and other fields. Here the focus is on Spotfire's utility in analyzing gene expression data obtained from DNA microarray experiments.
Other units in this manual present a general overview of the Spotfire environment along with the hardware and software requirements for installing it (UNIT 7.7), and describe how to load data into Spotfire for analysis (UNIT 7.8). This unit presents numerous methods for analyzing microarray data. Specifically, Basic Protocol 1 and Alternate Protocol 1 describe two methods for identifying differentially expressed genes. Basic Protocol 2 discusses how to conduct a profile search. Additional protocols illustrate various clustering methods, such as hierarchical clustering (see Basic Protocol 4 and Alternate Protocol 2), K-means clustering (see Basic Protocol 5), and Principal Components Analysis (see Basic Protocol 6). A protocol explaining coincidence testing (see Basic Protocol 3) allows the reader to compare the results from multiple clustering methods. Additional protocols demonstrate querying the Internet for information based on the microarray data (see Basic Protocol 7), mathematically transforming data within Spotfire to generate new data columns (see Basic Protocol 8), and exporting final Spotfire visualizations (see Basic Protocol 9).

Spotfire (Functional Genomics module) can import data in nearly any format, but the authors have focused here on two popular microarray platforms: commercial GeneChip microarray data (Affymetrix) and two-color spotted microarray data produced using GenePix software (Axon). Spotfire facilitates the seamless import of Affymetrix output files (.met) from Affymetrix MAS v4.0 or v5.0 software. The .met file is a tab-delimited text file containing attributes such as probe set identity, gene expression level (signal), detection quality controls (detection p-value and Absent/Present calls), and so forth. In the illustration below, the authors use MAS 5.0 .met files as an example.
Several types of spotted arrays and their corresponding data types exist: commercial vendors (e.g., Agilent, Motorola, and Mergen) supply spotted microarrays for various organisms, and many facilities manufacture their own chips. Several different scanners and scanning software packages are available. One of the more commonly used scanners is the Axon GenePix. GenePix data files are in a tab-delimited text format (.gpr), which can be directly imported into a Spotfire session.

NOTE: This unit assumes the reader is familiar with the Spotfire environment, has successfully installed Spotfire, and has uploaded and prepared data for analysis. For further information regarding these tasks, please see UNITS 7.7 & 7.8.

BASIC PROTOCOL 1
IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES USING t-TEST/ANOVA

The treatment comparison tool provides methods for distinguishing between different treatments for an individual record. There are two types of treatment comparison algorithms: t-test/ANOVA (Kerr and Churchill, 2001) and Multiple Distinction (Eisen et al., 1998). Both algorithms seek to identify differentially expressed genes based on their expression values.

Contributed by Deepak Kaushal and Clayton W. Naeve
Current Protocols in Bioinformatics (2004) 7.9.1-7.9.43
Copyright © 2004 by John Wiley & Sons, Inc.

The t-test is a commonly used method to evaluate the differences between the means of two groups by verifying that observed differences between them are statistically significant. Analysis of variance (ANOVA) works along the same principle but can be used to differentiate between more than two groups. ANOVA calculates the variance within each group and compares it to the variance between the groups. The null hypothesis assumes that the mean expression levels of a gene are not different between the groups. The null hypothesis is then either rejected or accepted for each gene in consideration.
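For readers who want to see what such a per-gene test computes, the following is a minimal sketch outside of Spotfire. The gene values and group sizes are invented for illustration; this is not Spotfire's internal code.

```python
# Illustrative sketch only: the statistics behind a two-group treatment
# comparison for a single gene (hypothetical log2 expression values).
from statistics import mean, variance

def two_sample_t(a, b):
    """Student's t statistic with pooled variance (equal-variance t-test)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5

def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical log2 expression values for one gene in two replicate groups:
control = [7.1, 7.4, 7.2, 7.3]
treated = [8.9, 9.2, 9.0, 9.1]
t = two_sample_t(control, treated)
f = one_way_anova_f([control, treated])
# With exactly two groups, one-way ANOVA reduces to the t-test: F equals t squared.
```

The p-value that Spotfire reports is obtained by referring such a statistic to the t or F distribution; with more than two groups, only the F statistic applies.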
The results are expressed in terms of a p-value, which is the observed significance level, i.e., the probability of a type I error: concluding that a difference exists in the mean expression values of a given gene when in fact there is no difference. If the p-value is below a certain threshold, usually 0.05, the difference is considered significant. The lower the p-value, the stronger the evidence for a difference.

The ANOVA algorithm in Spotfire has a one-way layout; therefore, it can only be used to discriminate between groups based on one variable. Further, this algorithm assumes the following: (1) the data are normally distributed and (2) the variances of the separate groups are similar. Violating these assumptions will lead to erroneous results. One way to bring the data closer to a normal distribution is to log transform the data (UNIT 7.8).

Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed), or Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256-color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC: Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000; Microsoft Internet Explorer 5.0 through 6.0; Spotfire 6.2 or above; Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12); Web connection to the Spotfire server (http://home.spotfire.net) or a local customer-specific Spotfire server; Microsoft PowerPoint, Word, and Excel (optional, for Spotfire features related to export of text results or visualizations)
Macintosh: Operating system (OS) 7.5.3 or later; Citrix ICA client (http://www.citrix.com); Open Transport TCP/IP Version 1.1.1 (or later)

Files
Data files (e.g., .met files, .gpr files)

1. Click Analysis, followed by Pattern Detection, followed by Treatment Comparison in the Tools pane of DecisionSite Navigator (Fig. 7.9.1).

Figure 7.9.1 The Treatment Comparison tool is shown.

The Treatment Comparison dialog box is displayed and all available columns are listed in the Value Columns field. (Note that if the tool has been used before, it retains the earlier grouping and the user will have to delete it.) Value Columns are the original data columns that have been uploaded into the Spotfire session. Any data column can be used as a value column as long as it includes integers or real numbers.

2. Use the following procedure to move and organize the desired value columns into the Grouped Value Columns field, which displays columns that the user has defined as being part of a group (e.g., replicate microarrays) on which the calculation is to be performed. Note that at least two columns should be present in every group for the tool to be able to perform its calculations.

a. Select the desired column and click the Add button. The column will end up in the selected group of the Grouped Value Columns field.
b. Click New Group to add a group or Delete Group to remove a group. If the deleted group contained any value columns, they are moved back to the Value Columns field (Fig. 7.9.2).
c. Click Rename Group to open the Edit Group Name dialog box, which can be used to rename a group. It is useful to rename the groups to something meaningful because the default names are Group1, Group2, and so on.

3. From the same dialog box, choose whether All Records or Selected Records are to be used. Choosing All Records causes all records that were initially uploaded into Spotfire to be used for the calculations.
If any preprocessing or filtering steps have been performed and the user would like to exclude the filtered-out records from the calculations, the user should choose Selected Records.

Figure 7.9.2 The Treatment Comparison dialog box allows the user to organize various Value Columns into groups on which the t-test/ANOVA is to be performed.

Figure 7.9.3 A profile chart is generated to display the results of the t-test/ANOVA analysis. The "t-test/ANOVA Query Device" (a range slider) can be manipulated to identify highly significant genes. The profile chart is colored in the Continuous Coloring mode based on the t-test/ANOVA p-values.

4. If there are empty values in the data, select a method to replace empty values from the following choices in the drop-down list:

Choice                  Replaces empty values with
Constant Numeric Value  Specified value
Row Average             Average of all the values in the row
Row Interpolation       Interpolated value of the two neighboring values

5. Select "t-test/ANOVA" from the Comparison Measure list box.

6. Type a new identifier in the Column Name text box or use the default. Check the Overwrite box to replace the values of a previously named column. If the user wishes not to overwrite, make sure that the Overwrite check box is unchecked.

7. Click OK.

This will add a new column containing p-values to the data set and create a new Profile Chart visualization. The profiles are ordered with the lowest p-values first (Fig. 7.9.3).

ALTERNATE PROTOCOL 1
IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES USING DISTINCTION CALCULATION

The distinction calculation algorithm (Eisen et al., 1998) is slightly different from t-test/ANOVA (see Basic Protocol 1). It is a measure of how distinct the expression level is between two parts of a profile.
The Distinction Calculation algorithm divides the variables (columns) within a row into two groups, based on factors such as tissue or tumor type. A distinction value is then calculated for each row based on the two groups of values. The distinction value is a measure of how distinct the difference in expression level is between the two parts of the row (e.g., tumor cells versus normal cells). The profiles can be compared to an idealized pattern to identify genes closely matching that pattern; one such idealized pattern could be a gene whose expression level is uniformly high for one group of experiments and uniformly low for the other. Profiles that match this ideal pattern closely (i.e., those that have high expression values in the first group of experiments and low expression values in the second) are given high positive distinction values. Likewise, profiles that have low expression values in the first group and high expression values in the second are given large negative distinction values.
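The sign behavior described above can be made concrete with a small sketch. The exact formula Spotfire uses is not given in this unit; a closely related distinction-style measure is the signal-to-noise statistic of Golub et al. (1999), shown here with invented expression values.

```python
# Signal-to-noise statistic (Golub et al., 1999), a distinction-style score:
# positive when expression is high in group1 and low in group2, and negative
# in the opposite case. Data are invented; not Spotfire's exact formula.
from statistics import mean, stdev

def signal_to_noise(group1, group2):
    return (mean(group1) - mean(group2)) / (stdev(group1) + stdev(group2))

tumor = [9.1, 9.3, 9.0]    # uniformly high in the first group
normal = [6.2, 6.0, 6.4]   # uniformly low in the second group
score = signal_to_noise(tumor, normal)    # large positive score
flipped = signal_to_noise(normal, tumor)  # same magnitude, negative sign
```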
Necessary Resources
Hardware, software, and files: see Basic Protocol 1.

1. Click Analysis, followed by Pattern Detection, followed by Treatment Comparison in the Tools pane of the DecisionSite Navigator (Fig. 7.9.1).

The Treatment Comparison dialog box is displayed (Fig. 7.9.4) and all available columns are listed in the Value Columns field. (Note that if the tool has been used before, it retains the earlier grouping and the user will have to delete it.) Value Columns are the original data columns that have been uploaded into the Spotfire session. Any data column can be used as a value column provided it includes integers or real numbers.

2. Organize columns, choose records, and fill empty values as described (see Basic Protocol 1, steps 2 to 4).

3.
Select Distinction/Multiple Distinction from the Comparison Measure list box and click OK (Fig. 7.9.4).

Figure 7.9.4 The Treatment Comparison dialog box allows the user to organize various Value Columns into groups on which Multiple Distinction is to be performed.

This will add new columns containing distinction values to the data set, and a new profile visualization will be created. The profiles are ordered by the group with the lowest value (highest distinction).

4. Use these results to order a heat map based on the results of the Distinction/Multiple Distinction calculation for better visualization and identification of genes with different profiles in different samples (Fig. 7.9.5).

Figure 7.9.5 Results of Multiple Distinction are originally displayed in a profile chart. The user can, however, build a heat map based on these results. (A) A set of genes on the basis of which eight experiments can be distinctly identified using the Multiple Distinction algorithm. (B) A zoomed-in version of the same heat map.

A heat map is a false-color image of a data set (e.g., microarray data) that allows users to detect the presence of certain patterns in the data. Heat maps resemble a spreadsheet in which each row represents a gene present on the microarray and each column represents a microarray experiment. By coloring the heat map according to signal or log-ratio values, trends in the behavior of genes across experiments can be discerned.

BASIC PROTOCOL 2
IDENTIFICATION OF GENES SIMILAR TO A GIVEN PROFILE: THE PROFILE SEARCH

In a profile search, all profiles (i.e., all data points or rows) are ranked according to their similarity to a master profile. The similarity between each of the profiles and the master is calculated according to one of the available similarity measures.
Spotfire adds a new data column with a value for each individual profile (an index of similarity) and a rank column, which enables users to identify genes whose profiles are similar to the master profile. In order to use this algorithm successfully, the user must specify the following:

A gene to be used as a master profile. A profile search is always based on a master profile. Spotfire allows users to designate an existing, active profile as the master. Alternatively, a new master profile can be constructed by averaging several active profiles. It is possible to edit the designated master profile using the built-in editor function before embarking on the profile search (see Support Protocol 1).

A similarity measure to be used. Similarity measures express the similarity between profiles in numeric terms, thus enabling users to rank profiles according to their similarity. Available methods include Euclidean Distance, Correlation, Cosine Correlation, City-Block Distance, and Tanimoto (Sankoff and Kruskal, 1983).

Whether to include or exclude empty values from the calculation. If a profile contains a missing value and the user opts to exclude empty values, the calculated similarity between the profiles is based only on the remaining part of the profile.
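To illustrate how such a ranking works, the sketch below implements two of the similarity measures named above and ranks three invented profiles against a master. Spotfire performs this internally; the gene names and values here are hypothetical.

```python
# Rank invented profiles against a master profile, as the Profile Search tool
# does internally. Two of the listed similarity measures are sketched here.
from math import sqrt
from statistics import mean

def euclidean(p, q):
    """Euclidean distance: smaller means more similar."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def correlation(p, q):
    """Pearson correlation: +1 for identical shape, -1 for opposite shape."""
    mp, mq = mean(p), mean(q)
    num = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    den = sqrt(sum((a - mp) ** 2 for a in p) * sum((b - mq) ** 2 for b in q))
    return num / den

master = [1.0, 2.0, 3.0, 4.0]
profiles = {
    "geneA": [1.1, 2.1, 2.9, 4.2],  # tracks the master closely
    "geneB": [4.0, 3.0, 2.0, 1.0],  # mirror image of the master
    "geneC": [2.0, 2.0, 2.0, 2.1],  # nearly flat
}
# Similarity Rank by Euclidean distance (most similar profile first):
ranked = sorted(profiles, key=lambda g: euclidean(master, profiles[g]))
```

Note how the two measures can disagree about a profile like geneB: its Euclidean distance from the master is large, yet its correlation is exactly -1, which is why the choice of similarity measure matters.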
Necessary Resources
Hardware, software, and files: see Basic Protocol 1.

1. Activate the profile to be used as the master in the Profile Chart/Diagram view. Alternatively, mark a number of profiles on which to base the master profile.

2. If changing the master profile is desired, or to create a totally new profile, edit the master profile as described (see Support Protocol 1).

3. Click on Analysis, followed by Pattern Detection, followed by Profile Search in the Tools pane of the DecisionSite Navigator. A Profile Search dialog box will appear (Fig. 7.9.6).

Figure 7.9.6 The Profile Search dialog box allows users to choose the Value Columns to be used for this calculation as well as variables such as the Similarity Measure and Calculation Options.

4.
Select the Value Columns on which to perform the profile search. For multiple selections, hold down the Ctrl key while continuing to click the desired columns.

5. Click a radio button to choose to work with All Records or Selected Records (see Basic Protocol 1, step 3).

6. From the drop-down list, select a method to replace empty values (see Basic Protocol 1, step 4).

7. If both marked records and an active record exist, select whether to use the profile from the Active Record or the Average from Marked Records.

Only one record can be active at a time (by clicking on the record in any visualization). An active record appears with a black circle around it. Several or all of the records present can be marked by clicking and drawing around them in any visualization. Data corresponding to marked records can then be copied to the clipboard. See UNIT 7.7 for more information.

Following this selection, the selected profile is displayed in the profile editor along with its name. At this point, the profile and its name can be edited in any manner desired.

8. Select the Similarity Measure to be used. For a detailed description of similarity measures, see Sankoff and Kruskal (1983).

9. Type a Column Name for the resulting column or use the default. Check the Overwrite box if appropriate (see Basic Protocol 1, step 6).

10. Click OK.

This will cause the search to be performed and displayed in the editor, and the results to be added to the dataset as a new column. Additionally, a new scatter plot is created that displays rank versus similarity, and annotations containing information about the calculation settings are added to the visualization. At the end of the profile search, the selected profiles in the data are ranked according to their similarity to the selected master profile.

11. If desired, create a scatter plot of Similarity versus Similarity Rank.
In such a plot, the record that is most similar to the master profile will be displayed in the lower left corner of the visualization.

SUPPORT PROTOCOL 1
EDITING A MASTER PROFILE

Since the starting profile does not restrict the user in any fashion, one can modify existing values to create a master profile of one's choice.

Necessary Resources
Hardware, software, and files: see Basic Protocol 1.

Figure 7.9.7 The Profile Search: Edit dialog box allows users to edit an existing profile to create an imaginary profile upon which to base the search.

1. Activate the profile to be used for creating an edited master profile by simply clicking on the profile in the Profile Chart visualization.

2.
Click Analysis, followed by Pattern Detection, followed by Profile Search in the Tools pane of the DecisionSite Navigator. A Profile Search dialog box will appear.

3. Select the Value Columns on which to perform the profile search. For multiple selections, hold down the Ctrl key while continuing to click on the desired columns.

4. Click Edit. This will open the Profile Search: Edit dialog box (Fig. 7.9.7).

5. Click directly in the editor to activate the variable to be changed. Drag the value to obtain a suitable shape for the profile. Delete any undesirable value(s) using the Delete key on the keyboard. The new value will be instantaneously displayed in the editor.

6. Type a profile name in the text box or use the default name.

7. Click OK. This closes the editor and shows the edited profile in the Profile Search dialog box (Fig. 7.9.6).

8. If desired, revert to the original profile by clicking Use Profile From: Active Record. (The Edited radio button is selected by default.)

BASIC PROTOCOL 3
COINCIDENCE TESTING

This tool can be used to compare two columns and determine whether the apparent similarity between the two distributions is a coincidence or not. Essentially, the coincidence testing tool calculates the probability of getting an outcome as extreme as the observed one under the null hypothesis (Tavazoie et al., 1999). This tool is particularly useful in comparing the results of several different clustering methods (e.g., see Basic Protocols 4 and 5, and Alternate Protocol 2).
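As a hedged sketch of the idea: this unit does not spell out the exact statistic Spotfire computes, but a standard way to test whether the overlap between two cluster-assignment columns could be a coincidence (cf. Tavazoie et al., 1999) is a hypergeometric tail probability, illustrated below with invented cluster sizes.

```python
# Hypergeometric tail probability: given n records, a cluster of size a from
# one method and a cluster of size b from another, how likely is an overlap
# of k or more records by chance alone? (Illustrative; not necessarily the
# exact statistic implemented in Spotfire's Coincidence Testing tool.)
from math import comb

def hypergeom_tail(n, a, b, k):
    return sum(comb(a, i) * comb(n - a, b - i)
               for i in range(k, min(a, b) + 1)) / comb(n, b)

# 100 records; clusters of sizes 10 and 12 from two methods share 8 records.
p_value = hypergeom_tail(100, 10, 12, 8)  # tiny: unlikely to be a coincidence
```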
Necessary Resources
Hardware, software, and files: see Basic Protocol 1.

1. Click on Analysis, followed by Pattern Detection, followed by Coincidence Testing in the Tools pane of the DecisionSite Navigator. A dialog box will be displayed (Fig. 7.9.8).

Figure 7.9.8 The Coincidence Testing dialog box.

2. Select the First Category Column. For example, when comparing the results of two different clustering methods, select the first one here.

3. Select the Second Category Column.

4. Select whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

5. Type a Column Name for the resulting column or use the default.

6. Select the Overwrite check box to overwrite a previous column with the same name (see Basic Protocol 1, step 6).

7. Click OK.
A new results column containing p-values is added to the dataset. An annotation may also be added.

BASIC PROTOCOL 4
HIERARCHICAL CLUSTERING

Hierarchical clustering arranges objects in a hierarchy with a tree-like structure based on the similarity between the objects. The graphical representation of the resulting hierarchy is known as a dendrogram (Eisen et al., 1998). In Spotfire DecisionSite, the vertical axis of the dendrogram consists of the individual records and the horizontal axis represents the clustering level. The individual records in the clustered data set are represented by the right-most nodes in the row dendrogram. Each remaining node in the dendrogram represents a cluster of all records that lie below and to the right of it, making the left-most node in the dendrogram a cluster that contains all records.

Clustering is a very useful data reduction technique; however, it can easily be misapplied. Clustering results are highly affected by the choice of similarity measure and other input parameters. If possible, the user should replicate the clustering analysis using different methods.

The algorithm used in the Hierarchical Clustering tool is a hierarchical agglomerative method. This means that the cluster analysis begins with each record in a separate cluster, and in subsequent steps the two clusters that are most similar are combined into a new aggregate cluster. The number of clusters is thereby reduced by one in each iteration step. Eventually, all records are grouped into one large cluster.
Necessary Resources
Hardware, software, and files: see Basic Protocol 1.

Initiating hierarchical clustering in Spotfire DecisionSite

1. Click on Analysis, followed by Clustering, followed by Hierarchical Clustering in the Tools pane of the DecisionSite Navigator (Fig. 7.9.9). The Hierarchical Clustering dialog box is displayed.

2. Select the Value Columns on which to base the clustering. For multiple selections, hold down the Ctrl key and click on the desired columns, or click on one of the columns and drag to select (Fig. 7.9.10).

3. Select whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

4. Select a method to replace empty values from the drop-down list (see Basic Protocol 1, step 4).

5.
Select which Clustering Method to use for calculating the similarity between two clusters.

6. Select which Similarity Measure to use in the calculations (Sankoff and Kruskal, 1983).

Correlation measures are based on profile shape and are therefore better suited to complex microarray studies than measures such as Euclidean distance, which are based purely on numeric similarity.

7. Select which Ordering Function to use when displaying the results.

8. Use the default name or type a new column name in the text box. Check the Overwrite box if overwriting a previously added column with the same name (see Basic Protocol 1, step 6).

Figure 7.9.9 The Hierarchical Clustering algorithm can be accessed from the Tools as well as the Guides menu.

Figure 7.9.10 The Hierarchical Clustering dialog box allows users to specify the Value Columns to be included in the clustering calculation and various other calculation options, such as the Clustering Method and Similarity Measure.

Figure 7.9.11 Hierarchical clustering results are displayed as a (default red-green) heat map with an associated dendrogram.

9. Select the Calculate Column Dendrogram check box if creating a column dendrogram is desired. A column dendrogram arranges the most similar columns (experiments) next to each other.

10. Click OK. The Hierarchical Clustering dialog box will close and the clustering is initiated. The results are displayed according to the user's preferences in the dialog box (Fig. 7.9.11).

11. If desired, add a clustering column to the dataset in order to compare the clustering results with other methods (see Support Protocol 2).

Marking and activating nodes

12.
To mark a node in the row dendrogram to the left of the heat map, click just outside it, drag to enclose the node within the frame that appears, and then release. Alternatively, press Ctrl and click on the node to mark it. To mark more than one node, hold down the Ctrl key and click on all the nodes to be marked. To unmark nodes, click and drag an area outside the dendrogram.

When one or more nodes are marked, that part of the dendrogram is shaded in green. The corresponding parts are also marked in the heat map and the corresponding visualizations.

13. To activate a node, click it in the dendrogram.

A black ring appears around the node. Only one node can be active at a given time. This node remains active until another node is activated. It is possible to zoom in on the active node by selecting Zoom to Active from the Hierarchical Clustering menu.

Zooming in and resizing a dendrogram

14. Zoom to a subtree in the row dendrogram by using the visualization zoom bar or by right-clicking in the dendrogram and clicking Zoom to Active in the resulting pop-up menu. Alternatively, double-click on a node.

15. To go one step back, double-click on an area in the dendrogram not containing any part of a node. To return to the original zoom, click Reset Zoom.

16. If desired, adjust the space occupied by the dendrogram in the visualization by holding down the Ctrl key and using the left/right arrow keys on the keypad to slim or widen it.

Saving a dendrogram

NOTE: Dendrograms are not saved in the Spotfire data file (.sfs) but can be saved as .xml documents.

17. To save, select Save, followed by Row Dendrogram or Column Dendrogram, from the Hierarchical Clustering menu.

18. Type the file name and save the file as a .dnd file.

Opening a saved dendrogram

19.
Click on Analysis, followed by Clustering, followed by Hierarchical Clustering in the Tools pane of the DecisionSite Navigator to display the Hierarchical Clustering dialog box.

20. Click on Open to display the Dendrogram Import dialog box.

21. Click on the Browse button by the Row Dendrogram field to display an Open File dialog box.

SUPPORT PROTOCOL 2

ADDING A COLUMN FROM HIERARCHICAL CLUSTERING

The ordering column that is added to the dataset when hierarchical clustering is performed is used only to display the row dendrogram and connect it to the heat map. In order to compare the results of hierarchical clustering with those of another method, such as K-means clustering (see Basic Protocol 5), a clustering column should be added to the data.

Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixel resolution (the user may benefit from much higher RAM and significantly better processor speed), or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256-color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC: Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000; Microsoft Internet Explorer 5.0 through 6.0; Spotfire 6.2 or above; Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through 2.5 (2.50.4403.12); Web connection to the Spotfire server (http://home.spotfire.net) or a local customer-specific Spotfire server; Microsoft PowerPoint, Word, and Excel (optional, for Spotfire features related to export of text results or visualizations)
Macintosh: Operating system (OS) 7.5.3 or later; Citrix ICA client (http://www.citrix.com); Open Transport TCP/IP version 1.1.1 (or later)

Files
Data files (e.g., .met files, .gpr
files)

1. Perform hierarchical clustering on a dataset as described in Basic Protocol 4 and locate the row dendrogram, which can be found to the left of the heat map (Fig. 7.9.11).

2. If the cluster line is not visible, right-click and select View from the resulting pop-up menu, followed by Cluster Scale. The cluster line, the dotted red line in the row dendrogram, enables users to determine the number of clusters being selected.

3. Click on the red circle on the cluster slider above the dendrogram and drag it to control how many clusters should be included in the data column. Alternatively, use the left and right arrow keys on the keyboard to scroll through the different numbers of clusters.

Figure 7.9.12 The hierarchical clustering visualization allows users to zoom in and out of the heat map as well as the dendrogram. Individual clusters or groups of clusters can be marked and a data column added to the Spotfire session.

All clusters for the current position on the cluster slider are shown as red dots within the dendrogram. Positioning the red circle at its right-most position on the cluster slider yields one cluster for every record; positioning it at its left-most position places all records in a single cluster.

4. To retain a previously added cluster column, ensure that the Overwrite check box in the Hierarchical Clustering dialog is unchecked (see Basic Protocol 1, step 6).

5. Select Clustering, followed by Add New Clustering Column, from the Hierarchical Clustering menu. A column indicating the cluster to which each record belongs will be added to the dataset. Note that records not included in the row dendrogram will have empty values in the new clustering column (Fig. 7.9.12).
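Outside Spotfire, the same idea (cluster rows on a correlation-based distance, then cut the dendrogram into a flat "cluster column") can be sketched with SciPy. This is an illustrative sketch, not Spotfire's implementation, and the toy expression matrix is invented:

```python
# Sketch: hierarchical clustering plus a cluster column, analogous to
# Basic Protocol 4 and Support Protocol 2 (SciPy, not Spotfire itself).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# toy expression matrix: rows = genes, columns = experiments
data = np.array([[250.0, 283.0, 219.0],
                 [1937.0, 80.0, 1655.0],
                 [71.0, 84.0, 77.0],
                 [47358.0, 131.0, 39155.0]])

# correlation-based distance captures profile shape rather than magnitude
dist = pdist(data, metric="correlation")
tree = linkage(dist, method="average")

# "dragging the cluster slider" corresponds to choosing how many flat
# clusters to cut from the dendrogram; here we ask for two
cluster_column = fcluster(tree, t=2, criterion="maxclust")
print(cluster_column)  # one cluster label per gene (row)
```

Cutting with `criterion="maxclust"` mirrors the cluster slider: moving it right increases `t` toward one cluster per record, moving it left decreases `t` toward a single cluster.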
ALTERNATE PROTOCOL 2

HIERARCHICAL CLUSTERING ON KEYS

A structure key is a string that lists substructures (for example, the various descriptions in the gene ontology tree). Clustering on keys therefore means grouping genes that share a similar set of substructures. Clustering on keys is based solely on the values within the key column, which should contain comma-separated values for some, if not all, records in the dataset. This is a valuable tool for determining whether there is overlap between the expression data and gene ontology descriptions (UNIT 7.2).

Necessary Resources

Hardware, software, and data file requirements are identical to those listed in Support Protocol 2.

1. Click on Analysis, followed by Clustering, followed by Hierarchical Clustering on Keys in the Tools pane of the DecisionSite Navigator. The Hierarchical Clustering dialog box will be displayed.
2. Select the Key Columns on which to base clustering. The Key Column can be any string column in the data.

3. Select whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

4. Select a method to Replace Empty Values from the drop-down list (see Basic Protocol 1, step 4).

5. Select which Clustering Method to use for calculating the similarity between two clusters.

6. Select which Similarity Measure to use in the calculations (Sankoff and Kruskal, 1983).

7. Select which Ordering Function to use while displaying results.

8. Type a New Column Name or use the default in the text box. If desired, check the Overwrite check box to overwrite a previously added column with the same name (see Basic Protocol 1, step 6).

9. Select the Calculate Column Dendrogram check box to create a column dendrogram, if desired. A column dendrogram arranges the most similar columns (experiments) next to each other.

10. Click OK. The Hierarchical Clustering on Keys dialog box will close and clustering will be initiated. The results are displayed according to the user's preferences in the dialog box. A heat map and a row-dendrogram visualization are displayed and added to the dataset.

BASIC PROTOCOL 5

K-MEANS CLUSTERING

K-means clustering is a method for grouping objects into a predetermined number of clusters based on their similarity (MacQueen, 1967). It is a type of nonhierarchical clustering in which the user must specify the number of clusters into which the data will eventually be divided. K-means clustering is an iterative process in which: (1) the user specifies the number of clusters for the data set, (2) a centroid (the center point of each cluster) is chosen for each cluster by one of several methods, and (3) each record in the data set is assigned to the cluster whose centroid is closest to that record. The proximity of each record to a centroid is determined on the basis of a user-defined similarity measure.
The centroid for each cluster is then recomputed based on the current members of the cluster. These steps are repeated until a steady state is reached.

Necessary Resources

Hardware, software, and data file requirements are identical to those listed in Support Protocol 2.

Performing K-means clustering

1. To initiate K-means clustering, click on Analysis, followed by Clustering, followed by K-means Clustering in the Tools pane of the DecisionSite Navigator. The K-means Clustering dialog box will be displayed (Fig. 7.9.13).

Figure 7.9.13 The K-means Clustering Tool dialog box allows users to specify the number of desired clusters, the method of choice for initiating centroids, the similarity measure, and other variables.

2.
Select the Value Columns on which to perform the analysis. For multiple selections, hold down the Ctrl key and click on the desired columns, or click on one column and drag.

3. Click on the radio button to specify whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

4. Select a method to Replace Empty Values from the drop-down list (see Basic Protocol 1, step 4).

5. Enter the Maximum Number of Clusters. This is the number of clusters that the K-means tool will attempt to generate from the given data set. However, if empty clusters are generated, they will be discarded, and the number of clusters displayed may be less than that specified.

6. Select a Cluster Initialization method from the drop-down menu. The user must specify the number of clusters into which the data should be organized and a method for initializing the cluster centroids. The methods available for this purpose are Data Centroid Based Search, Evenly Spaced Profiles, Randomly Generated Profiles, Randomly Selected Profiles, and From Marked Records; they are summarized in Table 7.9.1.

7. Select a Similarity Measure to use from the drop-down menu. Several similarity measures are available to the K-means clustering tool. These measures express the similarity between different records as numbers, making it possible to rank the records according to their similarity. They include Euclidean distance, Correlation, Cosine Correlation, and City-Block distance (Sankoff and Kruskal, 1983).

Table 7.9.1 Cluster Initiation Methods

Data Centroid Based Search: An average of all profiles in the data set is chosen to be the first centroid. The similarity between the centroid and all members of the cluster is calculated using the defined similarity measure.
The profile least similar to the centroid is then assigned to be the centroid of the second cluster. The similarity between the second centroid and all remaining profiles is then calculated, and all profiles more similar to the second centroid than to the first are assigned to the second cluster. Of the remaining profiles, the least similar profile is then chosen to be the third centroid, and the process is repeated until the number of clusters specified by the user is reached.

Evenly Spaced Profiles: This method generates centroid profiles that are evenly distributed between the minimum and maximum values of each variable in the data set. The centroids are calculated as the average values of each interval between the minimum and maximum values.

Randomly Generated Profiles: Centroids are assigned random values based on the data set. Each value in a centroid is randomly selected between the maximum and minimum of the corresponding variable in the data set.

Randomly Selected Profiles: Randomly selected existing profiles from the data set (not derived values) are chosen as the centroids of the different clusters.

From Marked Records: Currently marked profiles (marked before initiating K-means clustering) are used as the centroids of the different clusters.

Figure 7.9.14 K-means clustering results are displayed as a group of profile charts. Each group is uniquely colored as specified by the check-box query device.

8. Type a new column name for the resulting column or use the default. Check the Overwrite check box to overwrite any previously existing column with the same name (see Basic Protocol 1, step 6).

9. Click OK. The K-means dialog box will close and clustering will be initiated.
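The iterative loop described above (assign each record to its nearest centroid, recompute centroids, repeat to steady state) can be sketched in a few lines of NumPy. This sketch uses Euclidean distance and the "Randomly Selected Profiles" initialization; the toy data are invented:

```python
# Minimal K-means sketch mirroring the loop described in Basic Protocol 5
# (not Spotfire's implementation).
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # "Randomly Selected Profiles": existing records serve as initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every record to every centroid
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)            # assign to nearest centroid
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j]    # keep empty clusters in place
                        for j in range(k)])
        if np.allclose(new, centroids):      # steady state reached
            break
        centroids = new
    return labels, centroids

# two well-separated groups of records
data = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]])
labels, centroids = kmeans(data, k=2)
print(labels)  # the two groups end up in different clusters
```

Swapping the distance computation for a correlation-based measure would correspond to choosing a different Similarity Measure in step 7.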
At the end of clustering, the results are added to the data set as new columns, and a graphical representation of the results can be visualized (Fig. 7.9.14).

BASIC PROTOCOL 6

PRINCIPAL COMPONENTS ANALYSIS

Principal components analysis (PCA) is a tool for reducing the dimensionality of complex data so that the data can be interpreted more easily, without significant loss of information (Jolliffe, 1986). Often, this reduction in dimensionality enables researchers to identify new, meaningful, underlying variables.

PCA is a mathematical procedure that converts high-dimensional data containing a number of (possibly) correlated variables into a new data set containing fewer uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. The new variables are linear combinations of the original variables, which makes it possible to ascribe meaning to what they represent. This tool works best with transposed data (see Support Protocol 3).
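The procedure just described can be sketched numerically: center the data, find the directions of greatest variance, and project onto the first few. This illustrative sketch uses the singular value decomposition (one standard way to compute PCA); the toy "short-wide" matrix is invented:

```python
# Minimal PCA sketch (via SVD), illustrating the idea in Basic Protocol 6.
import numpy as np

def pca(data, n_components=2):
    centered = data - data.mean(axis=0)          # remove per-variable means
    # rows of vt are the principal axes, ordered by variance captured
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T      # new, uncorrelated variables
    explained = (s ** 2) / (s ** 2).sum()        # fraction of variance per axis
    return scores, explained[:n_components]

# toy transposed ("short-wide") matrix: rows = experiments, columns = genes
data = np.array([[250.0, 1937.0, 71.0],
                 [283.0, 80.0, 84.0],
                 [219.0, 1655.0, 77.0]])
scores, explained = pca(data, n_components=2)
print(explained)  # the first component dominates for this toy matrix
```

The `explained` vector corresponds to what the optional HTML report summarizes: how much of the original variability each component retains.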
Necessary Resources

Hardware, software, and data file requirements are identical to those listed in Support Protocol 2.

Performing PCA

1. To initiate PCA, click on Analysis, followed by Clustering, followed by Principal Components Analysis in the Tools pane of the DecisionSite Navigator. The PCA dialog box will open (Fig. 7.9.15).

2. Select the Value Columns on which to perform PCA. For multiple selections, hold down the Ctrl key and click on the desired columns, or click on one column and drag.

3. Click on the radio button to specify whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

4. Select a method to Replace Empty Values from the drop-down list (see Basic Protocol 1, step 4).

5. Specify the number of Principal Components.
The number of Principal Components is the total number of dimensions into which the user wishes to reduce the original data.

Note that choosing a good number of groups is a related problem for K-means clustering (Basic Protocol 5): K-means is an iterative process and is most valuable when repeated several times with different numbers of clusters. There is no way to predict a good number of clusters for a given data set; a pattern that is obvious with 20 clusters, and which the user might expect to be better defined with 50, may not appear at all with 50 clusters. It is sometimes helpful to perform hierarchical clustering prior to K-means clustering; examining the resulting heat map and dendrogram gives the user some idea of how many clusters to specify.

6. Type a new Column Name for the resulting column or use the default name. If desired, check the Overwrite box to overwrite a previously existing column with the same name (see Basic Protocol 1, step 6).

Figure 7.9.15 The PCA dialog box allows users to specify which Value Columns should be included in the calculation, as well as variables such as the number of desired components.

Figure 7.9.16 PCA results are displayed as 2-D or 3-D plots according to the user's specifications.

7. Select whether to create a 2D or 3D scatter plot showing the Principal Components, or perform the PCA calculations without creating a scatter plot by clearing the Create Scatter Plot check box. The 3D scatter plot can be rotated (Ctrl + right mouse button) or zoomed (Shift + right mouse button) to assist visualization.

8. Check the Generate Report box. This report is an HTML page containing information about the calculation. If the user does not wish to generate the report, the box can be left unchecked.

9.
Click OK. The Principal Components are now calculated and the results added to the data set as new columns. A new scatter plot and report are created according to the settings chosen in this protocol (Fig. 7.9.16). Note that the PCA tool in Spotfire is limited to 2000 columns of transposed data (i.e., 2000 records in the original data). If more records are present when the tool is run, they will be eliminated from the data.

SUPPORT PROTOCOL 3

TRANSPOSING DATA IN SPOTFIRE DECISIONSITE

The Transpose Data tool is used to rotate a dataset so that columns (measurements or experiments) become rows (genes) and vice versa. Transposition is often necessary to present data for a certain type of visualization (e.g., Principal Components Analysis; see Basic Protocol 6), or simply to get a good overview of the data.

Consider Table 7.9.2 as an example. As more and more genes are added, the table grows taller. (Most typical microarrays contain thousands to tens of thousands of genes.) While useful during data collection, this may not be the format of choice for certain types of visualizations or calculations. Transposing this table yields the format shown in Table 7.9.3.

Table 7.9.2 Typical Affymetrix or Two-Color Microarray Data

Gene Name | Experiment 1 | Experiment 2 | Experiment 3
Gene A | 250 | 283 | 219
Gene B | 1937 | 80 | 1655
Gene C | 71 | 84 | 77
Gene D | 47358 | 131 | 39155
Gene E | 28999 | 24107 | 24981
Gene F | 689 | 801 | 750
Gene G | 2004 | 2371 | 2205

Analyzed microarray data typically consist of several rows, each representing a gene or a probe on the array, and several columns, each corresponding to a different experiment (e.g., different tumors or treatments). This is the "tall-skinny" format.
Table 7.9.3 Microarray Data After Transposition

Experiment | Gene A | Gene B | Gene C | Gene D | Gene E | Gene F | Gene G
Experiment 1 | 250 | 1937 | 71 | 47358 | 28999 | 689 | 2004
Experiment 2 | 283 | 80 | 84 | 131 | 24107 | 801 | 2371
Experiment 3 | 219 | 1655 | 77 | 39155 | 24981 | 750 | 2205

After transposition, the data are flipped so that each row represents an experiment and each column represents the observations for a gene. This "short-wide" format is suitable for data visualization techniques like PCA.

Necessary Resources

Hardware, software, and data file requirements are identical to those listed in Support Protocol 2.

1. Open Transpose Data Wizard 1 by clicking Analysis, followed by Data Preparation, followed by Transpose Data in the Tools pane of the DecisionSite Navigator.

2. Select an identifier column from the drop-down list.
Each value in this column will become a column name in the transposed dataset.

3. Select whether to create columns from All Records or Selected Records (see Basic Protocol 1, step 3). The transposed data will have exactly as many columns as there are records in the original data, up to an upper limit of 2000; the rest of the data will be truncated.

4. Click on Next to open Transpose Data Wizard 2.

5. Select the columns to be included in the transposition and then click Add>>. Each selected column will become a record in the new dataset.

6. Click on Next to open Transpose Data Wizard 3.

7. If needed, select Annotation Columns. Each transposed column is annotated with the value of this column.

8. Click Finish. A message box opens prompting the user to save previous work.

9. Click Yes to save the data. The transposed data now replace the previous data set. Note that the user should save the previous data set under a different file name to avoid losing it.

BASIC PROTOCOL 7

USING WEB LINKS TO QUERY THE INTERNET FOR USEFUL INFORMATION

The Web Links tool enables users to send a query to an external Web site to search for information about marked records. The search results are displayed in a separate Web browser. The Web Links tool ships with a number of predefined Web sites that are ready to use, though the user can easily set up new links to Web sites of their choice.
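The tall-skinny to short-wide rotation performed by the Transpose Data wizard (Support Protocol 3; compare Tables 7.9.2 and 7.9.3) can be sketched in plain Python, with the identifier column's values becoming column names and each experiment becoming a row:

```python
# Sketch of the transposition in Support Protocol 3 (Tables 7.9.2 / 7.9.3),
# using plain dictionaries rather than the Spotfire wizard.
tall = {                       # identifier column "Gene Name" -> values
    "Gene A": [250, 283, 219],
    "Gene B": [1937, 80, 1655],
    "Gene C": [71, 84, 77],
}
experiments = ["Experiment 1", "Experiment 2", "Experiment 3"]

# each gene becomes a column name; each experiment becomes a record
wide = {exp: {gene: values[i] for gene, values in tall.items()}
        for i, exp in enumerate(experiments)}

print(wide["Experiment 2"]["Gene B"])  # -> 80
```

As in the wizard, the transposed structure has one column per original record, which is why Spotfire's 2000-record cap on transposed data exists.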
Necessary Resources

Hardware, software, and data file requirements are identical to those listed in Support Protocol 2.

Sending a query using Web links

In order to send a query, the data must be in Spotfire DecisionSite. The query is sent for the records marked in the visualizations. If more than one record is marked, the records are separated in the query by the Web link delimiter (specified under Web Links Options).

1a. In a particular visualization, mark the records for which information is desired.

2a. Click on Access, followed by Web Links in the Tools pane of the DecisionSite Navigator. The Web Links dialog box will be displayed (Fig. 7.9.17).

Figure 7.9.17 The Web Links dialog box allows users to specify the Web site to search and the Identifier column from which to formulate the query.

3a.
Click to select the link to the Web site where the query will be sent. Some Web sites only allow searching for one item at a time.

4a. If there are no hits from a search, mark one record at a time in the visualizations and try again.

5a. Select the Identifier Column to be used as input to the query. Any column in the data set can be chosen.

6a. Click OK. The query is sent to the Web site and the results are displayed in a new Web browser (Fig. 7.9.18).

Setting up a new Web link

1b. Click on Access, followed by Web Links in the Tools pane of the DecisionSite Navigator. The Web Links dialog box will be displayed.

2b. Click on Options to display the Web Links Options dialog box.

3b. Click on New. A new Web Link will be created and selected in the list of Available Web Links. The Preview shows what the finished query will look like when it is sent.

4b. Edit the name of the new link in the Web Link Name text box.

5b. Edit the URL of the Web link. Use a dollar sign within curly brackets, {$}, as a placeholder for the ID. Anything entered between the left bracket and the dollar sign will be placed before each ID in the query; in the same way, anything placed between the dollar sign and the right bracket will be placed after each ID.

Figure 7.9.18 Results of a Web Link query are displayed in a new Web browser window. In this particular example, a significant outlier list of genes (GenBank accession numbers) was queried using a Gene Annotation Database (created at the Hartwell Center for Bioinformatics and Biotechnology), and the results returned included Gene Descriptions and Gene Ontologies (UNIT 7.2) for the queried records.

6b. Enter the Delimiter to separate the IDs in a query. The identifiers in a query with more than one record are put together in one search string, separated by the selected delimiter. The delimiters AND, OR, or ONLY can be used.
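The placeholder expansion described in steps 5b and 6b can be sketched as follows; the URL template and accession numbers here are illustrative, not a real Web Link shipped with Spotfire:

```python
# Sketch: expanding a Web Link URL template containing the {$} placeholder
# for a set of marked-record IDs (template and IDs are invented examples).
def build_query(template, ids, delimiter=" OR "):
    start = template.index("{")
    end = template.index("}", start)
    # text around the dollar sign becomes a per-ID prefix/suffix
    prefix, suffix = template[start + 1:end].split("$")
    joined = delimiter.join(prefix + i + suffix for i in ids)
    return template[:start] + joined + template[end + 1:]

url = build_query("http://example.org/search?term={$[accn]}",
                  ["NM_000546", "NM_002524"])
print(url)
# -> http://example.org/search?term=NM_000546[accn] OR NM_002524[accn]
```

Changing the `delimiter` argument corresponds to choosing AND, OR, or ONLY in the Web Links Options dialog.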
The ONLY delimiter is useful when specifying genes differentially expressed at only one point in time, or genes that result in classification of only a particular kind of tumor.

7b. Click OK. The new Web Link will be saved and displayed together with the other available Web Links in the user interface.

Editing a Web link

1c. Click on Access, followed by Web Links in the Tools pane of the DecisionSite Navigator. The Web Links dialog box will be displayed.

2c. Click on Options to display the Web Links Options dialog box.

3c. Click on the Web Link to be edited in the list of Available Web Links. The Web Link Name, URL, and Delimiter for the selected Web Link will be displayed and can be edited directly in the corresponding fields. All changes are reflected in the Preview, which shows what the finished query will look like.

4c. Make the desired changes to the Web Link and click OK. The Web Link will be updated according to the changes, and the Web Links Options dialog box will close.

Removing a Web link

1d. Click on Access, followed by Web Links in the Tools pane of the DecisionSite Navigator. The Web Links dialog box is displayed.

2d. Click on Options to display the Web Links Options dialog box.

3d. Click on the Web Link to be removed in the list of Available Web Links. The Web Link Name, URL, and Delimiter for the selected Web Link will be displayed in the corresponding fields.

4d. Click Delete to clear all of the fields. Several Web Links can be deleted at the same time by selecting them in the list of Available Web Links (press Ctrl and click to select more than one) and clicking Delete. If some of the default Web Links are deleted by mistake, they can be retrieved by clicking the Add Defaults button.
This adds all of the default links to the Available Web Links list, regardless of whether or not the links already exist.

BASIC PROTOCOL 8

GENERATING NEW COLUMNS OF DATA IN SPOTFIRE

New columns with numerical values can be computed from the current data set using mathematical expressions. This protocol describes how to create and evaluate such expressions.

Occasionally the columns included in a data set do not allow users to perform all necessary operations, or to create the visualizations needed to fully explore the data set. In many such cases, the necessary information can be computed from existing columns, and Spotfire provides the option to calculate new columns by applying mathematical operators to existing values. For example, when dealing with multiple array experiments it may be necessary to compute the fold change, which can be obtained by dividing the normalized signal values of the experimental array by the normalized signal values of the control array for every gene. For a discussion of normalizing data, see UNIT 7.8.

This protocol discusses dividing two columns as an example; other calculations can be performed similarly. Spotfire supports the functions listed in Table 7.9.4 in expressions used for calculating new columns.
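The fold-change computation described above (experimental signal divided by control signal, per gene) can be sketched outside Spotfire as a simple column division; the signal values below are invented for illustration:

```python
# Sketch: computing a fold-change column by dividing two signal columns,
# as in Basic Protocol 8 (toy normalized signal values).
cy5 = [1200.0, 450.0, 98.0, 16000.0]   # experimental (e.g., 635-nm) signals
cy3 = [600.0, 900.0, 100.0, 4000.0]    # control (e.g., 532-nm) signals

# guard against division by zero, which Spotfire reports as #NUM
fold_change = [a / b if b != 0 else None for a, b in zip(cy5, cy3)]
print(fold_change)  # -> [2.0, 0.5, 0.98, 4.0]
```

Note the zero-denominator guard: as Table 7.9.4 indicates, dividing by zero in a Spotfire expression yields an error value rather than a number.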
Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixel resolution (the user may benefit from much higher RAM and significantly better processor speed), or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256-color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC: Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000; Microsoft Internet Explorer 5.0 through 6.0; Spotfire 6.2 or above; Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through 2.5 (2.50.4403.12)

Table 7.9.4 Description of Various Functions Available in Spotfire

Function | Format | Description
ABS | ABS(Arg1) | Returns the unsigned value of Arg1
ADD | Arg1 + Arg2 | Adds the two real number arguments and returns a real number result
CEIL | CEIL(Arg1) | Returns Arg1 rounded up, i.e., the smallest integer that is ≥ Arg1
COS | COS(Arg1) | Returns the cosine of Arg1 (argument in radians)
DIVIDE | Arg1 / Arg2 | Divides Arg1 by Arg2 (real numbers); an error if Arg2 is zero
EXP | EXP(Arg1, Arg2) or Arg1 ^ Arg2 | Raises Arg1 to the power of Arg2
FLOOR | FLOOR(Arg1) | Returns the largest integer that is ≤ Arg1 (i.e., rounds down)
LOG | LOG(Arg1) | Returns the base-10 logarithm of Arg1
LN | LN(Arg1) | Returns the natural logarithm of Arg1
MAX | MAX(Arg1, Arg2, ...) | Returns the largest of the real number arguments (null arguments are ignored)
MIN | MIN(Arg1, Arg2, ...) |
Returns the smallest of the real number arguments (null arguments are ignored) MOD MOD(Arg1, Arg2) Returns the remainder from integer division ∗ MULTIPLY Arg1 Arg2 Multiplies two real number arguments to yield a real number result NEG NEG(Arg1) Negates the argument SQRT SQRT(Arg1) Returns the square root of Arg1c SUBTRACT Arg1 − Arg2 Subtracts Arg2 from Arg1 (real numbers) to yield a real number result SIN SIN(Arg1) Returns the sine of Arg1a TAN TAN(Arg1) Returns the tangent of Arg1a a The argument is in radians. b If Arg2 is zero, this function results in an error. Examples: 7/2 yields 3.5, 0/0 yields #NUM, 1/0 yields #NUM. c The result can also be attained by supplying an Arg2 of 0.5 using the EXP function. Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations) Macintosh: Operating system (OS) 7.5.3 or later Citrix ICA client (http://www.citrix.com) Open Transport TCIP/IP Version 1.1.1 (or later) Files Data files (e.g., .met files, .gpr files) Analyzing and Visualizing Expression Data with Spotfire Dividing two columns 1. Initiate a new Spotfire session and load data. For example load a few data columns from a .gpr file. 7.9.32 Supplement 7 Current Protocols in Bioinformatics Figure 7.9.19 Right clicking in the Query Devices window allows generation of new columns. Figure 7.9.20 The New Columns dialog box. Analyzing Expression Analysis 7.9.33 Current Protocols in Bioinformatics Supplement 7 2. Right click in the query devices window. From the resulting pop-up menu (Fig. 7.9.19), chose New Column, followed by From Expression. A New Column dialog box will appear (Fig. 7.9.20). 3. From the Operators drop-down list, select “/” (Table 7.9.4). 4. Select the desired columns for Arguments 1 and 2. 
For example, select the normalized 635 (Cy-5) signal column as Argument 1 and the normalized 532 (Cy-3) signal column as Argument 2. 5. Click Insert Function. 6. Click Next >. 7. Enter a name for the new column, for example, fold change. If the function just created can be used again later, save it as a Favorite by clicking Add To Favorites. After being saved, it will appear in the list of Favorites, and can be used again by selecting it and clicking Insert Favorite. 8. Click Finish. A new column of data will be added to the session. BASIC PROTOCOL 9 EXPORTING SPOTFIRE VISUALIZATIONS Microarray data analysis techniques usually involve rigorous computation. Most steps can be tracked and understood by novice users through the use of visualizations in two or three dimensions with a striking use of colors to demonstrate changes or groupings. UNIT 7.7 provides a detailed discussion of modifying and enhancing visualizations. It is often desirable to export these visualizations from within Spotfire to other applications. Currently, Spotfire visualizations can be exported in several ways: to Microsoft Word, to Microsoft PowerPoint, as a Web page, copied to the clipboard, or saved as an image from the File menu.
Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixel resolution (users may benefit from much higher RAM and a significantly faster processor) or Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256-color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC: Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000; Microsoft Internet Explorer 5.0 through 6.0; Spotfire 6.2 or above; Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12); Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server; Microsoft PowerPoint, Word, and Excel (optional, for Spotfire features related to export of text results or visualizations)
Macintosh: Operating system (OS) 7.5.3 or later; Citrix ICA client (http://www.citrix.com); Open Transport TCP/IP Version 1.1.1 (or later)

Files
Data files (e.g., .met files, .gpr files)

Figure 7.9.21 The Microsoft Word Presentation dialog box.

Exporting visualizations to Word
The Microsoft Word export tool exports the active visualization(s) to a Microsoft Word document. Each visualization is added to a new page in the document along with its annotation, title, and legend. Note that Microsoft Word must be installed on the machine.
1a. Create visualizations in Spotfire (UNIT 7.7) and, if necessary, edit the Titles and Annotations.
2a. Click on Reporting, followed by Microsoft Word, in the Tools pane of DecisionSite Navigator. A dialog box will be displayed listing all the visualizations that can be exported (Fig. 7.9.21).
3a. Click to select the visualizations to be exported.
To select all, click on Select All. For multiple selections, hold down the Ctrl key and select the desired visualizations. 4a. Click OK. An instance of Microsoft Word will be displayed that contains the selected visualizations. Exporting visualizations to PowerPoint The Microsoft PowerPoint export tool exports the active visualization(s) to a Microsoft PowerPoint document. Each visualization is added to a new slide in the document along with its annotation, title, and legend. Note that Microsoft PowerPoint must be installed on the machine. 1b. Create visualizations in Spotfire and, if necessary, edit the Titles and Annotations. 2b. Click on Reporting, followed by Microsoft PowerPoint, in the Tools pane of DecisionSite Navigator. A dialog box will be displayed listing all the visualizations that can be exported, similar to the one displayed in Figure 7.9.21. 3b. Click to select the visualizations to be exported. To select all, click on Select All. For multiple selections, hold down the Ctrl key and select the desired visualizations. 4b. Click OK. An instance of Microsoft PowerPoint will be displayed that contains the selected visualizations. Exporting visualizations as a Web page The Export as Web Page tool exports the current visualizations as an HTML file and a set of images. The user can also include annotations, titles, and legends for the visualizations. 1c. Create the desired visualizations and set the query devices. If multiple visualizations are to be included, ensure that they are all visible and are in the right proportions. This is important because, unlike the export to Word or PowerPoint features, where each visualization is pasted on a new page (or slide) in the document, all visualizations are exported to the same page in this case. Visualizations are included in the report exactly as they are visible on the screen.
Multiple visualizations can be tiled by clicking Window, followed by Auto Tile. 2c. Click on Reporting, followed by Export as Web Page, in the Tools pane of DecisionSite Navigator. The Export as Web Page dialog box will be displayed (Fig. 7.9.22). 3c. Enter a report header. This header will appear at the top of the Web page report. 4c. Check the options to include in the report. Figure 7.9.22 The Export as Web Page dialog box. Figure 7.9.23 Data exported from a Spotfire session to the Web is displayed as a Web page report containing all the images as well as marked records. These options include Legend, Annotations, SQL query (corresponding to the current query devices setting), and a table of currently marked records. 5c. Select a graphic output format for the exported images (.jpg or .png). 6c. Click Save As. Enter a file name and a directory where the report is to be saved. The HTML report will be saved in the designated directory along with a subfolder containing the exported images. 7c. If desired, select View Report After Saving. A browser window will be launched, displaying the report (Fig. 7.9.23). Copying to the clipboard This tool enables users to copy any active visualization to the clipboard and paste it into another application. 1d. Create the desired visualizations and set the Query Devices. 2d. From the Edit menu, click on Copy Special, followed by Visualization (Fig. 7.9.24). The active visualization will be copied to the clipboard. 3d. Open an instance of the desired application and paste from the clipboard. Exporting visualizations from the File menu This option allows users to export the current visualization from the File menu as an image file. 1e. Create the desired visualizations and set the Query Devices.
Figure 7.9.24 Exporting the currently active visualization using the Copy Special, Visualization mode. Figure 7.9.25 The Export Visualization dialog box. 2e. Click on File, followed by Export, followed by Current Visualization. The Export Visualization dialog box will open (Fig. 7.9.25). 3e. Select whether to Include Title or use the default for the visualization to be exported. The title is exported along with the visualization. 4e. Select Preserve Aspect Ratio or change the size of the visualization to be exported by changing the aspect settings. 5e. Click OK. 6e. In the ensuing window, choose the directory in which to save the visualization. Also specify the format in which the visualization should be saved. Available choices include bitmap (.bmp), JPEG image (.jpg), PNG image (.png), and Extended Windows Metafile (.emf). 7e. Click Save. GUIDELINES FOR UNDERSTANDING RESULTS The goal of most microarray experiments is to survey patterns of gene expression by assaying the expression levels of thousands to tens of thousands of genes in a single assay. Typically, RNA is first isolated from different tissues, developmental stages, disease states, or samples subjected to appropriate treatments. The RNA is then labeled and hybridized to the microarrays using an experimental strategy that allows expression to be assayed and compared between appropriate sample pairs. Common strategies include the use of a single label and independent arrays for each sample (Affymetrix), or a single array with distinguishable fluorescent dye labels for the individual RNAs (most homemade two-color spotted microarray platforms). Irrespective of the type of platform chosen, microarray data analysis is a challenge. The hypothesis underlying microarray analysis is that the measured intensities for each arrayed gene represent its relative expression level.
Biologically relevant patterns of expression are typically identified by comparing measured expression levels between different states on a gene-by-gene basis. Before the levels can be compared appropriately, a number of transformations must be carried out on the data to eliminate questionable or low-quality measurements, to adjust the measured intensities to facilitate comparisons, and to select genes that are significantly differentially expressed between classes of samples. Most microarray experiments investigate relationships between related biological samples based on patterns of expression, and the simplest approach looks for genes that are differentially expressed. Although ratios provide an intuitive measure of expression changes, they have the disadvantage of treating up- and down-regulated genes differently. For example, genes up-regulated by a factor of two have an expression ratio of two, whereas those down-regulated by the same factor have an expression ratio of 0.5. The most widely used alternative transformation of the ratio is the logarithm base two, which has the advantage of producing a continuous spectrum of values and treating up- and down-regulated genes in a similar fashion. Normalization adjusts the individual hybridization intensities to balance them appropriately so that meaningful biological comparisons can be made. There are a number of reasons why data must be normalized, including unequal quantities of starting RNA, differences in labeling or detection efficiencies between the fluorescent dyes used, and systematic biases in the measured expression levels. Expression data can be mined efficiently if the problem of similarity is converted into a mathematical one by defining an expression vector for each gene that represents its location in expression space.
In this view of gene expression, each experiment represents a separate, distinct axis in space, and the log2(ratio) measured for that gene in that experiment represents its geometric coordinate. For example, if there are three experiments, the log2(ratio) for a given gene in experiment 1 is its x coordinate, the log2(ratio) in experiment 2 is its y coordinate, and the log2(ratio) in experiment 3 is its z coordinate. It is then possible to represent all the information obtained about that gene by a point in x-y-z expression space. A second gene, with nearly the same log2(ratio) values for each experiment, will be represented by a (spatially) nearby point in expression space; a gene with a very different pattern of expression will be far from the original gene. This model can be generalized to any number of experiments: the dimensionality of expression space equals the number of experiments. In this way, expression data can be represented in n-dimensional expression space, where n is the number of experiments, and each gene-expression vector is represented as a single point in that space. Having been provided with a means of measuring distance between genes, clustering algorithms sort the data and group genes together on the basis of their separation in expression space. It should also be noted that if the interest is in clustering experiments, it is possible to represent each experiment as an experiment vector consisting of the expression values for each gene; these define an experiment space, the dimensionality of which is equal to the number of genes assayed in each experiment. Again, by defining distances appropriately, it is possible to apply any of the clustering algorithms described here to analyze and group experiments. To interpret the results from any analysis of multiple experiments, it is helpful to have an intuitive visual representation.
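Both ideas above, the symmetric log2 transform and distance in expression space, can be illustrated with a short Python sketch. The expression vectors here are invented purely for illustration:

```python
import math

# A 2-fold increase and a 2-fold decrease become symmetric after log2
print([math.log2(r) for r in (4.0, 2.0, 1.0, 0.5, 0.25)])
# [2.0, 1.0, 0.0, -1.0, -2.0]

def euclidean(a, b):
    """Distance between two gene-expression vectors; each coordinate is
    the log2(ratio) measured in one experiment."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical log2(ratio) vectors for three genes across three experiments
gene_a = [1.0, 2.0, -1.0]
gene_b = [1.1, 1.9, -0.9]   # nearly the same pattern as gene_a
gene_c = [-2.0, 0.5, 3.0]   # a very different pattern

# Genes with similar patterns lie close together in expression space
print(euclidean(gene_a, gene_b) < euclidean(gene_a, gene_c))  # True
```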
A commonly used approach relies on the creation of an expression matrix in which each column of the matrix represents a single experiment and each row represents the expression vector for a particular gene. Coloring each of the matrix elements on the basis of its expression value creates a visual representation of gene-expression patterns across the collection of experiments. There are countless ways in which the expression matrix can be colored and presented. The most commonly used method colors genes on the basis of their log2(ratio) in each experiment, with log2(ratio) values close to zero colored black, those with log2(ratio) values greater than zero colored red, and those with negative values colored green. For each element in the matrix, the relative intensity represents the relative expression, with brighter elements being more highly differentially expressed. For any particular group of experiments, the expression matrix generally appears without any apparent pattern or order. Programs designed to cluster data generally re-order the rows, columns, or both, such that patterns of expression become visually apparent when presented in this fashion. Before clustering the data, there are two further questions that need to be considered. First, should the data be adjusted in some way to enhance certain relationships? Second, what distance measure should be used to group related genes together? In many microarray experiments, the data analysis can be dominated by the variables that have the largest values, obscuring other, important differences. One way to circumvent this problem is to adjust or re-scale the data, and there are several methods in common use with microarray data. For example, each vector can be re-scaled so that the average expression of each gene is zero: a process referred to as mean centering. In this process, the basal expression level of a gene is subtracted from each experimental measurement. 
This has the effect of enhancing the variation of the expression pattern of each gene across experiments, without regard to whether the gene is primarily up- or down-regulated. This is particularly useful for the analysis of time-course experiments, in which one might like to find genes that show similar variation around their basal expression level. The data can also be re-scaled so that the minimum and maximum take fixed values, or so that the 'length' of each expression vector is one. Various clustering techniques have been applied to the identification of patterns in gene-expression data. Most cluster analysis techniques are hierarchical; the resultant classification has an increasing number of nested classes, and the result resembles a phylogenetic classification. Nonhierarchical clustering techniques also exist, such as K-means clustering, which simply partition objects into different clusters without trying to specify the relationship between individual elements. Clustering techniques can further be classified as divisive or agglomerative. A divisive method begins with all elements in one cluster that is gradually broken down into smaller and smaller clusters. Agglomerative techniques start with (usually) single-member clusters and gradually fuse them together. Finally, clustering can be either supervised or unsupervised. Supervised methods use existing biological information about specific genes that are functionally related to guide the clustering algorithm. However, most methods are unsupervised, and these are dealt with first. Although cluster analysis techniques are extremely powerful, great care must be taken in applying this family of techniques.
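The re-scaling adjustments described above, mean centering and scaling an expression vector to unit length, are straightforward to sketch. The log2-ratio values below are hypothetical:

```python
def mean_center(vector):
    """Subtract the gene's average (basal) expression level from each
    experimental measurement, so the vector varies around zero."""
    mean = sum(vector) / len(vector)
    return [v - mean for v in vector]

def unit_length(vector):
    """Re-scale so the 'length' (Euclidean norm) of the vector is one."""
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector]

gene = [3.0, 5.0, 4.0]        # hypothetical values across three experiments
print(mean_center(gene))      # [-1.0, 1.0, 0.0]
scaled = unit_length(gene)
print(round(sum(v * v for v in scaled), 6))  # 1.0
```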
Even though the methods used are objective in the sense that the algorithms are well defined and reproducible, they are still subjective in the sense that selecting different algorithms, different normalizations, or different distance metrics, will place different objects into different clusters. Furthermore, clustering unrelated data will still produce clusters, although they might not be biologically meaningful. The challenge is therefore to select the data and to apply the algorithms appropriately so that the classification that arises partitions the data sensibly. Hierarchical Clustering Hierarchical clustering is simple and the result can be visualized easily. It is an agglomerative type of clustering in which single expression profiles are joined to form groups, which are further joined until the process has been carried to completion, forming a single hierarchical tree. First, the pairwise distance matrix is calculated for all of the genes to be clustered. Second, the distance matrix is searched for the two most similar genes or clusters; initially each cluster consists of a single gene. This is the first true stage in the clustering process. Third, the two selected clusters are merged to produce a new cluster that now contains at least two objects. Fourth, the distances are calculated between this new cluster and all other clusters. There is no need to calculate all distances as only those involving the new cluster have changed. Last, steps two through four are repeated until all objects are in one cluster. There are several variations on hierarchical clustering that differ in the rules governing how distances are measured between clusters as they are constructed. Each of these will produce slightly different results, as will any of the algorithms if the distance metric is changed. Typically for gene-expression data, average-linkage clustering gives acceptable results. 
K-Means Clustering If there is advance knowledge about the number of clusters that should be represented in the data, K-means clustering is a good alternative to hierarchical methods. In K-means clustering, objects are partitioned into a fixed number (K) of clusters, such that the clusters are internally similar but externally dissimilar. First, all objects are randomly assigned to one of K clusters (where K is specified by the user). Second, an average expression vector is then calculated for each cluster, and this is used to compute the distances between clusters. Third, using an iterative method, objects are moved between clusters, and intra- and intercluster distances are measured with each move. Objects are allowed to remain in the new cluster only if they are closer to it than to their previous cluster. Fourth, after each move, the expression vectors for each cluster are recalculated. Last, the shuffling proceeds until moving any more objects would make the clusters more variable, increasing intracluster distances and decreasing intercluster dissimilarity. Self-Organizing Maps A self-organizing map (SOM) is a neural-network-based divisive clustering approach that assigns genes to a series of partitions on the basis of the similarity of their expression vectors to reference vectors that are defined for each partition. Before initiating the analysis, the user defines a geometric configuration for the partitions, typically a two-dimensional rectangular or hexagonal grid. Random vectors are generated for each partition, but before genes can be assigned to partitions, the vectors are first trained using an iterative process that continues until convergence, so that the data are most effectively separated. In choosing the geometric configuration for the clusters, the user is, effectively, specifying the number of partitions into which the data is to be divided.
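A minimal K-means sketch follows. It uses the common assign-then-recompute (Lloyd) iteration, a close relative of the object-shuffling procedure described above rather than an exact transcription of it, and the data are invented:

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """K-means sketch: assign each expression vector to the nearest
    cluster mean, then recompute the means, repeating until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each object to its closest cluster center
        for idx, p in enumerate(points):
            assignment[idx] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
        # Recompute each cluster's average expression vector
        for c in range(k):
            members = [p for idx, p in enumerate(points) if assignment[idx] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

data = [[0.0, 0.1], [0.2, 0.0], [4.0, 4.1], [4.2, 4.0]]
labels = kmeans(data, k=2)
print(labels[0] == labels[1] and labels[2] == labels[3])  # True
print(labels[0] != labels[2])                             # True
```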
As with K-means clustering, the user has to rely on some other source of information, such as PCA, to determine the number of clusters that best represents the available data. Principal Component Analysis An analysis of microarray data is a search for genes that have similar, correlated patterns of expression. This indicates that some of the data might contain redundant information. For example, if a group of experiments were more closely related than the researcher had expected, it would be possible to ignore some of the redundant experiments, or to use an average of the redundant measurements, without losing information. Principal component analysis (PCA) is a mathematical technique that reduces the effective dimensionality of gene-expression space without significant loss of information, while also allowing the user to pick out patterns in the data. PCA allows the user to identify those views that give the best separation of the data. This technique can be applied to both genes and experiments as a means of classification. PCA is best used in combination with another classification technique, such as K-means clustering or SOMs, that requires the user to specify the number of clusters. COMMENTARY Background Information DNA microarray analysis has become one of the most widely used techniques in modern molecular genetics, and laboratory protocols developed in recent years have led to increasingly robust assays. The application of microarray technologies affords great opportunities for exploring patterns of gene expression and allows users to begin investigating problems ranging from deducing biological pathways to classifying patient populations. As with all assays, the starting point for developing a microarray study is planning the comparisons that will be made.
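The dimensionality-reduction idea behind PCA, discussed above, can be sketched with a simple power iteration on the covariance matrix. This is an illustrative toy with two deliberately redundant "experiments"; real analyses use dedicated linear-algebra routines:

```python
import math

def first_principal_component(data, steps=200):
    """PCA sketch via power iteration: find the direction of greatest
    variance in mean-centered data. Projecting onto it reduces the
    dimensionality of expression space with minimal information loss."""
    n, dims = len(data), len(data[0])
    means = [sum(row[d] for row in data) / n for d in range(dims)]
    centered = [[row[d] - means[d] for d in range(dims)] for row in data]
    # Power iteration: repeatedly multiply by the covariance matrix C,
    # where C[i][j] = sum_k centered[k][i] * centered[k][j] / n
    v = [1.0] * dims
    for _ in range(steps):
        w = [sum(sum(row[i] * row[j] for row in centered) / n * v[j]
                 for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical genes measured in two highly correlated (redundant) experiments
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
pc1 = first_principal_component(data)
# The leading direction is close to (1, 2)/sqrt(5): the two experiments
# carry largely the same information.
print(abs(abs(pc1[1] / pc1[0]) - 2.0) < 0.1)  # True
```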
The simplest experimental designs are based on the comparative analysis of two classes of samples, either using a series of paired case-control comparisons or comparisons to a common reference sample, although other approaches have been described. In any case, the fundamental purpose of using arrays is generally a comparison of samples to find genes that are significantly different in their patterns of expression. Microarrays have led biological and pharmaceutical research to increasingly higher throughput because of the value they bring in measuring the expression of numerous genes in parallel. All these data lose much of their potential value, however, unless important conclusions can be extracted from the large data sets quickly enough to interpret the results and influence the next experimental and/or clinical steps. Generating and understanding robust and efficient tools for data mining, including experimental design, statistical analysis, data visualization, data representation, and database design, is therefore of paramount importance. Obtaining maximal value from experimental data involves a team effort that includes biologists, chemists, pharmacologists, statisticians, and software engineers. In this unit, the authors describe data analysis techniques used in their center for analysis of large volumes of homemade and commercial Affymetrix microarrays. An attempt has been made to describe microarray data analysis methods in a language that most biologists can understand. Benefiting from the knowledge and expertise of biologists, who ensure that the right experiments are carried out, is essential, in the authors' view, for correct interpretation of microarray data. Literature Cited Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868. Jolliffe, I.T. 1986.
Principal Component Analysis. Springer Series in Statistics. Springer-Verlag, New York. Kerr, M.K. and Churchill, G.A. 2001. Experimental design for gene expression microarrays. Biostatistics 2:183-201. MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics and Probability, Vol. I (L.M. Le Cam and J. Neyman, eds.) pp. 281-297. University of California Press, Berkeley, Calif. Sankoff, D. and Kruskal, J.B. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley Publishing, Reading, Mass. Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22:281-285. Contributed by Deepak Kaushal and Clayton W. Naeve St. Jude Children's Research Hospital Memphis, Tennessee Microarray Data Visualization and Analysis with the Longhorn Array Database (LAD) UNIT 7.10 One of the many hallmarks of DNA microarray research is the staggering quantity and heterogeneity of raw data that are generated. In addition to these raw data, it is also important that the experiments themselves be fully annotated for the purpose of long-term functional understanding and comparative analysis. Finally, the combination of experimental annotation and raw data must be linked to organism-specific genomic biological annotations for the results to have immediate and long-term biological relevance and meaning. To handle these challenges, biologists in recent years have enthusiastically developed and adopted software applications powered by the robustness of relational databases.
The Longhorn Array Database (LAD) is a microarray database that operates on the open-source combination of the relational database PostgreSQL and the operating system Linux (Brazma et al., 2001; Killion et al., 2003). LAD is a fully open-source version of the Stanford Microarray Database (SMD), one of the largest and most functionally proven microarray databases (Sherlock et al., 2001). It is also a fully MIAME-compliant database; the MIAME (Minimum Information About a Microarray Experiment) standard for describing microarray experiments is being adopted by many journals as a requirement for the submission of papers. The protocols presented in this unit detail the steps required to upload experimental data, visualize and analyze results, and perform comparative analysis of multiple-experiment datasets with LAD. Additionally, readers will learn how to effectively organize data for the purpose of efficient analysis, long-term warehousing, and open-access publication of primary microarray data. Each of these protocols is based on the assumption that one has access to an unrestricted user account on a fully deployed and configured LAD server. If one is a database curator who is in the position of having to maintain a LAD installation for a laboratory, core facility, or institution, Appendices A and B at the end of this unit, which deal with configuring global resources and setting up user accounts, will be of additional interest. Systems administrators who need to install LAD should consult Appendix C, which outlines the steps needed to do this, along with hardware recommendations. DATABASE LOCATION AND AUTHENTICATION Nearly all interaction with LAD is performed using a Web browser. The following protocol is a simple introduction to communication and authentication with LAD. Each of the subsequent protocols in this unit will assume that these steps have already been performed.
BASIC PROTOCOL 1 Necessary Resources

Hardware
Computer capable of running an up-to-date Web browser

Software
Web browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version 1.6 or higher

Contributed by Patrick J. Killion and Vishwanath R. Iyer. Current Protocols in Bioinformatics (2004) 7.10.1-7.10.60. Copyright 2004 by John Wiley & Sons, Inc. Analyzing Expression Patterns 7.10.1 Supplement 8

Figure 7.10.1 LAD front page.

Table 7.10.1 Links Available on the LAD Front Page (also see Fig. 7.10.1)
Register: Provides a resource that allows prospective new users to apply for a user account on the LAD server. This process is fully covered in Appendix A at the end of this unit.
Login: Provides the user account login portal. This function is detailed in the following steps of this protocol.
Resume: Provides the ability for a preauthenticated session to return to the LAD main menu without passing through the login portal. This function is only available if a user authenticates with LAD, navigates the browser to another Web site, and then wishes to return to LAD without closing the browser window. If the browser window is closed, re-authentication with LAD will be required by the server.
Tutorials: Provides basic tutorials on how to interact with and navigate the various functions of the LAD server.
Publications: Provides a comprehensive list of publications that have been created on this LAD server. Publications are a very exciting feature of the LAD environment and are fully covered in Basic Protocol 8.

1. Using Internet Explorer or Mozilla, navigate to http://[your lad server]/ilat/. Note that the phrase [your lad server] indicates the fully qualified network hostname of the LAD server to be accessed. The LAD front page, as pictured in Figure 7.10.1, will appear. Functions of the links available on this page are described in Table 7.10.1.
2. The user is now ready to authenticate with the LAD authentication mechanism.
This protocol and all of the following protocols of this unit will assume the pkillion username (those using their own LAD user account should modify accordingly). Select the Login link on the screen shown in Figure 7.10.1, then fill in the login screen with the following information: User Name: pkillion Password: lad4me The LAD main menu, as pictured in Figure 7.10.2, will appear. This will be the starting place for many of the protocols in the remainder of this unit. Figure 7.10.2 LAD main menu. EXPERIMENT SUBMISSION Experiment submission has several prerequisite actions that must be performed before a user is able to load microarray data into LAD. Appendix B at the end of this unit details these operations. They include the creation of SUID, plate, and print global resources to describe the specific microarray slide. Additionally, Appendix A at the end of the unit gives specific instructions for the creation of an unrestricted user account that will be utilized by an authenticated user to submit the experimental data. BASIC PROTOCOL 2 Necessary Resources Hardware Computer capable of running an up-to-date Web browser Software Web browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version 1.6 or higher Files The GenePix Pro results file (GPR): This file must be created with GenePix Pro (from Molecular Devices), version 3.x, 4.x, 5.x, or 6.x. It is highly recommended that this file not be edited in any way (especially with Microsoft Excel) once the GenePix Pro software has created it. This file is a text file, however, and can easily be read by custom analysis software applications. The GenePix Pro settings file (GPS): This file is the result of gridding operations performed within GenePix Pro. It is binary, rather than text, and is not processed in any way by LAD.
LAD stores it for the purpose of complete warehousing of all information important and relevant to the experimental data.
GenePix Pro green channel scan file (TIFF One): The single-channel TIFF image of the green (532 nm) wavelength. GenePix Pro has two options for the creation of TIFF images. First, images may be saved as multi-channel, combining the quantitation of both the green and red wavelengths into one file. Alternatively, images may be saved as single-channel; this format separates the green and red data into separate files. LAD is only compatible with the latter option: images must be saved as single-channel TIFF images in order to be loaded into LAD as part of an experiment. TIFF One represents one of these files, most likely the green channel.
GenePix Pro red channel scan file (TIFF Two): The single-channel TIFF image of the red (635 nm) wavelength. This file is the second of the two TIFF files that represent the image as captured by a microarray scanner.
Sample files: LAD provides sample files that can be used to upload an experiment for the included Saccharomyces cerevisiae print. These files will be referenced for the remainder of this protocol. If one is using one's own experiment files, one will need to substitute file information appropriately. These files are located in the directory /lad/install/yeast_example_print/. The sample experiment file names are:

250.gpr
250.gps
250 ch1.tif
250 ch2.tif

Each of these four files should be ready for submission. It is recommended that they be named with a consistent naming scheme that implies their functional linkage as a group of files relevant to a single microarray hybridization. For example, foo.gpr, goo.gps, twiddle_1.tif, and twaddle_b.tif would be an inappropriate naming scheme, while SC15-110.gpr, SC15-110.gps, SC15-110-green.tif, and SC15-110-red.tif would be an excellent reminder of experimental association.
In this sense, the example files are both appropriately and inadequately named: they share a consistent naming scheme (250), but this bare number is probably inadequate for describing the exact experiment that these files represent. It will be seen in the coming steps how consistency becomes important for archiving and querying many related experiments.

Copy the experiment files to the server

1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).

2. Experiment creation is a two-step process. The first step involves the placement of the experiment files into the user incoming directory. From the screen pictured in Figure 7.10.2, select the Upload Experiment Files link. The screen shown in Figure 7.10.3 should appear.

3a. To upload files from a local drive: One at a time, use the Browse buttons to locate each of the four files. Make sure to select the correct file for each label. When complete, press the Upload Experiment Files button.

This process may take from a few seconds to a few minutes depending on the speed of the network connection between the local computer and the LAD server. It is recommended that this operation not be performed over dial-up or other low-bandwidth connections.

Once complete, confirmation will be received as illustrated in Figure 7.10.4. This indicates that the files are now in the user incoming directory (on the LAD server). This is the purpose of the system account created in Appendix A at the end of this unit: to provide a user-specific locale in which experimental files may be placed for processing into the database. It is recommended that the provided example files be used at this stage, as they will be compatible with the print created in Appendix B.

Figure 7.10.3 Experiment file upload screen.
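As noted under Necessary Resources, the GPR results file is plain tab-delimited text and can be read by custom analysis software. A minimal Python sketch of such a reader follows; the header layout shown follows the Axon Text File convention used by GenePix Pro, but the specific header records and column names here are illustrative and vary with GenePix Pro version.

```python
import csv
import io

def read_gpr(handle):
    """Parse a GPR-style (Axon Text File) results file into dict rows.

    ATF files begin with a type line, a line giving the number of
    optional header records and data columns, then that many
    "Key=Value" header records, followed by a tab-delimited table.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)                       # file-type line, e.g. ATF 1.0
    n_headers, _ = next(reader)        # header-record count, data-column count
    headers = dict(
        next(reader)[0].split("=", 1) for _ in range(int(n_headers))
    )
    columns = next(reader)             # column names for the data table
    rows = [dict(zip(columns, row)) for row in reader]
    return headers, rows

# Tiny synthetic example (real GPR files carry dozens of columns):
sample = (
    "ATF\t1.0\n"
    "2\t4\n"
    '"Type=GenePix Results 3"\n'
    '"Scanner=hypothetical"\n'
    '"Block"\t"Name"\t"F635 Median"\t"F532 Median"\n'
    "1\tYAL001C\t820\t410\n"
)
headers, rows = read_gpr(io.StringIO(sample))
```

Reading the file this way preserves every column as a string; a real analysis tool would convert the intensity columns to numbers before filtering.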
Figure 7.10.4 Experiment file upload results screen.

3b. Alternative file transfer methods: Alternatively, one could choose any other method of transferring files from one's client machine to the LAD server. Technologies such as FTP, SFTP, Samba, and NFS might provide a superior solution for the secure transfer of larger file sets to the LAD server. If one is already working from an X-Windows console on the LAD server itself, one would simply need to copy the files to the appropriate incoming directory.

Experiment record submission

With the GPR, GPS, and TIFF files properly copied to the server, it is now time to submit the experiment to the database for processing.

4. From the screen pictured in Figure 7.10.2, select the Enter Experiments and Results link.

The screen shown in Figure 7.10.5 should appear. This screen allows one to select the correct organism and decide whether to use batch submission or single-experiment submission. Batch submission is useful for submitting a large number of experiments as a single transaction. It is outside the scope of this protocol to fully cover batch submission; please see the LAD Web site documentation for further details on this option.

5. From the Choose Organism pull-down menu, select "Saccharomyces cerevisiae." Choose the radio button labeled No under "Check here if you wish to submit Experiments and Results by batch," then press the button labeled Enter Experiments into LAD.

6. The screen shown in Figure 7.10.6 should now appear.

This is the screen that makes it possible to fully describe the experiment in the database. It is important that consistent and meaningful values be entered into the fields provided so that the long-term entry of experiments yields an environment that is navigable and thoughtfully organized. The following options are for sample purposes and apply to the sample files provided with the LAD distribution.
Fill out the experiment submission screen with the following values:

Figure 7.10.5 Experiment upload screen 1.
Figure 7.10.6 Experiment upload screen 2.

From the LAD Print Name pull-down menu, select SC1
Set Slide Name to a value of SC1-250
From the Data File Location pull-down menu, select 250.gpr
From the Grid File Location pull-down menu, select 250.gps
From the Green Scan File Location pull-down menu, select 250-ch1.tiff
From the Red Scan File Location pull-down menu, select 250-ch2.tiff
Leave MIAME Annotation File Location as "You have no MIAME file for loading."

The following fields and values are not depicted in the figure but should be set accordingly:

Leave Experiment Date as the current date
Set LAD Experiment Name to a value of Sample Experiment
Set LAD Experiment Description to a value of Set of Experiment Files that come with LAD
From the LAD Experiment Category pull-down menu, select Test
From the LAD Experiment SubCategory pull-down menu, select Test
Set Green Channel (CH1) Description to a value of Control Channel
Set Red Channel (CH2) Description to a value of Experimental Channel
From the Reverse Replicate pull-down menu, select N
From the Set Normalization Type pull-down menu, select Computed.

7. The Experiment Access section of this experiment submission screen, not depicted in the figure, can be left with its default selections for now.

Experiment access and issues of experiment ownership and security are fully discussed in Appendices A and B at the end of this unit (also see Basic Protocol 3, step 7). Additionally, microarray data normalization is further discussed within Guidelines for Understanding Results.

8. Press the "Load Experiment into LAD" button. A confirmation screen should now appear.
If errors appear, simply hit the browser Back button, fix them, and resubmit the screen by pressing the Load Experiment into LAD button.

Server-side experiment processing

9. Linux cron is a standard service, installed with every variety of Linux, that maintains and executes scheduled system tasks. This is both useful and required for the server-side experiment processing step of experiment record submission. If Linux cron has been previously configured to maintain and execute the loading script, then there is no required action at this time; proceed directly to the next step of this protocol. Conversely, if this script is not executing on a regular schedule, it will be necessary to execute it in order to process the submitted experiment into the database. Log into the LAD server as the root user and execute:

/lad/www-data/cgi-bin/SMD/queue/LoadExpt2DB.pl

Appendix C of this unit describes how certain operations must be automatically executed on the LAD server on some timed schedule. The authors suggest the utilization of the Linux cron functionality for this purpose. One of the specific resources that is mentioned is the script:

/lad/www-data/cgi-bin/SMD/queue/LoadExpt2DB.pl

When the experiment record submission steps of this protocol have been completed, a record is placed in an operational loading queue of submitted experiments. This queue cannot be processed in any way by the LAD Web interface. Rather, it must be monitored and processed by the script listed above.

Experiment loading report

10. The experiment submission confirmation screen mentioned in step 8 (not shown in figure) contains a link that makes it possible to watch the experiment being processed into the database.

This page will refresh regularly and will update itself when the experiment begins processing. The time of execution will depend on the action (i.e., the script) executed in step 9 of this protocol.
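If cron is adopted as Appendix C suggests, a single crontab entry for the root user is sufficient to keep the loading queue drained. The five-minute polling interval below is an arbitrary example, not a LAD requirement; adjust it to the expected submission volume:

```
# m  h  dom mon dow  command -- poll the LAD loading queue every 5 minutes
*/5  *  *   *   *    /lad/www-data/cgi-bin/SMD/queue/LoadExpt2DB.pl
```

The entry can be installed by running `crontab -e` as root on the LAD server.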
Additionally, other experiments in the queue could delay the processing of your experiment, as they are sequentially loaded into LAD.

Log examination

11. Carefully check the log for any error conditions the server may have detected.

If the experiment loads successfully, one can now proceed to the next protocol in this unit (Basic Protocol 3).

BASIC PROTOCOL 3

EXPERIMENT SEARCHING

In order to analyze experimental data, it is necessary to successfully find and aggregate experiments of interest. This protocol details the operations needed to accomplish this goal.

Necessary Resources

Hardware
Computer capable of running an up-to-date Web browser

Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher, or Mozilla version 1.6 or higher

Software preparation

1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).

2. Upload the data as in Basic Protocol 2.

Experiment searching

3. From the LAD main menu (Fig. 7.10.2) select the Results Search link.

The primary screen from which one can set filters to locate a desired set of experiments will then appear (Fig. 7.10.7). The search process is divided into three main methodologies:

Experimenter/Category/SubCategory search
Print search
Array List search

Figure 7.10.7 Experiment search screen.
Figure 7.10.8 Experiment search results screen.

All three of these methodologies, however, are globally filtered by the Select Organism filter. Each of these methods will now be demonstrated and described.

4. Perform organism filtering: From the Select Organism pull-down menu, select "Saccharomyces cerevisiae." Press the Limit Lists By Organism button to limit all lists present on the screen.

5a.
To perform an experimenter/category/subcategory search: The Experimenter, Category, and SubCategory lists are a direct reflection of one's group membership (see step 7, below) combined with the previous organism filter (step 4). It will only be possible to see experiments to which one has group or user-defined access. By default, the username that was entered at login, pkillion, is selected in the Experimenter list. Select the radio button Use Method 1 to use the first method of searching. Press the Display Data button to retrieve experiments that match the current filter settings.

A single experiment should now be seen, as only one experiment has been uploaded at this point. The display will be similar to that in Figure 7.10.8. Press the browser's Back button to return to the search filter screen.

5b. To perform a Print search: Select the radio button Use Method 2 to use the second method of searching. This makes it possible to search for experiments in a single specific print batch. Note that any filter settings applied in the Use Method 1 section are not applied to the search.

5c. To perform an Array List search: Select the radio button Use Method 3 to use the third method of searching. This method utilizes array lists, which are stored sets of ordered experiments that make it extremely easy to recall commonly grouped sets of experiments. Array lists are covered in much more detail in Basic Protocol 7.

Naming conventions

Figure 7.10.9 Experiment search results screen from the original LAD server at the Iyer Lab, University of Texas, Austin.

6. Figure 7.10.9 is from the original LAD server at the Iyer Lab, University of Texas at Austin. This screen is presented for two primary reasons. First, it is intended to convey what LAD search results will look like when a LAD database contains many more
experiments. Additionally and more importantly, these experiments are named in a very consistent and meaningful manner. It is important that grouped sets of experiments be uploaded in this manner: subsequent data analysis and storage will be greatly enhanced through this operational habit. This and other best practices for the long-term storage of microarray data are more fully elaborated upon in the Commentary.

Figure 7.10.10 Experiment editing screen.

Adding experiment group and user permissions

7. From the LAD main menu as depicted in Figure 7.10.2, select the Results Search link. Do not change the search methodology or filter values. Press the Display Data button to display the experiment uploaded in a previous protocol (the format will be as in Fig. 7.10.8). Select "Edit" to navigate to a complete experiment information-editing screen as shown in Figure 7.10.10.

From this screen, one has the ability to edit most information that was previously associated with the experiment at the time it was submitted to the database. The permissions options, not depicted in the figure but located at the very bottom of the screen, should be noted. One can utilize these Group and User entries to grant access to one's experiments to users and groups other than those in one's default group.

SINGLE-EXPERIMENT DATA ANALYSIS

LAD-based microarray data analysis is generally performed along one of two differing pathways of interaction. Single-experiment data analysis is focused upon both the qualitative and quantitative scrutiny of experimental data from a single microarray. This protocol focuses upon the toolsets and options available in LAD. Basic Protocol 5 will explore the other domain of LAD data investigation, multiple-experiment comparative analysis.
BASIC PROTOCOL 4

In this example, LAD is asked to show high-quality spot data, joined dynamically with genomic annotations, sorted in order of chromosome, where the expression of the transcripts represented by each of the spots is significantly up-regulated.

Necessary Resources

Hardware
Computer capable of running an up-to-date Web browser

Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher, or Mozilla version 1.6 or higher

Software preparation

1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1). From the LAD main menu (Fig. 7.10.2), select the Results Search link and navigate to the post-search experiment list (Fig. 7.10.8) as described in Basic Protocol 3.

Experimental data browsing

2. For each experiment, there will be a set of icons as pictured in Figure 7.10.11. Each icon leads to a unique function with respect to single-experiment data analysis.

3. Browsing, filtering, and analysis of the original uploaded data, in line with genomic annotations depending on the organism being utilized, are available through the Data icon. For the sample experiment uploaded previously (Basic Protocols 2 and 3), select the Data icon. A screen will appear that is similar to the one pictured in Figure 7.10.12.

Figure 7.10.11 Single experiment analysis icons.
Figure 7.10.12 Single experiment data filtering.

4. The screen depicted in Figure 7.10.12 provides access to several functions that make it possible to investigate the experimental data.

Set the Sort By pull-down menu to a value of Chromosome.
Set the Sort By order pull-down menu (to the right of the above) to a value of Descending.
Using the Ctrl key (or the Cmd key on Macintosh), add the following values to the Display list:
Log(base2) of R/G Normalized Ratio (Median)
Log(base2) of R/G Normalized Ratio (Mean).
Select the following values from the Annotation list (to the right of the Display list):
Function
Gene.
Select the check box for Make downloadable files of ALL returned records.
Activate the default filters #1, #2, and #3 by selecting their corresponding check boxes.
Activate an additional filter #4 by selecting its corresponding check box.
Change the filter #4 name to Log(base2) of R/G Normalized Ratio (Median).
Set the filter #4 operation to a value of >=.
Set the filter #4 value to a value of 3.

With regard to the data presented, it should be recognized that this data browsing functionality is being used to ask LAD to do more than simply display data. The options available have been used to ask a real biological question. It is important to remember that the end goal of any data analysis task is exactly that: to elucidate the biological phenomena behind the data. In this example LAD has been asked to show high-quality spot data, joined dynamically with genomic annotations, sorted in order of chromosome, where the expression of the transcripts represented by each of the spots is significantly up-regulated. The sort is accomplished through use of the Sort By drop-down list. The in-line genomic annotations are installed with LAD and are made viewable through the extra box of annotation columns. Selecting any or all of them automatically causes their parallel inclusion with the raw data columns that are selected for browsing. The filtering for high-quality spots is accomplished by the activation of default filters #1, #2, and #3. The first default filter drops all spots that were flagged as bad during the GenePix Pro gridding process. The second and third filters ensure that the per-channel signal intensity is above some minimal threshold.
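Filter #4's log2 cutoff translates to a linear fold change as 2 raised to the log ratio. A quick illustration (the function name and spot labels are ours, for demonstration only):

```python
def fold_change(log2_ratio: float) -> float:
    """Convert a log2(R/G) ratio to a linear fold change."""
    return 2.0 ** log2_ratio

# A spot passing filter #4 (log2 ratio >= 3) is at least 8-fold up-regulated,
# since 2 ** 3 == 8.
ratios = {"spotA": 3.2, "spotB": 0.0, "spotC": -1.0, "spotD": 3.0}
passing = {name: r for name, r in ratios.items() if r >= 3}
```

Note that a log2 ratio of 0 corresponds to no change (a fold change of 1), and negative log ratios correspond to down-regulation.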
The minimal threshold values provided are completely arbitrary and may or may not be applicable to a particular set of experimental data. Finally, the most up-regulated spots are selected through the inclusion of a custom filter #4. This filter only passes spots whose Log(base2) of R/G Normalized Ratio (Median) is greater than or equal to a value of 3. Because this ratio is expressed in log2, this actually translates to a 2^3, or 8-fold, up-regulation of the expression level of the experimental sample (red) relative to the control sample (green).

5. Press the Submit button to query the database based on all of these values.

The resulting screen will be similar to that pictured in Figure 7.10.13. Note the dynamic links that are available in addition to the experimental data presented:

Zoom: Shows individual spot data and genomic annotations
Whole: Shows spot location on the overall microarray
SGD: Dynamic link to the gene record in the Saccharomyces Genome Database (Cherry et al., 1998)

One should now experiment with other queries that may perhaps be more biologically meaningful for one's particular field or biological questions of interest.

Figure 7.10.13 Single experiment data query results.
Figure 7.10.14 Single experiment data view screen.

Experiment details viewing

6. Return to the experiment list screen as pictured in Figure 7.10.8. For the sample experiment uploaded previously (Basic Protocols 2 and 3), select the View icon. A screen will appear that is similar to the one pictured in Figure 7.10.14.

Here one has complete access to all of the experimental annotation provided when this experiment was submitted to the database. Additionally, there are links to visualization tools; these will be explored and described in step 7. Finally, in the Submitted Files section of this screen it is possible to access the original GenePix Pro files that were uploaded with this experiment.
These files may be retrieved at any time from LAD for the purpose of analysis in other toolsets. It is important to note that only the LAD user who submitted this experiment will have access to these links. Even members of the default group, who by definition have permission to query one's experimental data, cannot download these original data files.

Experiment visualizations: Data distribution

7. On the screen pictured in Figure 7.10.14, select the Data Distribution link. A browser window similar to the one pictured in Figure 7.10.15 will be displayed.

The plot shown can be extremely useful with respect to understanding the distribution of your data relative to the log-center of zero. Close the extra browser window that opened with the selection of this tool.

The plot is a histogram in which the vertical axis is an absolute count of spot frequency. The horizontal axis is a spread of both positive (up-regulated) and negative (down-regulated) normalized log-ratio bins. The computed normalization value shown is a direct function of the data present. In essence, this computed normalization value is a coefficient that was calculated when the experiment was uploaded to the database. Applied to every spot's red-channel value, the normalization coefficient serves to bring the overall normalized log-ratio distribution to an arithmetic mean of zero. Microarray data normalization is more completely discussed in Guidelines for Understanding Results, below. Please refer to that section for a more complete description of the concepts and caveats of data normalization.

Figure 7.10.15 Single experiment data distribution.
Figure 7.10.16 Single experiment plot data.

Experiment visualizations: Plot data

8. On the screen pictured in Figure 7.10.14, select the Plot Data link. A browser window similar to the one pictured in Figure 7.10.16 will be displayed.
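The computed normalization just described can be sketched numerically: a coefficient c applied to every red-channel value is chosen so that the mean of log2(c·R/G) over all spots is zero, which gives c = 2^(-mean(log2(R/G))). A minimal illustration (variable and function names are ours; LAD's actual computation may differ in detail, e.g. in how flagged spots are excluded):

```python
import math

def normalization_coefficient(red, green):
    """Coefficient c such that mean(log2(c * R / G)) == 0 across spots."""
    mean_log_ratio = sum(math.log2(r / g) for r, g in zip(red, green)) / len(red)
    return 2.0 ** (-mean_log_ratio)

# Three synthetic spots with raw red/green intensities:
red = [200.0, 800.0, 3200.0]
green = [100.0, 400.0, 400.0]

c = normalization_coefficient(red, green)
# Normalized log ratios, with c applied to the red channel only:
normalized = [math.log2(c * r / g) for r, g in zip(red, green)]
```

After applying c, the normalized log-ratio distribution is centered on an arithmetic mean of zero, matching the behavior described above.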
This visualization tool is utilized in two steps. First, as shown in Figure 7.10.16, one is presented with options to create a scatter plot of one data column versus another within the context of this single microarray experiment. Second, these options are used to actually generate the desired plot.

This scatter plot can have any GenePix Pro or LAD data value stored in the database on either axis. Additionally, these values can be rendered in either log2 or linear space with respect to the transformation of the data values. A common plot utilized in microarray data analysis is the MA plot (Dudoit et al., 2003). The MA plot demonstrates the relationship between the intensity of a spot and its log2 ratio. What relationship should these two variables have in a typical microarray experiment? Ideally, none. The net intensity of a spot should bear no relationship to the ratio of the individual wavelength values. Nonetheless, biases can and will occur and should be checked for through the use of visualization toolsets. A traditional MA plot consists of the scatter-plot rendering of:

log2(R/G) versus ½ log2(R × G)

9. This step demonstrates LAD's ability to closely approximate this scatter plot through the Plot Data functionality. Set the following values on the screen depicted in Figure 7.10.16:

For the X-Axis Scale select Log(2)
For the Values to Plot on X-Axis select the column SUM MEDIAN
For the Y-Axis Scale select Linear
For the Values to Plot on Y-Axis select the column LOG RAT2N MEDIAN.

10. Press the "Plot 'em!" button to create the scatter plot. The scatter plot that is rendered should be similar to the one pictured in Figure 7.10.17. Close the extra browser window that opened with the selection of this tool.
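The M and A quantities defined above can be computed directly from a spot's raw red and green intensities. A small sketch (ignoring normalization and background correction; the function name is ours):

```python
import math

def ma_point(red: float, green: float):
    """Return (M, A) for one spot: M = log2(R/G), A = 1/2 * log2(R * G)."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a

m, a = ma_point(8.0, 2.0)  # M = 2 (a 4-fold ratio); A = log2(16)/2 = 2
```

Plotting M on the vertical axis against A on the horizontal axis over all spots yields the MA plot; A is simply the average of the two channels' log2 intensities, which is why it serves as an overall intensity measure.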
The example plot shown in Figure 7.10.17 demonstrates the null relationship that one expects to find between the log-ratio and the sum of median intensities of the spots. One expects and observes a fairly flat distribution of log-ratio values, centered around zero. There are, of course, fewer spots at the higher range of spot intensity, but this is to be expected: there are always going to be more low-intensity spots than high-intensity spots. If there were a true relationship between log-ratio and spot intensity, one would see either an upswing or a downswing in the log-ratio as the plot extends out towards higher intensity values.

Figure 7.10.17 Plot produced via single experiment plot data screen (also see Fig. 7.10.16).
Figure 7.10.18 Single experiment ratios on array (sample experiment).

The Plot Data tool can, of course, be utilized to generate many other biologically meaningful combinations of scatter plots through the utilization of other column and numerical transformation values in the plot generation. It can also be used to detect spatial biases in ratios or intensities on the microarray.

Experiment visualizations: Ratios on array

11. Return to the screen pictured in Figure 7.10.14 and select the Ratios on Array link. A browser window similar to the one pictured in Figure 7.10.18 will be displayed.

This visualization tool is intended to aid in the location of spatial bias with respect to patterns of hybridization across the surface of the actual microarray chip. Aside from biases introduced through the intentional design of the microarray chip, one should not expect to see a correlation between the log-ratio of a spot and its geographic location on the array. This tool will aid in the detection of such biases. The data presented can be analyzed and manipulated in three ways (steps 12a, 12b, and 12c).

12a.
First, through manipulation of the thresholds required to render the blue, amber, and dim spots, one can use simple visual analysis to identify spatial bias. An example of significant spatial bias is depicted in Figure 7.10.19.

12b. Second, the ANOVA (analysis of variance) of the log-ratio as it relates to microarray chip sector can be used to expose a print-based bias with respect to the experimental data captured by this microarray experiment (Kerr et al., 2000).

Figure 7.10.19 Single experiment ratios on array with bias shown.

12c. Finally, the ANOVA of the log-ratio as it relates to the 384-well plate has the ability to expose experimental bias introduced by specific print plates during the fabrication of microarrays.

Experiment visualizations: Channel intensities

13. Return to the screen pictured in Figure 7.10.14 and select the Channel Intensities link. A browser window similar to the one pictured in Figure 7.10.20 will be displayed.

This tool is very similar to the Data Distribution tool that was utilized in step 7 of this protocol. There are two primary differences: (1) the histogram displays the distribution of net intensities of each channel (green and red) rather than the distribution of the log-ratio values; and (2) the data can be viewed as either normalized or prenormalized values. This is useful for visualizing the effect that data normalization has had upon your raw data.

14. On the screen depicted in Figure 7.10.20, change the "Channel 2 normalization" to a value of "Non-normalized." Press the Submit button to view the result.

The change may be rendered too quickly to enable one to discern the difference in the plot. The authors of this unit suggest that the Web browser's Back and Forward buttons be used to move quickly back and forth between the current screen and the previous one.
One should see a movement of the red channel histogram while the green channel remains steady. This is simply because the normalization coefficient is only applied to the red channel's raw data, never the green. For more information on microarray data normalization please see Guidelines for Understanding Results, below.

Figure 7.10.20 Single experiment channel intensities.
Figure 7.10.21 Select data grids (filters).

Data filters to gridded microarray image map

15. Return to the experiment list screen pictured in Figure 7.10.8. For the sample experiment uploaded in a previous protocol (Basic Protocols 2 and 3), select the View Array Image and Grids icon. A screen will appear that is similar to the one pictured in Figure 7.10.21.

This tool allows one to again explore the relationship between one's raw data and its distribution across the surface of the actual microarray image. Activate default filters #1, #2, and #3 (leaving the filters at their default values). Press the Submit button.

16. An image map of the entire microarray will now appear, as depicted in Figure 7.10.22. Spots that are surrounded by a box are the ones that passed the filter settings. Unboxed spots are spots that did not pass the filters provided. Individual spots are clickable in order to zoom to the spot-specific data for the experiment being browsed.

Figure 7.10.22 Data grids (results).

Experiment details editing

17. Return to the experiment list screen as pictured in Figure 7.10.8. For the sample experiment uploaded in a previous protocol (Basic Protocol 2 or 3), select the Edit icon. A screen will appear that is similar to the one pictured in Figure 7.10.10. This screen allows one to perform three functions.

a. First, it is possible to alter and resubmit any of the annotations provided when the experiment was submitted.
This is useful for maintaining a consistent and informative naming scheme as one begins to load more and more experiments into LAD. One might eventually adopt a consistent naming strategy, and it might therefore become desirable to revisit older experiments to bring them in line with that naming convention.

b. Second, it is possible to renormalize one's data with a new normalization coefficient if desired.

This can often be quite useful when one has utilized some other analysis program to determine a custom normalization coefficient.

c. Finally, the Remove Access and Add Access functions, not depicted in the figure but near the bottom of the screen, make it possible to either add or remove access permissions for this specific experiment.

Experiment access, the consequences of group membership, and the global access model implemented by LAD are fully described in Appendix B of this unit.

Experiment deletion

18. Return to the experiment list screen as pictured in Figure 7.10.8. The Delete icon brings up a screen that makes it possible to permanently delete a particular experiment.

This function is to be used with extreme caution. Deletion of an experiment not only expunges its data from the database but also causes the deletion of the archived GPR, GPS, and TIFF files. One should only delete an experiment if one truly intends to permanently remove it from the database.

BASIC PROTOCOL 5

MULTIEXPERIMENT DATA ANALYSIS

Thus far it has been shown that LAD, like many microarray databases, is an extremely powerful software environment. It provides solutions to a vast number of problem domains that are encountered in microarray data analysis. How do I safely store all of my experiments in a central repository? How do I keep the experiments functionally organized and annotated over time? How do I provide secure yet flexible data access to a group of possibly geographically disparate research scientists?
How can I systematically analyze my experiments to detect false positives, experimental artifacts, and microarray fabrication biases, and to uncover biologically meaningful results? Many of these questions have been addressed in the preceding protocols. Where microarray databases truly come into their own, however, is in the analysis of multiple microarray experiments. The intrinsic ability of LAD, powered by its relational database, to filter, annotate, and analyze thousands of genes across a large number of related experiments makes it a powerful analysis tool for obtaining biological insights from microarray data. In this protocol, what is often called the pipeline of microarray analysis will be described. This pipeline is the process by which one takes more than one experiment and works with the combined dataset as a unit.

Necessary Resources

Hardware

Computer capable of running an up-to-date Web browser

Software

Web browser: Microsoft Internet Explorer version 6.0 or higher, or Mozilla version 1.6 or higher

1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).

2. Prepare multiple data sets (Support Protocol).

3. From the LAD main menu as depicted in Figure 7.10.2, select the Results Search link. The familiar experiment search filters screen (Fig. 7.10.7) will appear. Make sure the filter radio button is set to a value of Use Method 1. Change no filter settings. At the bottom of the screen, press the Data Retrieval and Analysis button. A screen will appear (not depicted) that makes it possible to select multiple experiments for parallel analysis. In the selection box entitled “Select experiment names from the following list” on that screen, select a few experiments for analysis. Please note that all experiments must be from the same organism in order to use them in the multiple experiment analysis pipeline.

Gene selection and annotation

4.
On the experiment selection screen (not depicted) mentioned in the previous step, press the Data Retrieval and Analysis button to proceed into the multiple experiment analysis pipeline. The first step of the process is a set of filters that specifically handle the selection of genes and external genomic annotations. The gene and annotation selection screen is shown in Figure 7.10.23.

5. First, specify genes or clones for which to retrieve results. This substep of the analysis pipeline allows one to provide either a text file or a typed list containing the specific gene names to which one wishes to constrain the analysis. Select the All radio button; selecting All indicates that one does not wish to perform any filtering of genes in this manner. As shown, the specific list of genes can be provided in one of two ways.

a. A file can be placed in one’s own user genelists directory on the LAD server. This list contains the sequence names that were given to the genes when the sequences were uploaded to the database. Please consult Appendix B for further details on sequence names.

b. A list of gene names can be placed in the provided text area, each delimited by two colons.

6. Next, decide whether, and how, to collapse the data. For this example, leave the default selection. As indicated, this default behavior will cause LAD to detect duplicate gene names within the context of a single experiment and average the values of those spots accordingly.

Figure 7.10.23 Gene annotation filters.

7. Next, choose the contents of the UID column of the output file. For this example, leave this box unchecked. This option will simply include the database keys for each of the spots in the output file of this process.

8. Next, choose the desired biological annotation.
Select Function and Chromosome by holding down either the Ctrl key (Windows) or the Cmd key (Macintosh) while clicking the selections. This selection box makes it possible to pick and choose from the biological annotations available for the organism currently being analyzed. Note that if a custom genelist is used, one can choose to include a second column of custom annotations that will be retained if the “Genelist annotation” radio button is selected.

9. Finally, choose a label for each array/hybridization by selecting one of the radio buttons. This simply selects how the experiments are labeled in the final output file. Press the Proceed to Data Filtering button to bring up the raw data filters screen (Fig. 7.10.24).

Raw data filters

The raw data filters are similar to the ones seen several times elsewhere within the LAD environment. In the following steps LAD is once again selecting for genes that meet some specific criteria—this time the criteria are based upon their actual data values. In this example, most of the default selections will be used.

10. First, choose the data column to retrieve; for this example, select Log(base2) of R/G Normalized Ratio (Mean). In terms of downstream analysis, this is nearly always the most interesting value to select for each gene and experiment combination. One may sometimes wish to use LAD, however, to aggregate a very different value across a set of experiments. Please note that use of values other than ratios will disable many subsequent analysis and visualization options. One example is hierarchical clustering analysis, a tool that assumes it is receiving log-transformed ratio values.

Figure 7.10.24 Raw data filters.

11. Next, decide whether to filter by spot flag. For the example here, leave this option selected. We do not wish to include spots that have previously been flagged as bad in each of the experiments.
This flagging occurred during the GenePix Pro gridding process.

12. Next, select criteria for spots to be selected. Activate filters #1, #2, #3, #4, and #5 by checking the corresponding check boxes. Note that in this and many of the other features in LAD where raw data filters are present, the ability to construct a unique Boolean combination of filters is provided. By default, filters operate with an implicit Boolean AND logic—a spot passes only if it satisfies every activated filter. Alternatively, the Filter String text box below the filter definitions in Figure 7.10.24 can be utilized to design a unique Boolean combination of activated filters, allowing for the creation of more complex queries. For example, one could choose to create the following for the filters activated here:

1 AND ((2 OR 3) AND (4 OR 5))

13. Last, decide on some image presentation options. For this example, leave “Retrieve spot coordinates” selected and leave “Show all spots” not selected. The “Retrieve spot coordinates” option will make it possible to see actual spot images in parallel with synthetic hierarchical cluster spot coloring later in the analysis process. Proceed to the next step by pressing the Proceed to Gene Filtering button.

Data transformation and gene filters

The remaining steps describe the final stage of the filtering process. In summary, specific genes of interest were first selected, together with genomic annotations. Next, raw data filters were used to select for genes that meet quantitative thresholds with respect to dozens of data metrics on a per-array basis. Now, at the final step, filters and transformations are applied to a single chosen data column (typically a log-ratio) across the set of selected experiments.

Figure 7.10.25 Final filters and data transformations.
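The gene-level cutoff and mean-centering options applied in the following steps can be sketched in a few lines. This is an illustrative sketch with made-up values, not LAD code; `None` marks a spot that failed an earlier per-array filter.

```python
# Illustrative sketch (not LAD code): gene-level filtering and
# centering of log2 ratios. Rows are genes, columns are arrays;
# None marks a spot that failed an earlier per-array filter.
data = {
    "geneA": [3.2, None, 3.9],
    "geneB": [0.4, 0.5, 0.2],
}

# Cutoff filter: keep genes whose |log2 ratio| exceeds 3 on at
# least 1 array (mirroring the example settings in step 17).
kept = {
    g: vals for g, vals in data.items()
    if sum(1 for v in vals if v is not None and abs(v) > 3) >= 1
}

# Mean-centering by gene (step 16): subtract each gene's mean so
# the average ratio per gene becomes zero.
centered = {}
for g, vals in kept.items():
    present = [v for v in vals if v is not None]
    m = sum(present) / len(present)
    centered[g] = [None if v is None else round(v - m, 3) for v in vals]

print(sorted(kept))        # geneB is dropped by the cutoff
print(centered["geneA"])   # [-0.35, None, 0.35]
```

Centering by array is the same operation applied to columns instead of rows; LAD performs both transparently once the corresponding check boxes are selected.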
The key distinction between this step and the previous filtering step is that the previous step applies filters to values on each array—if a spot fails the filter on a given array, that value becomes a missing value, but there may still be data for the same gene on another array. In this step, filters are applied to a given spot across all arrays—if a spot fails a filter, it is excluded from subsequent analysis. The final data filter and transformation screen is pictured in Figure 7.10.25.

14. Under the set of options labeled “First, choose one of these methods to filter genes based on data distribution,” select the radio button for “Do not filter genes on the basis of data distribution.” Percentile ranking is often quite valuable for certain microarray applications, such as the ChIP-chip technique for studying protein-DNA interactions (Ren et al., 2000; Iyer et al., 2001). For the purposes of this example, a filter that is available further down this screen will be utilized instead.

15. Next, choose whether to filter genes and arrays based on the amount of data passing the spot filter criteria. For this example, leave both “Only use genes with greater than 80% good data” and “Only use arrays with greater than 80% good data” not selected. These filters are typically utilized to ensure that a given gene has a sufficient number of data values across the set of experiments and that a given microarray has a sufficient number of data values across all genes. Including genes or microarrays with too few data points may cause noisy data to cluster together in subsequent steps, even though there is no underlying basis for the clustering. It is important to remember, however, that these filters can inadvertently drop genes or arrays that have true biological value. Care must be taken when applying quality controls of this nature.

16.
Next, decide whether to center the data. For this example, leave “Center data for each gene by: Means,” “Center data for each array by: Means,” and “Don’t iterate” unselected. This option is only relevant if one is retrieving log-transformed data values. Centering gene data by means or medians changes each ratio value such that the average ratio for each gene will be zero after centering. Likewise, centering by arrays transforms the data such that the average ratio, in the context of a single experiment, becomes zero. Stated differently, centering is a generic transformation that subtracts the mean or the median of either the row of gene values or the column of array values from each individual gene or array value. Gene centering is typically utilized when one has a set of microarray experiments that have all used a common reference sample in the hybridization (Eisen et al., 1998).

17. Next, select a method to filter genes based on data values. For this example, select the radio button labeled “Cutoff: select genes whose Log(base2) of R/G Normalized Ratio (Mean) is (absolute value >) 3 for at least 1 array(s).” This filter makes it possible to eliminate all but the most significantly up-regulated and down-regulated genes based on the extent of differential expression. The first part of the filter implements what is often called a fold-change cutoff, i.e., the elimination of genes that do not show a significant amount of either up- or down-regulation relative to a reference. The second part of the filter, the “for at least X arrays” part, makes it possible to impose a requirement for consistent change in expression. If ten microarray experiments were being analyzed in this example, the value could be changed from 1 to 5. Genes would then be required not only to be significantly up- or down-regulated, but to be so in at least half of the experiments analyzed.

18.
Finally, decide whether to timepoint-transform the data. For this example, leave “Transform all data by the following experiment” unselected. This option is useful when analyzing time-course experiments. Often, in this situation, one is interested in the relative expression profiles of genes with respect to some time zero. When this option is selected, the gene values of the indicated experiment are subtracted from (in log space) or divided into (in linear space) the gene values of each of the other experiments.

19. Press the Retrieve Data button to apply all of these filters to the experiments selected.

SUPPORT PROTOCOL: SYNTHETICALLY GENERATE MULTIPLE EXPERIMENTS FOR BASIC PROTOCOL 5

If genuine multiple experiments are available, they should be used; the process described in Basic Protocol 5 will be much more interesting in this case. If this is a new LAD installation and one has only the single experiment that was uploaded in Basic Protocol 2, the following steps should be performed to synthetically create a set of experiments in the database.

Necessary Resources

Hardware

Computer capable of running an up-to-date Web browser

Software

Web browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version 1.6 or higher

1. Re-execute the Experiment Submission protocol (Basic Protocol 2) several times. It is recommended that at least three additional experiments be submitted to supplement the single experiment already submitted. If using the single set of sample experiment files that are provided with LAD, it is possible to utilize them multiple times with one modification to the procedure. Rather than utilize a Computed normalization factor (see Basic Protocol 2, step 6), provide a unique User-Defined normalization factor for each additional experiment submitted.
Simply make up a normalization factor between 0 and 10 and provide that unique number with each additional experiment submitted.

2. Once the new submissions are fully loaded into the database, restart Basic Protocol 5 at the beginning to utilize the newly submitted experiments, ensuring that a few experiments are selected for analysis.

BASIC PROTOCOL 6: MANIPULATING STORED DATASETS AND EXPORTABLE FILES

The result of the extraction process from Basic Protocol 5 will look similar to Figure 7.10.26. This screen is broken into two primary zones of interest. First, as the data are retrieved, filtered, transformed, and annotated, the results are printed to the browser window as a log of system activity. The extraction of data from many experiments can be slow, depending on the hardware used. Additionally, if the filters that are used do not eliminate many genes, the annotation of the resulting gene set can also be slow.

Figure 7.10.26 Data extraction results.

Links on the “Retrieving Data” Screen

The second zone of interest is the array of links provided at the bottom of the screen shown in Figure 7.10.26, which are discussed individually in the following paragraphs.

Download PreClustering File

This link is to a file that is often called the PCL file (for preclustering). In essence, it is a table of tab-delimited text information. Gene records are present as rows within the file, while columns define the individual array experiments selected for analysis. This file is the input to both LAD’s hierarchical clustering functionality and many other stand-alone analysis applications.

Download Spot Location File

This file can be considered a sister file to the PCL. It is not terribly meaningful on its own but can become very useful when combined with the PCL file. These files will be utilized together in the Cluster Regeneration section of this protocol.
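As a rough illustration of the layout just described (gene rows, experiment columns), the following sketch parses a tiny PCL-style table. The UID/NAME/GWEIGHT columns and the EWEIGHT row shown here follow the common Stanford PCL convention and are an assumption; verify the details against files actually exported by one’s own LAD installation.

```python
import csv
import io

# A tiny PCL-style table: a header row, an EWEIGHT row, then one
# gene record. Tab-delimited throughout.
pcl_text = (
    "UID\tNAME\tGWEIGHT\texpt1\texpt2\n"
    "EWEIGHT\t\t\t1\t1\n"
    "YAL001C\tTFC3\t1\t0.5\t-1.2\n"
)

reader = csv.reader(io.StringIO(pcl_text), delimiter="\t")
header = next(reader)
experiments = header[3:]                      # experiment columns start at column 4
rows = [r for r in reader if r[0] != "EWEIGHT"]
table = {r[0]: [float(v) for v in r[3:]] for r in rows}

print(experiments)   # ['expt1', 'expt2']
print(table)         # {'YAL001C': [0.5, -1.2]}
```

Because the PCL file is plain tab-delimited text, the same few lines work whether the file came from LAD’s download link or was saved locally for Cluster Regeneration.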
Download Tab Delimited File

This file is a version of the PCL file in which any genomic annotations present have been separated into their own tab-delimited columns rather than compressed into one column. For this reason, it is not a valid PCL file. This file is excellent for use in spreadsheet-type applications such as Microsoft Excel.

Save This Dataset

This link navigates to a LAD function that allows one to save the filtered dataset to the database for analysis at a later time. This function is described later in this protocol.

Clustering and Image Generation

This link navigates to LAD’s intrinsic hierarchical clustering and visualization toolset. This function is described later in this protocol.

Cluster Regeneration

As mentioned above, the PreClustering and Spot Location files are the input to the hierarchical clustering process. Often, one may wish to right-click on these links to save these files to one’s local computer. It is then possible to utilize the Cluster Regeneration link on the LAD main menu (as seen in Fig. 7.10.2) to reupload these files for immediate entry into the hierarchical clustering toolset.

Stored Datasets

The steps required to properly apply filters for the extraction of multiexperiment data can be a time-consuming and error-prone process. Once one has settled on an appropriate set of filters that work, one may want to save the dataset, as extracted, for later analysis by hierarchical clustering. Stored datasets provide this functionality. Stored datasets can be considered a more sophisticated solution to the same problem described above under Cluster Regeneration. Rather than having to save the PCL and SPOT files to the local computer and annotate them appropriately, one can simply save the dataset to the LAD database with a meaningful name and description. The remainder of this section will consist of walking through this process, as a gateway to hierarchical cluster analysis. In the data retrieval page (Fig.
7.10.26), select the Save This Dataset link. A screen will appear that is similar to the one pictured in Figure 7.10.27. Note that a list of any datasets previously saved to the database is now visible. For the Name field, set a value; for this example, choose Multi-experiment Dataset. For the Description field, set a value; for this example, type: The first dataset I created. Press the Create Dataset button to save the dataset to the LAD database. A screen confirming the creation will appear. This dataset can now be retrieved at any time through the Stored Datasets item on the LAD main menu (as pictured in Fig. 7.10.2).

Figure 7.10.27 “Save this dataset” screen.

Figure 7.10.28 Clustering and image generation—setup.

Hierarchical clustering

Return to the data retrieval page (Fig. 7.10.26) and select the Clustering and Image Generation link. A screen will appear that is similar to the one pictured in Figure 7.10.28. This screen is the setup for hierarchical clustering in LAD. Hierarchical clustering is a powerful analytical method for the detection of differential and similar gene expression in DNA microarray data (Eisen et al., 1998). There are several configuration options available for this operation. The For Gene and For Experiment options near the top of Figure 7.10.28 select whether or not the data will be hierarchically clustered in each of these dimensions. Additionally, if one allows for gene and/or experiment clustering, one has the option to either center or not center the data during the analytical process. This is similar to the data centering discussed during the data extraction step of Basic Protocol 5 (step 16). The Use option determines the distance metric that will be used by the hierarchical clustering process.
LAD offers two distance metrics to measure the similarity between data vectors. The Pearson correlation is insensitive to the amplitude of changes between two data vectors, focusing instead on the similarity of their direction in vector space. The Euclidean distance measures the distance between two points in space and is thus sensitive to the total amplitude of the data vectors under consideration. In addition to variables that direct the hierarchical clustering process, the LAD preclustering setup screen (Fig. 7.10.28) contains options that allow for the manipulation of the final cluster diagram. The Contrast for Image option makes it possible to control the sensitivity of the coloring scheme to the level of highly expressed and repressed gene sets. Lower contrast values lead to a more sensitive coloring scheme, in which even modest differences in expression yield intensely colored spots; with higher contrast values, only dramatic differences in expression yield intensely colored spots. Additionally, options are available that allow one to specifically determine the color of missing data spots as well as the overall color scheme (red/green or blue/yellow). Select the following options to generate a hierarchical cluster diagram:

For Gene: select Non-centered metric
For Experiment: select No experiment clustering
For Use: select Pearson correlation
For Make: select Hierarchical cluster
For Contrast for Image: select 2.5
For RGB color for missing data: select 75% grey
Color Scheme: select red/green
Check the “Show spot images” check box
Check the “Break up images” check box
For “After how many genes”: type 200

Press the Submit Query button to begin the process of cluster creation. A screen similar to the one pictured in Figure 7.10.29 will appear.

Figure 7.10.29 Clustering and image generation—generated.

Figure 7.10.30 Clustering and image generation—generated.
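The difference between the two distance metrics described above can be seen in a small sketch (illustrative only, not LAD’s internal implementation): two vectors pointing in the same direction but with different amplitudes are perfectly correlated yet far apart in Euclidean terms.

```python
import math

# Two data vectors with the same direction but different amplitude.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

def pearson(x, y):
    """Centered Pearson correlation of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((i - mx) * (j - my) for i, j in zip(x, y))
    den = math.sqrt(sum((i - mx) ** 2 for i in x) *
                    sum((j - my) ** 2 for j in y))
    return num / den

def euclidean(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((i - j) ** 2 for i, j in zip(x, y)))

print(pearson(a, b))     # 1.0: amplitude-insensitive
print(euclidean(a, b))   # about 3.74: amplitude-sensitive
```

With the Pearson correlation these two expression profiles would cluster together; with the Euclidean distance they would not necessarily do so, which is why the choice of metric matters for the final diagram.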
This screen is an intermediate step in visualizing the cluster diagram. It includes a thumbnail image of the generated cluster as well as links to the full diagram. On the page pictured in Figure 7.10.29, select the View Cluster Diagram with Spot Images link. A screen similar to the one pictured in Figure 7.10.30 will appear. On this screen, one can see that the clustering algorithm has separated up-regulated from down-regulated genes. Additionally, one can see that it is now possible to visualize the spot images adjacent to the synthetic colors of the cluster diagram. Spots that perhaps should have been flagged “bad” may be selected and flagged in the single-spot screen that will be shown. This will prevent their reappearance in subsequent cluster analyses.

BASIC PROTOCOL 7: CREATING ARRAY LISTS

It is quite common for a researcher to use the same experiments over and over in the data-analysis process. Often, a microarray researcher will spend significant time performing the actual array hybridizations in order to complete the full spectrum of experiments needed to define a complete research project or paper. At that point, the experiments are uploaded en masse and attention turns to the task of data analysis. Because researchers often want to work with experiments as a defined group, and not repeatedly search for and aggregate them manually, LAD has a function called array lists.

Necessary Resources

Hardware

Computer capable of running an up-to-date Web browser

Software

Web browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version 1.6 or higher

Software preparation

1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).

Array list creation

2.
From the LAD main menu as depicted in Figure 7.10.2, select the Results Search link. The familiar experiment search filters screen (similar to Fig. 7.10.7) will appear.

3. Make sure the filter radio button is set to a value of Use Method 1. Change no filter settings. At the bottom of the screen, press the Data Retrieval and Analysis button.

4. In the selection box entitled “Select experiment names from the following list,” select a few experiments. Press the Create Array List button. A screen will appear that is similar to the one pictured in Figure 7.10.31.

5. In the field shown in Figure 7.10.31 titled “Enter a name for your arraylist:”, enter example-list. From the list entitled Starting List of Experiments, select each of the previously selected experiments. Use the “> Add >” button to move these experiments to the list on the right-hand side of the page (Experiments Included within Array List).

Figure 7.10.31 Array list creation.

6. At the top of the screen, press the Create Array List button. Close this extra browser window. In the browser window remaining, select the Longhorn Array Database link at the top of the screen. There should now be an array list consisting of the selected experiments.

7. Use Method 3 from the previously discussed Experiment Search screen (Fig. 7.10.7) to utilize this array list.

BASIC PROTOCOL 8: OPEN DATA PUBLICATION

One of the most useful features of LAD is the ability to host open data publications. In addition to robust and intuitive analysis tools and a proven architecture for the warehousing of thousands of microarray experiments, users can utilize LAD to provide open access to their primary experimental data. In short, a publication is an aggregated group of experiments that is associated with a research journal publication.
Publications and their associated data can be accessed by the public through a secure portion of LAD that provides limited but powerful functionality.

Necessary Resources

Hardware

Computer capable of running an up-to-date Web browser

Software

Web browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version 1.6 or higher

View sample publication

1. To see what an open-access publication in LAD looks like, browse the LAD server at the University of Texas at Austin by opening a new browser window and navigating to the URL http://www.iyerlab.org/hsf. This page is dedicated to supplementary data for a publication detailing the use of DNA microarrays to characterize the binding patterns of the transcription factor HSF in Saccharomyces cerevisiae (Hahn et al., 2004).

2. From this page select the “raw data” link. This link navigates directly to the LAD publication that provides direct access to the experimental data discussed and analyzed within the paper.

Software preparation

3. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).

Publication creation

4. The first step in publication creation is modification of the access permissions for the experiments that will be aggregated together to define the publication itself. More specifically, one needs to grant access to one’s experiments to a special WORLD user. This user is created during the LAD installation. From the LAD main menu (Fig. 7.10.2) select the Add Experiment Access By Batch link. A screen similar to the one in Figure 7.10.32 will appear.

Figure 7.10.32 Add experiment access by batch.

This screen takes advantage of array lists to make it possible to rapidly grant WORLD access to a large number of experiments.
In this example, the array list example-list that was created in Basic Protocol 7 will be used.

5. Set the “Choose an arraylist” pull-down menu to a value of “example-list.” In the User pull-down menu, select WORLD.

6. Press the Check Access button to proceed to the next step. A confirmation screen will appear. Press the Add Access button to grant WORLD access to the experiments in this array list. A second confirmation screen will appear. Select the Longhorn Array Database link at the top of this screen.

Experiment set creation

Now that the experiments to be bundled into a publication have been modified for public access, it is time to aggregate them into a package. This package is known as an Experiment Set.

7. From the LAD main menu as depicted in Figure 7.10.2, select the Results Search link. The familiar experiment search filters screen (Fig. 7.10.7) will appear.

8. Change the filter radio button to a value of Use Method 3. In the array list pull-down menu to the right of this option, select the array list “example-list.” At the bottom of the screen, press the Data Retrieval and Analysis button. In the selection box entitled “Select experiment names from the following list,” select each of the experiments displayed.

9. Press the Create Experiment Set button. A screen will appear similar to the one pictured in Figure 7.10.33. Use the “> Add >” button to move the selected experiments to the Experiments Included Within the Set list.

10. At the bottom of the screen, press the Save Experiment Set button. A screen will appear similar to the one pictured in Figure 7.10.34.

Figure 7.10.33 Experiment set creation.

Figure 7.10.34 Experiment set creation, part 2.

11. On the page depicted in Figure 7.10.34:

Enter Test Experiment Set in the Experiment Set Name box
Set Cluster Weight to a value of 1.0
Enter Creation of a sample experiment set in the Experiment Set Description box
Select YES for “Do you want to publish this experiment set?”

Press the Create Experiment Set button to complete the process. A confirmation screen will appear. Press the Close Window button on the confirmation screen.

12. In the browser window remaining, select the Longhorn Array Database link at the top of the screen.

Figure 7.10.35 Publication creation.

Publication creation

13. With the experiments of interest granted WORLD access and bundled into an experiment set, it is now possible to create the publication itself. From the LAD main menu as depicted in Figure 7.10.2, select the Publication link (under Enter Data). A screen will appear similar to the one pictured in Figure 7.10.35. It is recommended that one create publications within LAD when one actually has a PubMed ID and citation for the journal publication. For the purposes of this protocol, the PubMed ID of the LAD publication itself will be used.

14. In the World-Viewable Experiment Sets list, select the experiment set that was just created, Test Experiment Set. Set the PubMed ID field to a value of 12930545.

15. Press the “Fill the fields below” button to populate the screen with PubMed information. With the fields populated by the remotely queried information, it is possible to proceed with the creation of the publication. Press the Create Publication button to create the publication in the database. A confirmation screen will appear informing the user that the publication has been created. Select the Longhorn Array Database link at the top of the screen.

Publication browsing

16. To view the publication just created, select the Publications link (under Search) from the LAD main menu (Fig.
7.10.2). This will bring up a list of the publications created in LAD.

17. Browse the links and functions provided to get an idea of the information exposed to the public through the creation of publications.

The primary link to a publication is of the form http://your_lad_server/cgi-bin/SMD/publication/viewPublication.pl?pub_no=1. It is highly recommended, however, that this link never be provided to any journal for inclusion in the supplementary data section of a publication. This URL structure may change in future releases of LAD, or one’s server name may change due to network reorganizations. For this reason and many others, one should follow the strategy demonstrated in steps 1 and 2 of this protocol. In the example viewed there, the authors set up a static location to host the supplementary data Web site. Part of this Web site is a link to the publication in LAD. Because one will retain control of the page that links to the LAD publication, one will always have the freedom to modify it. If the direct link to the LAD publication were included in the written publication itself, it would be documented that way forever, unchangeable, and could become incorrect and out of date.

GUIDELINES FOR UNDERSTANDING RESULTS

Nomenclature

A GenePix Pro GPR file has been described as the source of primary microarray data with respect to a single microarray experiment. What values reside in this file, however, and how are these files used to describe the qualitative and quantitative aspects of a microarray experiment? A GPR file is a collection of over fifty distinct values for each spot on the microarray used in an experiment. Feature intensities, background intensities, sums of intensities, spot regression metrics, spot flags, mean/median intensities, and intensity ratios are but a few of the values provided. Some of these values are reflections of primary data acquired during the scanning process.
Other values are informative statistical measures of the primary data. It is important to recognize that even many of the primary data values are products of processed and aggregated data values. For example, during the microarray scanning process, a spot is divided into a grid of dozens of individual pixels. These pixels are numerically combined to produce spot intensities. These intensities are then mathematically combined to form the ratio values that comprise many of the distinct values present for each spot in the GPR file. Each of these values eventually becomes a data column with respect to its final appearance in LAD. Additionally, LAD manipulates many data values to produce distinctly new data metrics that are only available after an experiment has been processed into the LAD database. Normalized intensities and normalized ratios are just two examples of these unique LAD-produced data metrics.

Within the context of microarray data analysis, ratio values are often transformed to log-ratio values. The purpose of this operation is twofold. First, in linear ratio space, the ratio values are centered around a value of 1.0. All genes that are up-regulated will have values above 1.0, with no clear upper bound, while all genes that are down-regulated will have ratio values compressed between 0 and 1.0. This makes the distribution of linear ratio values numerically asymmetric. By transforming all ratio values to log space (base 2), the data become symmetrically distributed around zero. Genes that are up-regulated two-fold are now at a value of +1, while two-fold repression is quantified by a value of –1. One can now see that log transformation has created a second benefit: it is possible to quickly discern up-regulated genes from repressed genes simply by their numerical sign.
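The symmetry created by the log transformation can be illustrated with a short Python sketch (this is not part of LAD; the intensity values are hypothetical):

```python
import math

def log_ratio(cy5_intensity, cy3_intensity):
    """Convert a linear two-channel intensity ratio to a base-2 log-ratio."""
    return math.log2(cy5_intensity / cy3_intensity)

# Two-fold induction and two-fold repression, asymmetric in linear space
# (ratios of 2.0 versus 0.5), become symmetric in log2 space:
print(log_ratio(2000, 1000))  # 1.0
print(log_ratio(1000, 2000))  # -1.0
print(log_ratio(1500, 1500))  # 0.0 (unchanged expression)
```

Note that the sign of the log-ratio alone distinguishes induced from repressed genes.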
Microarray Data Normalization

Data from a single microarray experiment can be affected by systematic variation. Printing biases such as plate and print-tip influences, total starting sample material amounts, sample storage conditions, experimenter biases, dye effects, and scanner effects are just a few of the possible sources of variation. This variation can significantly affect the ability of a researcher to meaningfully compare the results of one microarray experiment with those of another. LAD uses the technique of data normalization in an attempt to cleanse the data of this variation (Yang et al., 2002). The goal of normalization is to remove nonbiological influences on the data, leaving only true biological phenomena behind for analysis.

During experiment submission (see Basic Protocol 2, step 6), a user has the ability to choose a normalization scheme for the experiment to be processed. One of two options is available. First, a user may provide a User-Defined normalization coefficient. This coefficient will be applied to the ratio values for each individual spot such that Normalized Ratio = Coefficient × Ratio. This normalization coefficient can come from many sources; for example, the use of positive controls can lead to the algorithmic computation of a normalization coefficient from the primary microarray data itself. A user might also decide to provide a value of 1.0 as a User-Defined normalization coefficient. This value would ensure that LAD does not apply normalization to the experiment in process, instead allowing the primary data to stand as recorded by GenePix Pro. The other option for LAD-based data normalization is for the user to select Computed as the Normalization Type. This selection will cause LAD to determine a global normalization coefficient for the experiment and numerically apply the coefficient as indicated above.
Global normalization is based upon the assumption that, in the average whole-genome microarray experiment, most of the genes on the microarray will not show differential expression. Expressed mathematically, this means that the expected median ratio value across all spots should be one. This reflects the biological phenomenon whereby only a small percentage of the total genome is differentially expressed in a given cell at a given time. When a user requests a Computed normalization coefficient, LAD will identify a subset of spot values that are considered high-quality according to GenePix Pro quality metrics, such as regression correlation and signal intensity more than one standard deviation above the average background intensity. The ratio values of these spots will then be averaged, providing a mathematical foundation for the calculation of a normalization coefficient. The actual calculation, described in the LAD online help, is slightly more involved because it effectively sets the median log-ratio to zero: if the median spot ratio value is >1, the normalization coefficient will be <1, and vice versa. This ensures that the final application of the normalization coefficient to the primary data will yield a data distribution that, when log-transformed, is centered around zero. Care must always be taken, when selecting Computed normalization, that the underlying assumption behind the computation of a global normalization coefficient is valid for the experiment being submitted.

COMMENTARY

Background Information

One of the most successful and proven publicly utilized microarray databases is the Stanford Microarray Database (SMD; http://genome-www.stanford.edu/microarray/; Sherlock et al., 2001).
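The essential idea behind the Computed option can be sketched in a few lines of Python. This is an illustration only, with invented ratio values; LAD's actual calculation restricts itself to the high-quality spot subset and is documented in the LAD online help.

```python
import statistics

def global_normalization_coefficient(ratios):
    """Choose a coefficient that rescales the spot ratios so their median
    returns to 1.0 (equivalently, the median log-ratio returns to 0),
    reflecting the assumption that most genes are unchanged."""
    return 1.0 / statistics.median(ratios)

# Suppose a dye or scanner bias inflated every ratio by 25%:
raw_ratios = [1.25, 2.5, 0.625, 1.25, 1.25]
coefficient = global_normalization_coefficient(raw_ratios)  # 0.8
normalized = [coefficient * r for r in raw_ratios]
print(statistics.median(normalized))  # median is now 1.0
```

Because the median raw ratio (1.25) is greater than 1, the computed coefficient (0.8) is less than 1, exactly as described above.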
SMD's source code has been freely available for some time, which in principle allows any researcher to install an SMD server within his or her own research environment. SMD in this form, however, is based on a proprietary hardware and software infrastructure that would require a significant capital expenditure from any laboratory wishing to operate such a server. In particular, SMD was designed and written to utilize the Oracle relational database-management system. The cost of initial investment and long-term ownership of these technologies is significantly higher than that of alternative open-source technology choices. Moreover, Oracle is not only expensive but also very demanding in terms of the expertise required of professional database administrators to maintain it as a piece of software infrastructure.

Given the numerous strengths and proven nature of SMD, the authors of this unit wanted to adapt it to run on a free, open-source, widely available, and powerful operating system and relational database. The combination of Linux and PostgreSQL was chosen to replace Solaris and Oracle. The Longhorn Array Database is the product of this effort (Killion et al., 2003); it is a completely open-source incarnation of SMD. Additionally, new features have been developed to enhance its warehousing and analytical capabilities.

Critical Parameters and Troubleshooting

Best practices for warehouse organization

As previously discussed, one of LAD's primary functions outside the domain of actual microarray data analysis is the organized warehousing of experiments. LAD has the capacity to store tens of thousands of experiments for a nearly unlimited number of users and groups. Achieving this scale of operation, however, requires that LAD system curators and users adopt and maintain certain operational standards.
Curators can encourage the organization of the database environment through the proper utilization of user accounts, user groups, experimental categories and subcategories, and microarray prints. First, it is highly recommended that curators always assign each distinct LAD user an individual user account. Additionally, it is suggested that users be divided into meaningful groups as the research environment dictates. These two practices ensure that a curator is always able to identify specific users who may be misusing or abusing account privileges, and that each user has no greater experiment access privileges than their research affiliation should allow.

In addition to the organization of users and their groups, the annotation of experiments is important with respect to the long-term organization of the LAD environment. Experiments should always be assigned meaningful and consistent names, descriptions, and channel descriptions. Assignment of experiments to proper categories and subcategories greatly enhances the search and location facilities that are provided for these experiment descriptors. During the submission of an experiment, slide names and numbers should always be carefully recorded to allow LAD to act as an inventory control mechanism for microarray utilization within a research environment.

Finally, print management is a key component of LAD best-practice implementation. The specific 384-well print plates used, as well as the order in which they are used, are often duplicated between microarray print runs. This can make it tempting to reuse the same print global resource for microarray slides that were not spotted at the same time. Curators are discouraged from allowing this user action.
Significant information can be mined from LAD with respect to print-induced data bias if, and only if, the primary data for a microarray experiment are correctly associated with slides that are truly members of the same print run.

Suggestions for Further Analysis

Data export to other analysis platforms

The majority of the protocols in this unit are dedicated to LAD's wide variety of intrinsic analytical toolsets. These toolsets are expansive and cover a significant breadth of the typical techniques used in microarray data analysis and quality control. The research field of DNA microarray analysis, nonetheless, is constantly exploring new data-mining techniques, visualization tools, and statistical-processing methodologies. How is LAD to interact successfully with these applications if the experimental data are sequestered in the relational database?

The export of microarray data is detailed within the steps of the protocols provided. Figure 7.10.12 demonstrates that single-experiment data analysis can include the export of data into flat tab-delimited text file and GFF formats. Additionally, Figure 7.10.14 details the part of LAD single-experiment manipulation that allows for the retrieval of the original GenePix Pro GPR file that contributed the primary experimental data. The multiexperiment analysis protocol (Basic Protocol 5) showed (see Fig. 7.10.26) that multiexperiment data are exportable from LAD through the use of the PreClustering and Tab Delimited text files. In these ways, LAD can be used in concert with new microarray analysis applications by providing for the retrieval of primary, filtered, or even annotated datasets for offline analysis and visualization.
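Because the exports are plain tab-delimited text, they are straightforward to load into external tools. The following Python sketch parses a hypothetical two-row excerpt of such an export; the column names here are invented for illustration and will vary with the export options chosen in LAD.

```python
import csv
import io

# Hypothetical excerpt of a tab-delimited LAD export (real exports carry
# many more columns; these column names are illustrative only).
export_text = (
    "SUID\tNAME\tLOG_RATIO\n"
    "1001\tYFL039C\t0.85\n"
    "1002\tYGR192C\t-1.40\n"
)

rows = list(csv.DictReader(io.StringIO(export_text), delimiter="\t"))
for row in rows:
    print(row["NAME"], float(row["LOG_RATIO"]))
```

In practice one would pass an open file handle for the downloaded export to `csv.DictReader` instead of the in-memory string used here.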
LITERATURE CITED

Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., and Vingron, M. 2001. Minimum information about a microarray experiment (MIAME): Toward standards for microarray data. Nat. Genet. 29:365-371.

Cherry, J.M., Adler, C., Ball, C., Chervitz, S.A., Dwight, S.S., Hester, E.T., Jia, Y., Juvik, G., Roe, T., Schroeder, M., Weng, S., and Botstein, D. 1998. SGD: Saccharomyces Genome Database. Nucl. Acids Res. 26:73-79.

Dudoit, S., Gentleman, R.C., and Quackenbush, J. 2003. Open source software for the analysis of microarray data. BioTechniques Suppl:45-51.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.

Hahn, J.S., Hu, Z., Thiele, D.J., and Iyer, V.R. 2004. Genome-wide analysis of the biology of stress responses through heat shock transcription factor. Mol. Cell. Biol. 24:5249-5256.

Iyer, V.R., Horak, C.E., Scafe, C.S., Botstein, D., Snyder, M., and Brown, P.O. 2001. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409:533-538.

Kerr, M.K., Martin, M., and Churchill, G.A. 2000. Analysis of variance for gene expression microarray data. J. Comput. Biol. 7:819-837.

Killion, P.J., Sherlock, G., and Iyer, V.R. 2003. The Longhorn Array Database (LAD): An open-source, MIAME compliant implementation of the Stanford Microarray Database (SMD). BMC Bioinformatics 4:32.

Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T.L., Wilson, C.J., Bell, S.P., and Young, R.A. 2000. Genome-wide location and function of DNA binding proteins. Science 290:2306-2309.
Sherlock, G., Hernandez-Boussard, T., Kasarskis, A., Binkley, G., Matese, J.C., Dwight, S.S., Kaloper, M., Weng, S., Jin, H., Ball, C.A., Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D., and Cherry, J.M. 2001. The Stanford Microarray Database. Nucl. Acids Res. 29:152-155.

Yang, Y.H., Dudoit, S., Lin, D.M., Peng, V., Ngai, J., and Speed, T.P. 2002. Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucl. Acids Res. 30:e15.

Contributed by Patrick J. Killion and Vishwanath R. Iyer
University of Texas at Austin
Austin, Texas

Appendices A, B, and C appear on following pages.

APPENDIX A: USER ACCOUNT CREATION

The default "curator" account is necessary to manage and configure LAD (see Appendix C). This section of the unit details the process by which other users can request an account and the "curator" can approve their request, thereby creating the requested account.

Necessary Resources

To perform the procedures described below, one needs a Web browser window at the LAD front page (see Basic Protocol 1 and Fig. 7.10.2). A terminal window on the LAD server, logged in as the Linux system user root, is also necessary.

User Creation

User creation within LAD generally happens as a three-step process. First, a prospective new user navigates to the page accessed via the "register" link on the LAD front page (Fig. 7.10.1) to request an account. Second, a system curator creates the account by accessing the saved information from within LAD. Finally, the server administrator creates a system account so that the new user will have file system space to upload experimental files.

User self-registration

New users may request an account that a system curator can either approve or deny.
By selecting the register link from the LAD front page (see Fig. 7.10.1 and Basic Protocol 1, step 1), one will obtain the screen shown in Figure 7.10.36.

Figure 7.10.36 LAD user registration page.

The following fields are required and must be provided by a prospective new user:

First Name
Last Name
Office/Lab Phone
Office/Lab FAX
Email Address
Institution
Project Description
Organism of Study
Laboratory

Complete the required fields and press Submit to store the prospective user information for consideration by the system curator.

Navigate to LAD login screen

Select the Login link on the LAD front page (Fig. 7.10.1) to proceed to the authentication gateway.

Authenticate with LAD

Utilize the LAD "curator" account that was created during the installation phase (Appendix C). Fill in the login screen (Fig. 7.10.37) with the following information:

User Name: curator
Password: the password assigned to "curator" during installation (Appendix C).

Figure 7.10.37 Login screen.

The LAD main menu as pictured in Figure 7.10.38 should now be visible. This view contains many options that the average LAD user will not see (only accounts with curator permissions will see the full administrative options; compare Fig. 7.10.38 to Fig. 7.10.2).

Figure 7.10.38 LAD main menu (curator).

Approval of the prospective user

After having logged in as the user "curator," from the screen pictured in Figure 7.10.38 select the menu item from the left column entitled Users. From the screen which then appears, a system curator may either create a user or approve a prospective user who has registered for account consideration. If there is a user record for consideration, there will be a message stating "The following users are still pending entry" with a link to prospective user records.
Select the link for the user record created in the initial step (see "User self-registration"). One of two actions may be performed. Pressing the Delete Record button enables the curator to reject the application, thereby sending an e-mail of account denial to the prospective user. Pressing the Submit button enables the curator to create the account for the prospective user, which will result in the new user being sent an e-mail of account approval. Note that in order to create the account one will need to provide a password for the new user, which will be included in the e-mail. It is also recommended that one carefully proofread and correct any other incorrect information the user has provided in the account application.

The approval screen also contains very important information regarding the LAD system permissions the new user will be granted:

Update Gene: will give the new user curator permissions.
Update Print: will allow a user to modify a print record.
Update User: will allow a user to update other user records.
Restricted User: will restrict a user to view-only permissions with respect to experimental data. This option should be unselected for any user who will be allowed to upload experiments.

A local user will typically be unrestricted; that is, they will be allowed to load experiments, but they will not have permission to update genes, prints, or users. These administrative functions are best left to a curator who can be responsible for maintaining consistency.

User System Accounts

As previously mentioned, every unrestricted user must be given a system account on the LAD server.

Creation of system account

LAD includes a script to help with the creation of these accounts. In order to utilize this script, one will need to have a terminal window open and must use the Linux su command to become the root user.
The script has the following usage:

/lad/install/addSysAccount.pl [username] [password] [home dirs]

Hypothetically, assuming that one created a user named pkillion with a password lad4me, one would now execute the command:

/lad/install/addSysAccount.pl pkillion lad4me /home

This example command assumes that user home directories are located under the directory /home. A new user account and its respective system account have now been created.

APPENDIX B: GLOBAL RESOURCE CREATION AND MANAGEMENT

LAD requires the creation and management of many global resources that are continually accessed throughout the system. This unit has already described how users are registered and approved. Users are just one type of record that defines the entire LAD system of interacting objects. Other resources such as user groups, organisms, plates, prints, print-tip configurations, experimental categories, and sequence types are but a few of the other objects that are critical to the organized storage and analysis of microarray data. This section of the unit details the most important of these global resources and provides examples of their creation and management.

Necessary Resources

Hardware
Computer capable of running an up-to-date Web browser

Software
Web browser: Microsoft Internet Explorer version 6.0 or higher, or Mozilla version 1.6 or higher
Terminal window to LAD server

Files
Sample files for the creation of yeast sequences, plates, and print. Each of these files is in the directory /lad/install/yeast_example_print/:
Sample SUID creation file [yeast_pre_suids.txt]
Sample plate creation file [yeast_retrieve_suids_into_plate.txt]

Experiment Group Management

In addition to the numerical analysis and visualization of microarray data, LAD serves as a very powerful environment for the long-term warehousing and organization of experimental results.
This may not seem like a very important function when one first begins using a microarray database. However, as the total number of database users grows, the frequency of experiment submission increases, experiment names and details start to become uninformative and inconsistent, and disorganization of results begins to set in. This appendix presents guidelines on how to locate one's experiments, organize them effectively, and establish best practices for the long-term organization of LAD as a functional warehouse of data.

Users and Groups

Users and groups can be considered examples of these global resources. It has already been described how users are created and utilized through the creation of the noncurator pkillion account. The following will demonstrate the mechanism by which new groups are created and how users can be assigned to and removed from experiment groups.

Experiment group creation

From the LAD main menu as depicted in Figure 7.10.38, select the Experiment Groups link. The screen depicted in Figure 7.10.39 will appear. On this screen enter Iyer Lab in the Group Name field. Enter Members of the Iyer Lab, University of Texas at Austin in the Description field. Press the Submit button to create this new experiment group. Select the Longhorn Array Database link at the top of the screen which then appears to return to the LAD main menu.

Figure 7.10.39 Experiment group creation entry form.

List users assigned to experiment group

From the LAD main menu as depicted in Figure 7.10.38, select the User Group link. It will be seen that LAD has four experiment groups defined:

Default Group
HS_CURATOR
SC_CURATOR
MM_CURATOR

The first group is the default group that is created for noncurator users by the LAD installation. The other groups were created by the LAD installation when their respective organisms were created.
The addition of organisms to the database will create subsequent experiment groups for curators of those organisms. A noncurator user would normally not be assigned to one of these experiment groups. From the LAD main menu as depicted in Figure 7.10.38, select the Default Group link. One should see that this group has two members, curator and pkillion. Select the Longhorn Array Database link at the top of the screen to return to the LAD main menu.

Consequences of group membership

What consequence does group membership have throughout the LAD system? Essentially, group membership controls which submitted experiments can be seen by a user. The basic rules are:

a. All users that are comembers of one's default group will be able to see one's experiments unless one specifically removes that permission.

b. Membership in additional groups (above and beyond one's default group) will allow one to see the experiments that have been submitted by users whose default group is that additional group.

User assignment to a default group

A user's default group can only be modified through their user profile. The following procedures are used to modify the default group for the pkillion user. From the LAD main menu as depicted in Figure 7.10.38, select the Users link. To the left of the pkillion user record, select the Edit icon. A screen appears that will allow one to edit the pkillion user. Change the Default Group to a value of Iyer Lab. Press the Submit button to change the default group. Select the Longhorn Array Database link at the top of the screen to return to the LAD main menu.

User assignment to additional experiment groups

As previously detailed, there may be times that one wishes to assign users to groups above and beyond their default group. From the LAD main menu as depicted in Figure 7.10.38, select the Users into Experiment Groups link. Select the UserID "pkillion" and select the Group Name Default Group. Press the Submit button to add the additional group.
Select the Longhorn Array Database link at the top of the screen to return to the LAD main menu.

User deletion from additional experiment groups

Additionally, there may be times that one wishes to remove users from their additional groups. From the LAD main menu as depicted in Figure 7.10.38, select the Users from Experiment Groups link. Select the UserID "pkillion." Press the Process to Group Selection button to see a list of groups of which this user is a member. Select the Group Name Default Group. Press the Submit button to remove the additional group. Select the Longhorn Array Database link at the top of the screen to return to the LAD main menu.

Figure 7.10.40 LAD global resources links from the LAD main menu (also see Fig. 7.10.38).

Complete user deletion from a specific group

Finally, there may be times that one wishes to remove all users from an additional group. From the LAD main menu as depicted in Figure 7.10.38, select the Remove All Users From An Experiment Group link. Select the Group Name Iyer Lab. Press the Submit button to remove all users from the Iyer Lab group. Select the Longhorn Array Database link at the top of the screen to return to the LAD main menu.

Preconfigured Global Resources

Many resources are preconfigured by the default relational database schema that is instantiated during LAD installation. Figure 7.10.40 shows a detail view of the links for these resources (which are accessible through the LAD main menu in Figure 7.10.38). By navigating to a few of these resources, the reader will become familiar with the elements that define the overall LAD environment. In order to complete the following steps, one will need a Web browser window logged into LAD as "curator," at the main menu (as depicted in Fig.
7.10.38), and a terminal window on the LAD server, logged in as the Linux system user "curator."

Organism List

Select the Organism List link from the LAD main menu (Fig. 7.10.38). Organisms are associated with nearly everything in the LAD environment; experiments, plates, prints, and sequences are just a few examples. One can see that LAD comes preloaded with Saccharomyces cerevisiae, Mus musculus, and Homo sapiens as default organisms.

Sequence Types

Click the browser Back button to return to the main menu. Select the Sequence Types link. Sequences are a key element with respect to data storage and analysis. In order to maintain annotation information for gene sequences on microarrays separately from the experimental values for spots, sequences are abstracted within LAD. This will be covered in much more detail under Sequence Creation, below. For now, it is important to recognize that LAD comes predefined to support a variety of sequence types, this being one of the variables that describes an individual sequence within the database.

Tip Configurations

Return to the main menu using the Back button and select the Tip Configurations link. A tip configuration is one of the subcomponents that help define a print and its respective geometry. Note that LAD comes predefined with support for 16-, 32-, and 48-pin tip configurations.

Categories

Return to the main menu and select the Categories link. Categories are utilized as a descriptor to help organize experiments. They can be utilized as search keys in many of the data-analysis pipelines. As with all of the global resources being demonstrated here, LAD system curators can freely add new categories to the database as desired.

Print Creation

Before experimental data can be uploaded into LAD, a print must be defined for the microarray chip that was used. The first question that one may ask is, "What is a print?"
A print, in essence, is an abstract declaration of everything that went into constructing the actual microarray chip that was utilized in an experiment. For most microarray chips this means, first, that DNA of some type was arrayed into 384-well printing plates. Each well of each of these plates most likely contains a unique sequence. These plates were used serially, in some order, by the microarray robot to print some number, perhaps hundreds, of duplicate slides in a unique geometry that depends upon the tip configuration utilized, as well as other variables controlled by the robot software. The following discussion will explore the LAD resources that are available to define each step of the process described above. This will be done using sample files included with LAD that aid in the creation of a sample print for a set of Saccharomyces cerevisiae microarray chips.

SUID creation

The first step in defining a print is to define all of the sequences that will be utilized in the print. Sequences within LAD are typically referred to as SUIDs. Each sequence that is uploaded gets a unique name and, once uploaded, need never be defined or uploaded again.

Uploading SUID file to server

Using the terminal window, copy the file /lad/install/yeast_example_print/yeast_pre_suids.txt to the location /home/curator/ORA-OUT/. This may be done with the following command:

cp -f /lad/install/yeast_example_print/yeast_pre_suids.txt /home/curator/ORA-OUT/

It is highly recommended that one inspect this file with a text editor at a later date in order to learn about its overall structure and content.

Uploading SUIDs to the database

On the LAD main menu page (Fig. 7.10.38), select the "Sequence IDs (SUIDs)(by file)" link. Select Saccharomyces cerevisiae from the Organism pull-down menu.
Edit the text field at the bottom of the screen to the value /home/curator/ORA-OUT/yeast_pre_suids.txt. Select the Assign SUIDs button to initiate the creation of sequences within the database. This process may take a few minutes to complete.

SUID extraction to plate creation file

In this appendix, a list of sequences has been submitted for storage to the database. In order to model a microarray print, it is now necessary to store a set of plates containing these sequences. There is one intermediate step that must be performed, however. When the sequences were uploaded to the database, each received a unique identifier that is independent of the name given to it. In order to upload plates, it is first necessary to take the file that contains the plate information and have the database associate the sequence names in it with the independent keys it now has for each of these sequences.

Plate file preparation

Using the terminal window, copy the file /lad/install/yeast_example_print/yeast_retrieve_suids_into_plate.txt to the location /home/curator/ORA-OUT/. This may be done with the following command:

cp -f /lad/install/yeast_example_print/yeast_retrieve_suids_into_plate.txt /home/curator/ORA-OUT/

It is highly recommended that one inspect this file with a text editor at a later date in order to learn about its overall structure and content.

Figure 7.10.41 LAD print creation links from the LAD main menu (also see Fig. 7.10.38).

Figure 7.10.41 displays the links that will now be visited in succession for the purpose of creating a print; these links are selected from the LAD main menu (Fig. 7.10.38).

SUID extraction into plate file

Select the Get SUIDs for Name in Plate File link on the LAD main menu page (Fig. 7.10.38). Select Saccharomyces cerevisiae from the Organism pull-down menu.
Edit the text field at the bottom of the screen to the value:
/home/curator/ORA-OUT/yeast_retrieve_suids_into_plate.txt
Select the Get SUIDs button to initiate the process. This may take a few minutes to complete. The following file will be created:
/home/curator/ORA-OUT/yeast_retrieve_suids_into_plate.txt.SUID.oral
This new file contains the same information as the file that was submitted to the process, but with one new column of information: the database unique identifier that defines each of the sequences. This file will be used directly in the next step of the print creation process.

Submission of plate creation file
A plate file is now available that is complete for database submission. Next, it is necessary to create plates within the database. These plates are virtual representations of the 384-well plates used by the microarray robot. Once entered into the database, they can be reused repeatedly to create future prints.

Plate file submission
Select the "Enter Plates (by file)" link on the LAD main menu page. Select Saccharomyces cerevisiae from the Organism pull-down menu. Edit the text field at the top of the screen to the value:
/home/curator/ORA-OUT/yeast_retrieve_suids_into_plate.txt.SUID.oral
Set Plate Name Prefix to the value SC. Set Project to the value Yeast and set Plate Source to the value Refrigerator One. The rest of the fields can be left at their default values. Select the Enter Plates button to begin the process of plate creation. Note that this is a CPU-intensive process: the database and the LAD application code perform many operations to check the integrity of the plate information that has been provided, so the overall process may take some time to complete.

Extraction of plates
Plates have now been submitted to and created within LAD.
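The .SUID.oral transformation described above (same rows, one new identifier column) can be mimicked to make its shape concrete. The plate-file columns and SUID numbers below are hypothetical; LAD's real output carries the database's own identifiers, not sequential fakes.

```shell
# Mimic the .SUID.oral step: append a (fake) SUID column to each plate-file row.
# Column layout and identifiers are hypothetical illustrations only.
d=$(mktemp -d)
printf 'SC0001\tA1\tYAL001C\nSC0001\tA2\tYAL002W\n' > "$d/plate.txt"
awk -F'\t' -v OFS='\t' '{ print $0, 1000 + NR }' "$d/plate.txt" > "$d/plate.txt.SUID.oral"
cat "$d/plate.txt.SUID.oral"
```

Each output row is the original row plus a trailing identifier column, which is exactly the relationship between the submitted file and the file LAD writes back.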
One can now extract these plates from LAD in the form of a plates file that can be submitted for the creation of a print. Why is it necessary to extract plates from the database in order to submit a print for creation? The answer is simply that every print performed with a microarray robot may use the same plate set; through design, omission, or mistake, however, the order in which plates are used by the robot may change from print to print. For this reason, LAD provides the ability to extract plates from the database to a text file that may (or may not) be reorganized before print creation.

Plate file creation
Go to the Select Plates link in the LAD main menu (Fig. 7.10.38). Select Saccharomyces cerevisiae from the Organism pull-down menu. Select SC from the next column. Edit the file name to:
/home/curator/ORA-OUT/sc_plates.txt
Press the Select Plates button to extract the plates into the named text file.

Creation of a print
It is now finally possible to create the print within LAD. If it were desirable to change the order of the plates, one would simply edit the file /home/curator/ORA-OUT/sc_plates.txt that was produced under Plate File Creation, above. That will not be done for this example, however.
Select the Enter Print and Spotlist link on the LAD main menu page.
Select Saccharomyces cerevisiae from the Organism pull-down menu.
Edit the File Name text field to a value of /home/curator/ORA-OUT/sc_plates.txt.
Set the Printer pull-down to a value of Default Printer.
Set the Print Name text field to a value of SC1.
Set the Tip Configuration drop-down to a value of 32 Tip.
Set the Number of Slides text field to a value of 400.
Set the Columns per Sector text field to a value of 24.
Set the Rows per Sector text field to a value of 19.
Set the Column Spacing text field to a value of 175.
Set the Row Spacing text field to a value of 175.
Set the Description text area to a value of Sample Print.
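The geometry values just entered determine how many features each slide will carry. Assuming each of the 32 tips prints one sector (the usual arrangement, though the source does not state it explicitly), the arithmetic is:

```shell
# Features per slide implied by the example print settings:
# 32 tips (one sector per tip) x 19 rows per sector x 24 columns per sector.
tips=32 rows=19 cols=24
spots=$(( tips * rows * cols ))
echo "$spots features per slide"   # 14592 features per slide
```

This kind of sanity check is useful before committing to a 400-slide print run: the feature count must cover the number of wells in the plate set.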
Also, be sure to select the option Rotate Plates, a flag that informs LAD that the 384-well plate was parallel to the slides on the platter during printing. Select the Enter Print button to begin the creation of the print. Note that this is a CPU-intensive process: the database and the LAD application code perform many operations to check the integrity of the print information that has been provided, so the overall process may take some time to complete.

APPENDIX C: INSTALLATION OF THE LONGHORN ARRAY DATABASE
LAD is designed and intended to run on an Intel-compatible PC running a modern Linux distribution. The authors of this unit have tested and developed LAD on Mandrake Linux versions 8.0, 8.2, 9.0, 9.1, and 9.2. They have also used RedHat Linux versions 7.2 and 8.0 in previous test installations. This section includes literal and detailed instructions based on the Mandrake 9.1 platform; please adapt them for the chosen Linux distribution. Note that the RPM version numbers presented in this section will vary between Linux distributions; always use the most recent version of any package available from either the installation or update mechanism of a particular distribution.

Installation Hardware Overview
LAD was conceptualized and designed to be a cost-efficient yet highly scalable microarray analysis platform and data warehouse. As with all server resources deployed for community operation, care must be taken to deploy LAD on hardware appropriate to its intended and projected utilization. Dedicating insufficient hardware relative to the average load will lead to poor performance and limited parallel user capacity.
In general there are five primary variables at play when designing a hardware infrastructure for LAD:
Performance level and number of CPUs
System RAM quantity
System hard-disk capacity
System hard-disk performance
Overall Linux compatibility
First, the number of CPUs and their speed primarily affects the multiuser experience of the database system. The average single user will not be able to detect the difference between a single low-end Intel Celeron processor and a dual or quad high-end Pentium IV system; conversely, ten simultaneous users will dramatically experience the difference between these two choices. The total amount of system RAM follows this pattern as well. A single user will most likely not benefit from large amounts of RAM, while multiple users launching multiple parallel database and analytical processes will definitely use increasing amounts of dedicated RAM. There is a caveat to this generalization, however. Relational databases like PostgreSQL typically attempt to cache substantial amounts of their data in system RAM in order to increase performance for queries that consistently access frequently used tables and records. The database system will therefore operate with increasing performance relative to the amount of system RAM it is able to use as dynamic cache. The Linux system administrator will want to profile the relational database's performance to determine whether it could benefit from an increase in available RAM, regardless of the size of the simultaneous user base. Hard disk capacity is directly relevant to the number of experiments that will be loaded into the system; be sure to plan for future growth of the user base in this regard. Larger disk capacity will not by itself increase performance for small or large user bases. On the other hand, the technology used in the hard disk subsystem can have a dramatic effect upon overall system performance and throughput.
For example, a single large-capacity IDE drive combined with powerful dual processors and a relatively large amount of system RAM will most likely create a system that is often described as I/O (input/output) bound: total system performance will be constrained by this weak link (the IDE disk) in the overall chain of system interactivity. A higher-throughput I/O system, e.g., a RAID (Redundant Array of Inexpensive Disks)–based system, which combines the power and capacity of multiple drives in parallel, will ensure that hard disk I/O is not the limiting variable in overall system performance.
The final consideration in the design of a system to be dedicated to LAD operation is Linux compatibility. LAD is documented to support many distributions of Linux (RedHat, Mandrake, Gentoo). Care should be taken to match a distribution that will run LAD with a distribution that includes proper hardware support for one's chosen technology infrastructure. A specific product, such as an internal RAID controller, may require experimental drivers to operate within the Linux environment; other vendors and hardware providers, however, may offer a more tested and production-worthy version of the same type of RAID controller that has been documented to interoperate with standard kernels and nonexperimental drivers. Choose options that are verified in production-level Linux deployments over newer, untested, and experimental hardware. Given these variables, the Necessary Resources section below details three options for LAD deployment. Consider these options to be generic representations of the much wider spectrum of choices in designing a system that is appropriate for one's long-term needs.
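A rough way to tell whether a candidate disk subsystem will be the I/O bottleneck described above is a simple sequential-write test. This is only a coarse sketch (real database I/O is mostly random and small-block); the conv=fdatasync flag of GNU dd forces the data to disk so the timing reflects the drive, not the page cache.

```shell
# Coarse sequential-write throughput check (writes 64 MB, then removes it).
t=$(mktemp)
dd if=/dev/zero of="$t" bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
rm -f "$t"
```

Comparing the reported MB/s for a single IDE drive against a RAID array makes the "I/O bound" argument above concrete before any hardware is committed.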
Installation and Administration Skill Set
Installation and administration of LAD does not require a high degree of technical skill or knowledge. It does, however, require basic administrative knowledge of Linux, PostgreSQL, and general system maintenance functions. Users should be sure that they are comfortable with Apache, PostgreSQL, and general Linux file and system operations before attempting to install LAD. If one is not confident of this skill set, one will most likely not be comfortable with the operations required to maintain LAD over the longer term.

Necessary Resources
Hardware
Low-range performance:
Intel Pentium III ≥500 MHz, or Linux-compatible equivalent
≥20 GB disk space available
500 MB RAM
Mid-range performance:
Intel Pentium IV ≥1.0 GHz, or Linux-compatible equivalent
≥300 GB disk space available
5 GB RAM
High-range performance:
Multi-CPU Intel Pentium IV ≥2.5 GHz, or Linux-compatible equivalent
High-performance motherboard and system chipset
≥1.0 terabyte (TB) RAID-managed disk space available
20 GB RAM
Software
Mandrake Linux 9.1 installation CDs
Web browser: Microsoft Internet Explorer version 6.0 or higher, or Mozilla version 1.6 or higher
Files
This entire document is written specifically for the release of LAD version 1.4, which is called for within this protocol and can be downloaded from the LAD Web site.

Fulfillment of Server Prerequisites
Follow the installation instructions provided by Mandrake to begin the deployment of Mandrake Linux 9.1. Special care should be taken during initial partitioning of the file system to create a structure that allows for the future growth of the relational database PostgreSQL as well as the LAD experimental data file archive.

Installation initiation
Complete all steps of the Mandrake Linux installation up to the selection of Software Groups.
Mandrake Linux installation allows one to select groups of software to be installed, greatly simplifying the installation of commonly used package sets. Select the following groups for full installation:
Internet Station
Network Computer (client)
Configuration
Console Tools
Development
Documentation
Mail
Database
Firewall/Router
Network Computer Server
KDE

Group selection
In addition to these complete groups, one will need to select individual packages for installation. For this reason, be sure to select the checkbox that calls for individual package selection. With the groups selected, proceed to the next step.

Apache RPM packages
Note that the Web group was not installed during selection of group packages for installation. This is because LAD does not currently support Apache version 2.x; Apache 1.x must therefore be installed manually from the appropriate RPM packages. The Apache 1.x packages are:
apache-1.3.27-8mdk
apache-devel-1.3.27-8mdk
apache-modules-1.3.27-8mdk
apache-mod_perl-1.3.27_1.27-7mdk
apache-source-1.3.27-8mdk
perl-Apache-Session-1.54-4mdk

Apache package selection
Find and select each of these packages. Do not yet begin installation of the selected groups or packages.

PostgreSQL RPM packages
PostgreSQL is the open-source relational database management system (RDBMS) that LAD uses for its data persistence and organization. Most of the required PostgreSQL packages are installed by selecting the Database group during installation. Be sure that the following RPM packages are installed before attempting to configure PostgreSQL for LAD operation. Required PostgreSQL packages:
postgresql-7.3.2-5mdk
postgresql-server-7.3.2-5mdk
perl-DBD-Pg-1.21-2mdk
pgaccess-0.92.8.20030117-2mdk

PostgreSQL package selection
Find and select each of these packages. Do not yet begin installation of the selected groups or packages.
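Once installation completes, one can verify that the required packages actually landed. The helper below is a hypothetical sketch (not part of LAD) covering a representative subset of the base package names from the lists above; the query command is injectable so the logic can be exercised even on systems without rpm.

```shell
# check_rpms: print any required package that the query command cannot find.
# Defaults to "rpm -q"; pass a stub (e.g. `true` or `false`) to test the logic.
check_rpms() {
  query="${1:-rpm -q}"
  for pkg in apache apache-devel apache-modules apache-mod_perl \
             perl-Apache-Session postgresql postgresql-server perl-DBD-Pg; do
    $query "$pkg" >/dev/null 2>&1 || echo "MISSING: $pkg"
  done
}
```

On the Mandrake system itself, `check_rpms` with the default query reports any of the eight listed packages that `rpm -q` cannot resolve; silence means all are present.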
Graphics libraries
LAD uses several graphic file formats for its intrinsic image manipulation functionality. Additionally, several graphics libraries are used to build LAD binary applications from source code during the installation process:
freetype-1.3.1-18mdk
freetype-devel-1.3.1-18mdk
freetype-tools-1.3.1-18mdk
freetype2-2.1.3-12mdk
freetype2-devel-2.1.3-12mdk
freetype-static-devel-2.1.3-12mdk
freetype2-tools-2.1.3-11mdk
libxpm4-3.4k-23mdk
libxpm4-devel-3.4k-23mdk
libpng3-1.2.5-2mdk
libpng3-devel-1.2.5-2mdk
libnetpbm9-9.24-4.1mdk
libnetpbm9-devel-9.24-4.1mdk
libnetpbm9-static-devel-9.24-4.1mdk
netpbm-9.24-4.1mdk
ImageMagick-5.5.4.4-7mdk

Image library and application selection
Be sure to install all of the listed applications and libraries.

Finish installation of Mandrake Linux
With all prerequisite groups and packages selected, installation of Mandrake Linux may now be completed.

Server Configuration
It is assumed that at this stage of the protocol the server has been fully installed with Mandrake Linux and cleanly rebooted into normal operation.

PostgreSQL configuration
For efficient and acceptable performance, LAD requires that a few PostgreSQL configuration issues be attended to. By default, PostgreSQL is not configured to store or manage larger-than-average datasets. By their very nature, microarray data warehouses are larger than the average dataset, and the typical LAD deployment is no exception. Here are a few simple changes that can be made to a PostgreSQL configuration file to ensure proper performance. These values will differ on machines with differing amounts of total system RAM; the values included here are appropriate for a machine with ∼1 GB RAM.
Configuration of postgresql.conf
Add or edit the following values in /var/lib/pgsql/data/postgresql.conf:
tcpip_socket = true
port = 5432
shared_buffers = 100000
sort_mem = 20000
vacuum_mem = 20000
These values will most likely not be accepted by the Linux kernel as valid, because most systems have a low shmmax by default. For this reason, configure PostgreSQL by adding the following line to /etc/init.d/postgresql (after the line that says something like PGVERSION=7.3):
echo "1999999999" > /proc/sys/kernel/shmmax
Be sure to restart the PostgreSQL server by executing (as root):
/etc/init.d/postgresql stop
/etc/init.d/postgresql start

Configuration of .pgpass
LAD's Web operations will operate in either a "trust" or password-protected PostgreSQL environment. Scripts found in /lad/custom-bin/ and /lad/mad/bin/, however, may not operate in a password-protected PostgreSQL environment. It is recommended that one create a .pgpass file for any user who has either manual or cron-related jobs that execute these programs. See the PostgreSQL documentation on this issue for more details.

Apache configuration
The default configuration of the Apache Web server needs to be altered slightly for appropriate LAD operation.

Configuration of common-httpd.conf (options)
Edit the following entries in the HTTP configuration file /etc/httpd/conf/commonhttpd.conf:
<Directory /var/www/html>
Options -Indexes FollowSymLinks MultiViews Includes
</Directory>
<Directory />
Options -Indexes FollowSymLinks MultiViews Includes
</Directory>
Many operations in LAD can take a significant amount of time to complete. The Apache Web server tends to come preconfigured to allow operations to run for a maximum of 5 min. For this reason, one will want to edit the Apache configuration to allow for the successful completion of longer processes.
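Hand-editing "key = value" files like postgresql.conf is error-prone when repeated across machines. The helper below is this author's sketch (not part of LAD): it overwrites an existing or commented-out setting, or appends the line if absent, so it can be run repeatedly. It relies on GNU sed's -i flag; the example operates on a scratch copy, never the live file.

```shell
# set_conf: idempotently set "key = value" in a flat config file (hypothetical helper).
set_conf() {
  file=$1; key=$2; value=$3
  if grep -q "^[#[:space:]]*${key}[[:space:]]*=" "$file"; then
    sed -i "s|^[#[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
  else
    printf '%s = %s\n' "$key" "$value" >> "$file"
  fi
}

# Demonstrate on a scratch file with a commented-out default:
conf=$(mktemp)
echo '#shared_buffers = 64' > "$conf"
set_conf "$conf" shared_buffers 100000
set_conf "$conf" sort_mem 20000
cat "$conf"
```

Applied to /var/lib/pgsql/data/postgresql.conf (as root, with a backup), the same calls install the five values listed above without duplicating lines on a second run.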
Configuration of common-httpd.conf (timeout)
The overall browser timeout should be adjusted in the HTTP configuration file /etc/httpd/conf/commonhttpd.conf. The value Timeout 300 should be edited to Timeout 100000. Be sure to restart the Apache Web server by executing (as root):
/etc/init.d/httpd stop
/etc/init.d/httpd start

Configuration of file system
The Apache Web server installation must be altered for LAD operation: LAD must own the default html and cgi-bin directories in order to function properly. Execute the following commands to allow LAD to own these directories:
mv /var/www/html /var/www/html_pre_lad
mv /var/www/cgi-bin /var/www/cgi-bin_pre_lad
ln -s /lad/www-data/html /var/www/html
ln -s /lad/www-data/cgi-bin /var/www/cgi-bin

LAD Installation
With PostgreSQL and Apache properly configured, one is now ready to begin the installation of LAD.

PostgreSQL connectivity diagnostic test
Be sure that the root user has the ability to create a PostgreSQL database. The following commands can be used as a diagnostic:
createdb test_lad
psql test_lad
dropdb test_lad
If these tests fail, one will need to perform the following commands to allow root to interact with the PostgreSQL server:
su postgres
createuser --adduser --createdb root
exit
Then attempt the PostgreSQL test commands above again.

LAD installation file download
The installation file lad_1.4.tar.gz is available at the LAD distribution site, http://www.longhornarraydatabase.org. Using the Web browser, download the file and save it to a directory that will be accessible to the root user.

Installation invocation
To start the installation, it is first necessary to open a command terminal. Then, through the Linux su command, become the root user.
Finally, switch to the directory in which the LAD installation file was placed. The archived contents of this file will now be decompressed and expanded. Execute as the root user:
gzip -d lad_1.4.tar.gz
tar -xvf lad_1.4.tar
cd lad/install/
./install.pl
The menu in Figure 7.10.42 should now be seen within the terminal window.

Figure 7.10.42 Terminal display of LAD installation program.

The LAD installation program is directed by a simple terminal program that allows one to step through a list of sequential operations. Steps may be repeated if problems are encountered, but only if prerequisite steps are re-executed beforehand. Each step of the installation process is detailed in the following paragraphs.

PostgreSQL configuration
Type 1 and press the Enter key to display the PostgreSQL configuration information. This step is simply a reminder to perform the PostgreSQL custom configuration that was previously discussed in this section. When complete, press the Enter key again to return to the main installation menu.

Linux prerequisites
Type 2 and press the Enter key to execute a check for Linux system prerequisites. This check is most appropriate for Mandrake Linux version 9.1. When complete, press the Enter key to return to the main installation menu.

Install LAD files
Type 3 and press the Enter key to install the LAD file tree to /lad/. Previous installations will be removed if they are detected; be sure to make backup copies of these files, especially the LAD file archive /lad/mad/archive/, if performing an upgrade. When complete, press the Enter key again to return to the main installation menu.

Configure LAD server
Type 4 and press the Enter key to configure the LAD server.
A short list of questions will be asked that are important to system configuration. Default values are given within brackets; new values may simply be typed. Press the Enter key after each question. When complete, press the Enter key again to return to the main installation menu.

Configure LAD database
Type 5 and press the Enter key to create the PostgreSQL instance of the LAD relational database. Previous installations will be removed if they are detected. When complete, press the Enter key to return to the main installation menu.

Set up CURATOR account
LAD requires that each unrestricted user (a user with permission to upload experiments) have a system account with a user directory visible to the LAD server. Type 6 and press the Enter key to create a system account for the default LAD curator. Be sure to save the password provided for this "curator" account. When complete, press the Enter key to return to the main installation menu.

Build LAD binaries
LAD has several applications that must be compiled from source code before it will operate successfully. Type 7 and press the Enter key to build these applications. The screen will show the applications being compiled. Keep in mind that warnings are expected, but error conditions should be addressed; the most common cause of error is unfulfilled prerequisite libraries. When complete, press the Enter key to return to the main installation menu.

Install GD.pm
LAD uses a specially modified version of the Perl module GD for many of its image manipulations. Type 8 and press the Enter key to build and install this version of GD. When complete, press the Enter key to return to the main installation menu.

HTTP configuration
The HTTP configuration screen is a simple reminder that the Apache Web server installation must be modified after LAD installation.
This step is simply a reminder to perform the Apache Web server custom configuration that was previously discussed in this section. Type 9 and press the Enter key to read these instructions. When complete, press the Enter key to return to the main installation menu.

Scheduled Tasks
LAD has several system scripts that should be run automatically on a regular schedule. One will probably wish to use Linux cron to manage the execution of these scripts. Figure 7.10.43 shows an example /etc/crontab file with a suggested schedule.

UNIT 7.11: Gene Expression Analysis via Multidimensional Scaling
A first step in studying the gene expression profiles derived from a collection of experiments is to visualize the characteristics of each sample in order to gain an impression of the similarities and differences between samples. A standard measure of sample similarity is the Pearson correlation coefficient (see Commentary). While popular clustering algorithms may imply the intrinsic relationship between genes and samples, a visual appreciation of this relationship is not only essential for hypothesis generation (Khan et al., 1998; Bittner et al., 2000; Hedenfalk et al., 2001), but is also a powerful tool for communicating the utility of a microarray experiment. Visualization of similarities between samples provides an alternative to the hierarchical clustering algorithm. The multidimensional scaling (MDS) technique is one of the methods that convert the structure in a similarity matrix into a simple geometrical picture: the larger the dissimilarity between two samples (evaluated through gene expression profiling), the further apart the points representing those experiments should be in the picture (Green and Rao, 1972; Schiffman et al., 1981; Young, 1987; Green et al., 1989; Borg and Groenen, 1997; Cox and Cox, 2000).
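The similarity-to-distance conversion that MDS operates on can be written down explicitly. The unit's own definitions appear in its Commentary (Equation 7.11.6 is not reproduced here); the following is a standard formulation under the common convention that Pearson correlation is turned into a dissimilarity by subtraction from one:

```latex
% Pearson correlation between samples i and j over genes g,
% converted to a dissimilarity (standard convention, assumed here):
r_{ij} = \frac{\sum_{g}\,(x_{gi}-\bar{x}_{i})(x_{gj}-\bar{x}_{j})}
              {\sqrt{\sum_{g}(x_{gi}-\bar{x}_{i})^{2}}\;\sqrt{\sum_{g}(x_{gj}-\bar{x}_{j})^{2}}},
\qquad
d_{ij} = 1 - r_{ij}
% MDS then seeks low-dimensional coordinates whose pairwise Euclidean
% distances \hat{d}_{ij} reproduce d_{ij}, e.g. by minimizing a stress of the form
\mathrm{Stress} = \sqrt{\frac{\sum_{i<j}\bigl(\hat{d}_{ij}-d_{ij}\bigr)^{2}}
                             {\sum_{i<j} d_{ij}^{2}}}
```

Here $x_{gi}$ is the (log-transformed) expression value of gene $g$ in sample $i$, and $\hat{d}_{ij}$ is the distance between the plotted points; highly correlated samples thus land close together in the plot, exactly the behavior described above.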
The installation and implementation of the MDS method for gene expression analysis is described in the Basic Protocol, which explains how to obtain a set of MATLAB scripts and includes step-by-step procedures to enable users to obtain results quickly. To demonstrate the functions of the MDS program, a publicly available data set is used to generate a set of MDS plots; the interpretation of the plots is described in Guidelines for Understanding Results. The mathematical fundamentals of the method are described in the Commentary section. A diagram of the expression data flow described in this unit is shown in Figure 7.11.1. MDS plots enable statisticians to communicate easily with biologists, which, in turn, helps biologists form new hypotheses.

Figure 7.11.1 Flow chart illustrating the data flow of the MDS program.

Contributed by Yidong Chen and Paul S. Meltzer. Current Protocols in Bioinformatics (2005) 7.11.1-7.11.9. Copyright © 2005 by John Wiley & Sons, Inc.

BASIC PROTOCOL: USING THE MDS METHOD FOR GENE EXPRESSION ANALYSIS
The following protocol describes how to install and execute the MDS program. A publicly available and downloadable data set is used to illustrate each step and the interpretation of the output generated by the MDS program.

Necessary Resources
Hardware
The distributed program was tested under MATLAB v. 6.2 on a PowerPC Macintosh computer with Mac OS X v. 10.2; additional machine support may be included and will be specified in a README file provided with the distribution.
Software
MATLAB implementation of MDS, provided for distribution at http://research.nhgri.nih.gov/microarray/MDS Supplement/ (note that the URL is case-sensitive), along with a data set from Bittner et al. (2000). A MATLAB license, purchased from Mathworks Inc. (http://www.mathworks.com), is required, as is the Statistics Toolbox for MATLAB.
Newer versions of the Statistics Toolbox have their own implementation of classical MDS:
[y, e] = cmdscale(D)
where D is the input n×n distance matrix, y gives the coordinates of the n points in p dimensions (p<n), and e is the vector of eigenvalues from which one can determine the best reduced dimension of the system. Note, however, that the result of this function does not necessarily minimize the stress defined by Equation 7.11.6 (see Background Information).

Files
Data matrix: The user can use either a ratio matrix or an intensity matrix. The expression ratios (e.g., the normalized ratio of mean intensity of sample 1 to sample 2) or intensities of each experiment are organized in columns, while each row provides the expression data for one gene. A typical data table is shown in Figure 7.11.2. The data matrix may contain missing values, meaning that no value was provided in an experiment for a given gene (most likely because the measurement was not reliable). The data matrix might also contain no values for an entire experiment if no probe was printed at one particular location for a batch of microarray slides. These kinds of missing entries are discarded during the computation of the pairwise Pearson correlation coefficients. Data matrices (whether ratio, intensity, or quality) are stored as tab-delimited text.

Figure 7.11.2 Data matrix format (tab-delimited text file displayed as Microsoft Excel spreadsheet). The first row of the matrix contains the experiment names and the first column contains the IDs for each gene. It is recommended that white-space characters be excluded from IDs and experiment names.

Figure 7.11.3 Color assignment table format (tab-delimited text file displayed as Microsoft Excel spreadsheet). The first row has the color assignments (r = red, b = blue, y = yellow) and the second row has the gene ID.
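Before loading such a tab-delimited matrix into the MDS program, it is worth confirming that every row has the same number of columns (ragged rows usually indicate stray tabs or truncated lines). The file name and values below are hypothetical; the awk one-liner works on any of the matrix files described above.

```shell
# Structural check of a tab-delimited data matrix: all rows should have the
# same column count as the header row. (Contents here are a made-up example.)
m=$(mktemp)
printf 'gene\ts1\ts2\ng1\t0.5\t2.1\ng2\t1.3\t0.8\n' > "$m"
awk -F'\t' 'NR==1 { n = NF } NF != n { bad++ } END { print (bad ? bad " ragged rows" : "OK") }' "$m"
```

Running the same one-liner on Melanoma3614ratio.data (or a quality matrix) should print OK; any other output points at the offending row count before MATLAB ever sees the file.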
Quality matrix (optional): This file is organized similarly to the data matrix, but each element of the quality matrix takes a value from 0 (lowest measurement quality) to 1 (highest measurement quality). Many image analysis software programs and data-filtering schemes do not provide a quantitative quality assessment; instead, an incomplete data matrix (containing missing values) may be provided. Inside MATLAB, a default quality matrix of all 1s is automatically generated when no quality matrix is supplied, and any detected missing value is assigned a quality of 0 at the same location in the quality matrix.
Color assignment file (optional): This is a simple text file whose format is shown in Figure 7.11.3. Each sample is assigned a color code to provide some visual assistance. Currently, the program only accepts the following colors: r, g, b, c, m, y, k, and w, representing red, green, blue, cyan, magenta, yellow, black, and white, respectively.

Installing the MDS program
It is assumed that the base MATLAB program and Statistics Toolbox were installed prior to this step. Users are encouraged to read the further installation instructions provided in the README file that can be downloaded from the Web site mentioned in Necessary Resources.
1. Download the software from http://research.nhgri.nih.gov/microarray/MDS Supplement/.
2. To install the package, drag the MDS folder to any location on the local hard drive, then set the appropriate MATLAB default path to include the MDS folder (see the README file for details).

Executing the MDS program
Upon installing the MATLAB MDS scripts, one is ready to execute the program without further compilation. To demonstrate the application of the MDS algorithm listed below, the authors have provided a set of gene expression data derived from a melanoma study (Bittner et al., 2000), downloadable from the aforementioned Web site.
Briefly, the set comprises 38 microarray gene expression profiles, including 31 melanoma cell lines and tissue samples and seven controls, each measured against a common reference sample. There are a total of 3614 gene expression data points per experiment after careful quality control. The original data, along with a detailed description, can also be found at the following Web sites: the NHGRI microarray Web site (http://research.nhgri.nih.gov/microarray/) for selected publications, and the NCBI GEO site (http://www.ncbi.nih.gov/geo/) under GEO accession number GSE1. Two files are available: (1) the gene expression ratio file Melanoma3614ratio.data and (2) the color assignment file Melanoma3614color.txt. These two files will be used in the following steps.

Figure 7.11.4 MDS graphical user interface showing various options.

3. Launch the MATLAB environment and type:
> MDSDialog
A dialog window should appear as shown in Figure 7.11.4.
4. For Input File Types, choose the "Ratio-based" radio button.
5. For Input Files:
a. Check Data Matrix File, and a standard file input dialog box will be displayed. Navigate to where the file Melanoma3614ratio.data is located, highlight the file, and click the Open button.
b. Since there is no quality matrix provided in the example data set, leave the Quality Matrix File check box unchecked.
c. Check Color Assignment File, navigate to where the file Melanoma3614color.txt is located, highlight the file, and click the Open button.
6. For Preprocessing:
a. Check "Choose to round =" and use the default value 50 to automatically round down every ratio greater than 50 to 50, and round up any ratio less than 1/50 to 1/50.
b. Check "Log-transform."
7. For the Similarity/Distance choice, choose the default, "Pearson correlation."
8. For Output choice:
a.
Check “With sample name,” so that the final display will have a name attached to each object.
b. Check “AVI Movie” to request an AVI movie output.

9. For Display mode, select the “3-D” radio button.

10. Click the OK button, and the MDS program will execute.

11. After the program ends, the user may rotate the coordinate system by holding the mouse button down while moving the cursor on the screen.

GUIDELINES FOR UNDERSTANDING RESULTS
In many expression profiling studies, one of the common questions concerns the relationships between samples. Sometimes the class designations of samples are known (e.g., cancer type, stage of cancer progression, treatment), while in other cases there are merely array-quality-related classifications (e.g., fabrication batch or hybridization date). If it is assumed that similar molecular changes or similar microarray measurements can result from samples having similar phenotypes, or from a particular fabrication batch, it is desirable to visualize the relationships between samples before committing to a particular statistical analysis. To visually observe the similarity between test samples, one can simply use an expression ratio matrix that was deemed to be of good measurement quality. To illustrate the results, the authors have employed the data files from Bittner et al. (2000) supplied with this distribution. The expression ratio matrix file Melanoma3614ratio.data contains 3614 genes that passed the measurement quality criterion. The authors have also used the color assignment file Melanoma3614color.txt, in which red represents less aggressive melanoma samples, blue represents more aggressive samples, and yellow signifies control samples (see Bittner et al., 2000). Following the default settings (log-transform, rounding at 50, Pearson correlation), a 3-D MDS plot is generated as shown in Figure 7.11.5A. One immediate question is which gene’s (or set of genes’) expression levels correspond to this clustering effect?
When one assigns red to samples whose WNT5A (CloneID: 324901) expression ratio is less than 0.5 and blue to all others (this color assignment file is provided as Melanoma3614WNT5A.txt), the resulting MDS plot, shown in Figure 7.11.5B, closely resembles Figure 7.11.5A. Using a subset of 276 genes derived from Bittner et al. (2000), MelanomaSigGene276ratio.txt (note that the order of columns in this file differs from Melanoma3614ratio.data), the authors applied the same algorithm along with the color assignment file MelanomaSigGene276color.txt. The MDS result is shown in Figure 7.11.5C. Clearly, Figure 7.11.5C shows a remarkable clustering effect in the less aggressive class (19 samples), but the same effect is not observed in the other two groups (the aggressive class and the control class). Another excellent example of the application of MDS techniques is found in Hedenfalk et al. (2001). In their study of breast cancer samples, the cancer subtypes BRCA1, BRCA2, and sporadic were predetermined by genotyping; a statistical analysis approach similar to that discussed above was employed, and MDS plots with a set of discriminative genes demonstrated the power of the MDS visualization technique. Specific methods of selecting significantly expressed genes are beyond the scope of this unit; users should refer to publications by Bittner et al. (2000), Hedenfalk et al. (2001), Tusher et al. (2001), and Smyth (2004).

Figure 7.11.5 Three-dimensional MDS plots. (A) Generated from 3614 genes that passed the measurement quality criterion; (B) color overlay with the WNT5A gene’s expression ratio (red = expression ratio <0.5, blue = all others); (C) MDS plot with 276 discriminative genes (derived from Bittner et al., 2000).
This black-and-white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure, go to http://www.interscience.wiley.com/cp/colorfigures.htm.

COMMENTARY

Background Information

Similarity and distance between samples
A gene expression data set is typically presented in matrix form, X = {x1, x2, . . . , xM}, where the vector xi represents the ith microarray experiment, with a total of M experiments in the data set. Each experiment consists of n measurements, either gene expression intensities or expression ratios, denoted by xi = {xi1, xi2, . . . , xin}. The matrix X can also be viewed in the row direction, which provides profiles of genes across all samples. Samples may be ordered according to time or dosage, or may be unordered collections. Without loss of generality, the following discussion presents MDS concepts and algorithms by exploring the relationships among experimental samples. In addition, it is assumed that the microarray measurements (expression ratios or intensities) are log-transformed. Other data-transformation methods have also been proposed in order to preserve certain expected statistical characteristics for higher-level analysis; for a detailed discussion of data transformations, readers are referred to the literature (Durbin et al., 2002; Huber et al., 2002).

Pearson correlation coefficient
One of the most commonly used similarity measures is the Pearson correlation coefficient. Given a matrix X, a similarity matrix of all columns, S = {sij}, is defined as

s_{ij} = \frac{1}{n} \sum_{k=1}^{n} \frac{(x_{ik} - \mu_{x_i})(x_{jk} - \mu_{x_j})}{\sigma_{x_i} \sigma_{x_j}}    (Equation 7.11.1)

where \mu_{x_i}, \mu_{x_j} (and \sigma_{x_i}, \sigma_{x_j}) are the means (and standard deviations) over all genes in arrays i and j, respectively. The value of the correlation coefficient lies between −1 and 1, with 1 signifying perfectly correlated, 0 not correlated at all, and −1 inversely correlated. The distance between two samples i and j is simply defined as dij = 1 − sij.

Commonly, a normalization procedure is applied to microarray data before they enter the similarity calculation, and when gene expression ratios are under consideration, the mean expression ratio, \mu_{x_i}, should be zero. An uncentered correlation coefficient is defined as

s_{ij} = \frac{\sum_{k=1}^{n} x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^{n} x_{ik}^2} \sqrt{\sum_{k=1}^{n} x_{jk}^2}}    (Equation 7.11.2)

Equation 7.11.2 assumes that the mean is 0, even when it is not. The difference between Equations 7.11.1 and 7.11.2 is that if two samples’ expression profiles have identical shape but are offset relative to each other by a fixed value, the standard Pearson correlation of the two profiles will be 1, but the uncentered correlation will not be 1.

Spearman correlation coefficient
On many occasions, when it is not convenient, accurate, or even possible to assign actual values to variables, the rank order of instances of each variable is commonly employed. Rank order may also be a better measure when the relationship is nonlinear. Rather than using the raw data as in the Pearson correlation coefficient, the Spearman correlation coefficient performs the calculation of Equation 7.11.1 on the rank of each data point:

s_{ij} = \frac{1}{n} \sum_{k=1}^{n} \frac{(r_{ik} - \mu_{r_i})(r_{jk} - \mu_{r_j})}{\sigma_{r_i} \sigma_{r_j}}    (Equation 7.11.3)

where r_{ik} is the rank of the kth measurement in the ith microarray. The range of the Spearman correlation coefficient is also from −1 to 1.

Euclidean distance
The Euclidean distance is the most commonly used dissimilarity measure, defined as

d_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}    (Equation 7.11.4)

Notice that the Euclidean distance is not bounded; therefore, normalization and dynamic-range rescaling of each microarray are very important steps before the distance calculation. Otherwise, the distance measure will reflect bias due to inadequate normalization. There are many other distance and similarity measures (see Everitt and Dunn, 1992, and Duda et al., 2001), which are not covered exhaustively here.

Multidimensional scaling
As mentioned before, the similarity matrix can be converted to a distance matrix, D = {dij}, by D = 1 − S.
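The contrast between the correlation-based distance and the unbounded Euclidean distance can be checked with a short sketch (Python is used here for illustration; the unit's own software is MATLAB, and the helper names are hypothetical):

```python
import numpy as np

def pearson_distance(X):
    """d_ij = 1 - s_ij, where s_ij is the Pearson correlation between
    arrays; rows of X are experiments, columns are genes. s in [-1, 1]
    implies d in [0, 2]."""
    return 1.0 - np.corrcoef(X)

def euclidean_distance(X):
    """Unbounded Euclidean distance between all pairs of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Two profiles of identical shape, offset by a constant: their Pearson
# distance is ~0, but their Euclidean distance is clearly nonzero.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0]])
D_pearson = pearson_distance(X)
D_euclid = euclidean_distance(X)
```

This also illustrates why normalization matters more for the Euclidean measure: a constant offset that correlation ignores contributes directly to the Euclidean distance.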
A typical distance matrix is shown in Table 7.11.1, where the diagonal elements are all zeros and the upper triangle is symmetric to the lower triangle; since sij lies in [−1, 1], each element of D takes a value in [0, 2]. After this step, the original data dimension has been effectively reduced to an M × M matrix, and the distance scale derived from the similarity matrix preserves the defining properties of a distance (origin at zero, ordered scale, and ordered differences). The objective of multidimensional scaling is to further reduce the dimension of the matrix such that an observable graphical model is obtained with minimum error.

Table 7.11.1 Partial Distance Matrix, D, Generated from the Provided Data Sample (Bittner et al., 2000)

            UACC383  UACC457  UACC3093  UACC2534  M92-001  A-375  UACC502  M91-054  UACC1256
UACC383     0
UACC457     0.4      0
UACC3093    0.37     0.32     0
UACC2534    0.44     0.49     0.41      0
M92-001     0.47     0.54     0.46      0.41      0
A-375       0.46     0.46     0.45      0.47      0.36     0
UACC502     0.57     0.57     0.52      0.54      0.42     0.5    0
M91-054     0.49     0.52     0.47      0.54      0.42     0.5    0.43     0
UACC1256    0.51     0.49     0.49      0.48      0.41     0.47   0.4      0.46     0
UACC091     0.48     0.56     0.5       0.53      0.47     0.48   0.49     0.41     0.42

Typically, a Cartesian coordinate system represents the observable graphical model, with the Euclidean distance measuring the between-point distance. To achieve the objective of mapping the distance matrix to a graphical model, one would like to find a target matrix X̂ = {x̂ij} in 1-, 2-, or 3-dimensional (p-dimensional) space, with its accompanying distance matrix D̂ = {d̂ij} derived from the Euclidean distance in p-dimensional space,

\hat{d}_{ij} = \sqrt{\sum_{k=1}^{p} (\hat{x}_{ik} - \hat{x}_{jk})^2}    (Equation 7.11.5)

such that the stress T, a measure of the goodness-of-fit, is minimized. Here, the stress T is defined by

T = \sqrt{\frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}}    (Equation 7.11.6)

Implementation and applications
Many statistical software packages provide implementations of the classic MDS algorithm. Without going into a lengthy discussion of the various MDS definitions and their mathematics, the authors of this unit provide a concise version of one possible implementation of classic MDS (Everitt and Dunn, 1992). In the classical MDS setting, given the pairwise distances dij between all M experiments, the coordinates of M points in M-dimensional Euclidean space, X = (xij), can be obtained by finding the eigenvectors of B = XX′, where the matrix elements of B, {bij}, can be estimated by

b_{ij} = -\frac{1}{2} \left( d_{ij}^2 - \bar{d}_{i\cdot}^2 - \bar{d}_{\cdot j}^2 + \bar{d}_{\cdot\cdot}^2 \right)    (Equation 7.11.7)

where \bar{d}_{i\cdot}^2, \bar{d}_{\cdot j}^2, and \bar{d}_{\cdot\cdot}^2 are the means of d_{ij}^2 over j, over i, and over both i and j, respectively. By selecting the first k principal components of X to form a target matrix X̂ = {x̂ij}, for i = 1, . . . , M and j = 1, . . . , k, the classic MDS implementation is achieved. It should be noted that some implementations of MDS provide a second step that optimizes against Equation 7.11.5 via an optimization procedure such as the steepest-descent method; on many occasions this second step is quite significant, and the stress T can be further reduced. Readers interested in actual implementations are strongly advised to consult the MDS literature listed at the end of this section.

Public domain and commercial software packages
A large number of software packages provide MDS functionality. Some popular applications are listed in Table 7.11.2.

Critical Parameters
If an acceptably accurate representation, ranging from T = 20% (poor), to 5% (good), to 2.5% (excellent), as suggested by Everitt and Dunn (1992), can be found in three-dimensional (or lower) space, visualization via MDS will be an extremely valuable way to gain insight into the structure of the data. However, due to the high dimensionality of gene expression profiling experiments, MDS plots are normally expected to have a stress parameter T in the range of 10% to 20%. Some important clarifications are provided as follows.

1.
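The double-centering recipe described above can be sketched as follows (Python is used for illustration, with hypothetical helper names; this is not the unit's MATLAB code). For a distance matrix that truly comes from low-dimensional Euclidean points, the embedding should recover the distances almost exactly, giving a stress near zero:

```python
import numpy as np

def classical_mds(D, k=3):
    """Classic MDS: double-center the squared distance matrix to
    estimate B = X X', then embed using the top-k eigenvectors."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # b_ij from d_ij^2
    w, V = np.linalg.eigh(B)                   # ascending eigenvalues
    idx = np.argsort(w)[::-1][:k]              # top-k components
    L = np.sqrt(np.clip(w[idx], 0, None))
    return V[:, idx] * L                       # M x k coordinates

def stress(D, Xhat):
    """Goodness-of-fit stress T, as a fraction (summing over all pairs,
    which is equivalent to the i<j form)."""
    Dhat = np.sqrt(((Xhat[:, None, :] - Xhat[None, :, :]) ** 2).sum(-1))
    return np.sqrt(((D - Dhat) ** 2).sum() / (D ** 2).sum())

# Points that genuinely lie in 2-D are recovered with near-zero stress
pts = np.array([[0.0, 0.0], [1, 0], [0, 1], [1, 1], [2, 1]])
D = np.sqrt(((pts[:, None] - pts[None, :]) ** 2).sum(-1))
Xhat = classical_mds(D, k=2)
```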
If the data matrix contains raw expression-ratio or intensity data, a logarithmic transform is highly recommended. Typically, log2 transforms are applied to ratio data in order to obtain a symmetric distribution around zero, in which two-fold increases and decreases in the ratio (2 and 0.5, respectively) are conveniently converted to 1 and −1, respectively. While the purpose of transforming ratio data is mainly to yield a symmetric distribution, log-transformation of intensity data mainly provides stabilization of variance across all possible intensity ranges. If not satisfied with the log-transforms automatically provided by the program, the user can supply a matrix with the transformation of choice, in which case the “transformation” check box should be left unchecked.

2. Rounding is necessary to cap extreme ratio or intensity values. However, when a pretransformed data matrix is supplied, no data rounding should be used.

3. The preferred distance metric is the Pearson correlation coefficient; however, other distance measures (e.g., Spearman, Euclidean; see Background Information) are provided for comparison purposes.

4. Generation of an AVI movie requires a large amount of RAM, and the program typically produces an AVI movie of ∼300 Mb. Other compression formats (QuickTime, MPEG, etc.) may be required to produce a movie of manageable size.

Table 7.11.2 Software Packages Providing MDS Functionality

Application | Function | Availability | Platform
R | Classic multidimensional scaling: cmdscale(d, k, eig, add, x.ret) | Free (http://www.r-project.org/) | Windows, Unix, Mac OS X
S-plus | Classic multidimensional scaling: cmdscale(d, k, eig, add, x.ret) | Commercial; Insightful Corp. (http://www.insightful.com) | Windows, Unix
BRB ArrayTools | Menu-driven microarray analysis tools | Free (see licensing agreement); Biometric Research Branch, DCTD, NCI, NIH (http://linus.nci.nih.gov/BRBArrayTools.html) | Windows
Partek Pro/Partek Discover | Menu-driven data analysis and visualization | Commercial; Partek, Inc. (http://www.partek.com) | Windows, Unix
xGobi/xGvis | Multivariate data visualization and multidimensional scaling | Free (http://www.research.att.com/areas/stat/xgobi/) | Windows, Unix, Mac OS X

Literature Cited
Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., Sampas, N., Dougherty, E., Wang, E., Marincola, F., Gooden, C., Lueders, J., Glatfelter, A., Pollock, P., Carpten, J., Gillanders, E., Leja, D., Dietrich, K., Beaudry, C., Berens, M., Alberts, D., and Sondak, V. 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406:536-540.
Borg, I. and Groenen, P. 1997. Modern Multidimensional Scaling: Theory and Applications. Springer, New York.
Cox, T.F. and Cox, M.A.A. 2000. Multidimensional Scaling, 2nd ed. CRC Press, Boca Raton, Fla.
Duda, R.O., Hart, P.E., and Stork, D.G. 2001. Pattern Classification, 2nd ed. John Wiley & Sons, New York.
Durbin, B., Hardin, J., Hawkins, D., and Rocke, D. 2002. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18:105-110.
Everitt, B.S. and Dunn, G. 1992. Applied Multivariate Data Analysis. Oxford University Press, New York.
Green, P.E. and Rao, V.R. 1972. Applied Multidimensional Scaling. Dryden Press, Hinsdale, Ill.
Green, P.E., Carmone, F.J., and Smith, S.M. 1989. Multidimensional Scaling: Concepts and Applications. Allyn and Bacon, Needham Heights, Mass.
Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O.P., Wilfond, B., Borg, A., and Trent, J. 2001.
Gene expression profiles in hereditary breast cancer. N. Engl. J. Med. 344:539-548.
Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A., and Vingron, M. 2002. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18:S96-S104.
Khan, J., Simon, R., Bittner, M., Chen, Y., Leighton, S.B., Pohida, T., Smith, P.D., Jiang, Y., Gooden, G.C., Trent, J.M., and Meltzer, P.S. 1998. Gene expression profiling of alveolar rhabdomyosarcoma with cDNA microarrays. Cancer Res. 58:5009-5113.
Schiffman, S.S., Reynolds, M.L., and Young, F.W. 1981. Introduction to Multidimensional Scaling: Theory, Method and Applications. Academic Press, New York.
Smyth, G. 2004. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3(1):Article 3. http://www.bepress.com/sagmb/vol3/iss1/art3.
Tusher, V.G., Tibshirani, R., and Chu, G. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 98:5116-5121.
Young, F.W. 1987. Multidimensional Scaling: History, Theory and Applications (R.M. Hamer, ed.). Lawrence Erlbaum Associates, Hillsdale, N.J.

Contributed by Yidong Chen and Paul S. Meltzer
National Human Genome Research Institute
National Institutes of Health
Bethesda, Maryland

Using GenePattern for Gene Expression Analysis
UNIT 7.12
Heidi Kuehn, Arthur Liberzon, Michael Reich, and Jill P. Mesirov
Broad Institute of MIT and Harvard, Cambridge, Massachusetts

ABSTRACT
The abundance of genomic data now available in biomedical research has stimulated the development of sophisticated statistical methods for interpreting the data, and of special visualization tools for displaying the results in a concise and meaningful manner.
However, biologists often find these methods and tools difficult to understand and use correctly. GenePattern is a freely available software package that addresses this issue by providing more than 100 analysis and visualization tools for genomic research in a comprehensive, user-friendly environment for users at all levels of computational experience and sophistication. This unit demonstrates how to prepare and analyze microarray data in GenePattern. Curr. Protoc. Bioinform. 22:7.12.1-7.12.39. © 2008 by John Wiley & Sons, Inc.

Keywords: GenePattern; microarray data analysis; workflow; clustering; classification; differential expression analysis; pipelines

INTRODUCTION
GenePattern is a freely available software package that provides access to a wide range of computational methods used to analyze genomic data. It allows researchers to analyze the data and examine the results without writing programs or requesting help from computational colleagues. Most importantly, GenePattern ensures reproducibility of analysis methods and results by capturing the provenance of the data and analytic methods, the order in which methods were applied, and all parameter settings. At the heart of GenePattern are the analysis and visualization tools (referred to as “modules”) in the GenePattern module repository. This growing repository currently contains more than 100 modules for analysis and visualization of microarray, SNP, proteomic, and sequence data. In addition, GenePattern provides a form-based interface that allows researchers to incorporate external tools as GenePattern modules. Typically, the analysis of genomic data consists of multiple steps; in GenePattern, this corresponds to the sequential execution of multiple modules. With GenePattern, researchers can easily share and reproduce analysis strategies by capturing the entire set of steps (along with data and parameter settings) in a form-based interface or from an analysis result file.
The resulting “pipeline” makes all the necessary calls to the required modules. A pipeline allows repetition of the analysis methodology using the same or different data with the same or modified parameters. It can also be exported to a file and shared with colleagues interested in reproducing the analysis.

Current Protocols in Bioinformatics 7.12.1-7.12.39, June 2008. Published online June 2008 in Wiley Interscience (www.interscience.wiley.com). DOI: 10.1002/0471250953.bi0712s22. Copyright © 2008 John Wiley & Sons, Inc.

GenePattern is a client-server application. The application components can all be run on a single machine with requirements as modest as those of a laptop, or they can be run on separate machines, allowing the server to take advantage of more powerful hardware. The server is the GenePattern engine: it runs analysis modules and stores analysis results. Two point-and-click graphical user interfaces, the Web Client and the Desktop Client, provide easy access to the server and its modules. The Web Client is installed with the server and runs in a Web browser. The Desktop Client is installed separately and runs as a desktop application. In addition, GenePattern libraries for the Java, MATLAB, and R programming environments provide access to the server and its modules via function calls. The basic protocols in this unit use the Web Client; however, they could also be run from the Desktop Client or a programming environment. This unit demonstrates the use of GenePattern for microarray analysis. Many transcription profiling experiments have at least one of the three following goals: differential expression analysis, class discovery, or class prediction. The objective of differential expression analysis is to find genes (if any) that are differentially expressed between distinct classes or phenotypes of samples.
The differentially expressed genes are referred to as marker genes and the analysis that identifies them is referred to as marker selection. Class discovery allows a high-level overview of microarray data by grouping genes or samples by similar expression profiles into a smaller number of patterns or classes. Grouping genes by similar expression profiles helps to detect common biological processes, whereas grouping samples by similar gene expression profiles can reveal common biological states or disease subtypes. A variety of clustering methods address class discovery by gene expression data. In class prediction studies, the aim is to identify key marker genes whose expression profiles will correctly classify unlabeled samples into known classes. For illustration purposes, the protocols use expression data from Golub et al. (1999), which is referred to as the ALL/AML dataset in the text. The data from this study was chosen because it contains all three of the analysis objectives mentioned above. Briefly, the study built predictive models using marker genes that were significantly differentially expressed between two subtypes of leukemia, acute lymphoblastic (ALL) and acute myelogenous (AML). It also showed how to rediscover the leukemia subtypes ALL and AML, as well as the B and T cell subtypes of ALL, using sample-based clustering. The sample data files are available for download on the GenePattern Web site at http://www.genepattern.org/datasets/. PREPARING THE DATASET Analyzing gene expression data with GenePattern typically begins with three critical steps. Step 1 entails converting gene expression data from any source (e.g., Affymetrix or cDNA microarrays) into a tab-delimited text file that contains a column for each sample, a row for each gene, and an expression value for each gene in each sample. GenePattern defines two file formats for gene expression data: GCT and RES. 
The primary difference between the formats is that the RES file format contains the absent (A) versus present (P) calls generated for each gene by the Affymetrix GeneChip software. The protocols in this unit use the GCT file format; however, they could also use the RES file format. All GenePattern file formats are fully described in GenePattern File Formats (http://genepattern.org/tutorial/gp_fileformats.html). Step 2 entails creating a tab-delimited text file that specifies the class or phenotype of each sample in the expression dataset, if available. GenePattern uses the CLS file format for this purpose. Step 3 entails preprocessing the expression data as needed, for example, to remove platform noise and genes that have little variation across samples. GenePattern provides the PreprocessDataset module for this purpose.

BASIC PROTOCOL 1 Creating a GCT File
Four strategies can be used to create an expression data file (GCT file format; Fig. 7.12.1), depending on how the data were acquired:

1. Create a GCT file based on expression data extracted from the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) or the National Cancer Institute’s caArray microarray expression data repository (http://caarray.nci.nih.gov). GenePattern provides two modules for this purpose: GEOImporter and caArrayImportViewer.

2. Convert MAGE-ML format data to a GCT file. MAGE-ML is the standard format for storing both Affymetrix and cDNA microarray data at the ArrayExpress repository (http://www.ebi.ac.uk/arrayexpress). GenePattern provides the MAGEMLImportViewer module to convert MAGE-ML format data.

3. Convert raw expression data from Affymetrix CEL files to a GCT file. GenePattern provides the ExpressionFileCreator module for this purpose.

4.
Expression data stored in any other format (such as cDNA microarray data) must be converted into a tab-delimited text file that contains expression measurements with genes as rows and samples as columns. Expression data can be intensity values or ratios. Use Excel or a text editor to manually modify the text file to comply with the GCT file format requirements. Excel is a popular choice for editing gene expression data files; however, be aware that (1) its auto-formatting can introduce errors in gene names (Zeeberg et al., 2004), and (2) its default file extension for tab-delimited text is .txt, whereas GenePattern requires a .gct file extension for GCT files. In Excel, choose Save As and save the file in text (tab-delimited) format with a .gct extension. Table 7.12.1 lists commonly used gene expression data formats and the recommended method for converting each into a GenePattern GCT file. For the protocols in this unit, download the expression data files all_aml_train.gct and all_aml_test.gct from the GenePattern Web site at http://www.genepattern.org/datasets/.

Figure 7.12.1 all_aml_train.gct as it appears in Excel. GenePattern File Formats (http://genepattern.org/tutorial/gp_fileformats.html) fully describes the GCT file format.

Table 7.12.1 GenePattern Modules for Translating Expression Data into GCT or RES File Formats

Source data | GenePattern module (a) | Output file (a)
CEL files from Affymetrix | ExpressionFileCreator | GCT or RES
Gene Expression Omnibus (GEO) data | GEOImporter | GCT
MAGE-ML expression data from ArrayExpress | MAGEMLImportViewer | GCT
caArray expression data | caArrayImportViewer | GCT
Two-color ratio data (b) | N/A | N/A

(a) N/A, not applicable.
(b) Two-color ratio data in text format files, such as PCL and CDT, can be opened in Excel or a text editor and modified to match the GCT or RES file format.
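As an illustration of strategy 4, a minimal GCT writer can be sketched in Python (the helper name write_gct is hypothetical; the layout follows the GCT description above: a “#1.2” version line, a row/column count line, then a header row with Name, Description, and one column per sample):

```python
def write_gct(path, names, descriptions, samples, values):
    """Write a minimal GCT (version 1.2) file: genes as rows,
    samples as columns, tab-delimited."""
    with open(path, "w") as f:
        f.write("#1.2\n")
        f.write(f"{len(names)}\t{len(samples)}\n")
        f.write("Name\tDescription\t" + "\t".join(samples) + "\n")
        for name, desc, row in zip(names, descriptions, values):
            f.write(name + "\t" + desc + "\t" +
                    "\t".join(str(v) for v in row) + "\n")

# Hypothetical two-gene, two-sample example
write_gct("example.gct",
          ["GeneA", "GeneB"], ["na", "na"],
          ["sample1", "sample2"],
          [[120, 340], [56, 78]])
```

Opening the resulting file in a text editor shows the same structure as Figure 7.12.1.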
BASIC PROTOCOL 2 Creating a CLS File
Many of the GenePattern modules for gene expression analysis require both an expression data file and a class file (CLS format). A CLS file (Fig. 7.12.2) identifies the class or phenotype of each sample in the expression data file. It is a space-delimited text file that can be created with any text editor. The first line of the CLS file contains three values: the number of samples, the number of classes, and the version number of the file format (always 1). The second line begins with a pound sign (#) followed by a name for each class. The last line contains a class label for each sample; the number and order of the labels must match the number and order of the samples in the expression dataset. The class labels are sequential numbers (0, 1, . . .) assigned to the classes listed in the second line. For the protocols in this unit, download the class files all_aml_train.cls and all_aml_test.cls from the GenePattern Web site at http://www.genepattern.org/datasets/.

Figure 7.12.2 all_aml_train.cls as it appears in Notepad. GenePattern File Formats (http://genepattern.org/tutorial/gp_fileformats.html) fully describes the CLS file format.

BASIC PROTOCOL 3 Preprocessing Gene Expression Data
Most analyses require preprocessing of the expression data. Preprocessing removes platform noise and genes that have little variation, so the analysis can identify interesting variations, such as differential expression between tumor and normal tissue. GenePattern provides the PreprocessDataset module for this purpose. This module can perform one or more of the following operations (in order):

1. Set threshold and ceiling values. Any expression value lower than the threshold is set to the threshold; any value higher than the ceiling is set to the ceiling.

2. Convert each expression value to the log base 2 of the value.
When using ratios to compare gene expression between samples, this transformation brings up- and down-regulated genes to the same scale. For example, ratios of 2 and 0.5, indicating two-fold changes for up- and down-regulated expression, respectively, become +1 and −1 (Quackenbush, 2002).

3. Remove genes (rows) if a given number of their sample values fall below a given threshold, which may be an indication of poor-quality data.

4. Remove genes (rows) that do not show a minimum fold change or expression variation. Genes with little variation across samples are unlikely to be biologically relevant to a comparative analysis.

5. Discretize or normalize the data. Discretization converts continuous data into a small number of finite values. Normalization adjusts gene expression values to remove systematic variation between microarray experiments. Both methods may be used to make sample data more comparable.

For illustration purposes, this protocol applies thresholds and variation filters (operations 1, 3, and 4 in the list above) to expression data, and Basic Protocols 4, 5, and 6 analyze the preprocessed data. In practice, the decision of whether to preprocess expression data depends on the data and the analyses being run. For example, a researcher should not preprocess the data if doing so removes genes of interest from the result set. Similarly, while researchers generally preprocess expression data before clustering, the data should not be preprocessed if doing so removes relevant biological information. For example, if clusters based on minimal differential gene expression are of biological interest, do not filter genes based on differential expression.
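The thresholding and variation-filter operations above can be sketched as follows (Python for illustration; the parameter names mirror the PreprocessDataset options, but the default values shown here are illustrative assumptions, not GenePattern's own):

```python
import numpy as np

def preprocess(data, threshold=20.0, ceiling=20000.0,
               minchange=3.0, mindelta=100.0, log_base_two=False):
    """Sketch of operations 1, 4, and (optionally) 2 described above.
    data: genes as rows, samples as columns."""
    X = np.clip(data, threshold, ceiling)      # op 1: floor and ceiling
    hi, lo = X.max(axis=1), X.min(axis=1)
    # op 4: require a minimum fold change AND a minimum absolute variation
    keep = (hi / lo >= minchange) & (hi - lo >= mindelta)
    X = X[keep]
    if log_base_two:                           # op 2: log2 transform
        X = np.log2(X)
    return X

data = np.array([[10.0, 12.0, 15.0],       # clipped flat at 20, filtered out
                 [100.0, 400.0, 1000.0],   # passes both variation filters
                 [50.0, 60.0, 55.0]])      # too little variation, filtered out
filtered = preprocess(data)
```

Only the second gene survives: its 10-fold change and 900-unit range pass both filters.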
Necessary Resources

Hardware
Computer running MS Windows, Mac OS X, or Linux

Software
GenePattern software, which is freely available at http://www.genepattern.org/, or a browser to access GenePattern online (the Support Protocol describes how to start GenePattern)
Modules used in this protocol: PreprocessDataset (version 3)

Files
The PreprocessDataset module requires gene expression data in a tab-delimited text file (GCT file format; Fig. 7.12.1) that contains a column for each sample and a row for each gene. Basic Protocol 1 describes how to convert various gene expression data into this file format. As an example, this protocol uses the ALL/AML leukemia training dataset (Golub et al., 1999), consisting of 38 bone marrow samples (27 ALL, 11 AML). Download the data file (all_aml_train.gct) from the GenePattern Web site at http://www.genepattern.org/datasets/.

1. Start PreprocessDataset: select it from the Modules & Pipelines list on the GenePattern start page (Fig. 7.12.3). The PreprocessDataset module is in the Preprocess & Utilities category. GenePattern displays the parameters for the PreprocessDataset module (Fig. 7.12.4). For information about the module and its parameters, click the Help link at the top of the form.

Figure 7.12.3 GenePattern Web Client start page. The Modules & Pipelines pane lists all modules installed on the GenePattern server. For illustration purposes, we installed only the modules used in this protocol. Typically, more modules are listed.

Figure 7.12.4 PreprocessDataset parameters.
Table 7.12.2 describes the PreprocessDataset parameters.
Table 7.12.2 Parameters for PreprocessDataset
input filename: Gene expression data (GCT or RES file format)
output file: Output file name (do not include file extension)
output file format: Select a file format for the output file
filter flag: Whether to apply thresholding (threshold and ceiling parameters) and variation filters (minchange, mindelta, num excl, and prob thres parameters) to the dataset
preprocessing flag: Whether to discretize the data (max sigma binning parameter), normalize the data, or both (by default, the module does neither)
minchange: Exclude rows that do not meet this minimum fold change, i.e., rows where maximum-value/minimum-value < minchange
mindelta: Exclude rows that do not meet this minimum variation, i.e., rows where maximum-value − minimum-value < mindelta
threshold: Reset values less than this threshold to the threshold value
ceiling: Reset values greater than this ceiling to the ceiling value (by default, the ceiling is 20,000)
max sigma binning: Used for discretization (preprocessing flag parameter), which converts expression values to discrete values based on standard deviations from the mean. Values less than one standard deviation from the mean are set to 1 (or −1), values one to two standard deviations from the mean are set to 2 (or −2), and so on. This parameter sets the upper (and lower) bound for the discrete values. By default, max sigma binning = 1, which sets expression values above the mean to 1 and expression values below the mean to −1.
prob thres: Use this probability threshold to apply variation filters (filter flag parameter) to a subset of the data. Specify a value between 0 and 1, where 1 (the default) applies variation filters to 100% of the dataset. We recommend that only advanced users modify this option.
num excl: Exclude this number of maximum (and minimum) values before selecting the maximum-value (and minimum-value) for minchange and mindelta. This prevents a gene that has "spikes" in its data from passing the variation filter.
log base two: Converts each expression value to the log base 2 of the value; any negative or 0 value is marked "NaN", indicating an invalid value.
number of columns above threshold: Removes underexpressed genes by removing rows that do not have at least a given number of entries (this parameter) above a given value (column threshold parameter).
column threshold: Removes underexpressed genes by removing rows that do not have at least a given number of entries (number of columns above threshold parameter) above a given value (this parameter).
2. For the "input filename" parameter, select gene expression data in the GCT file format. For example, use the Browse button to select all_aml_train.gct.
3. Review the remaining parameters to determine which values, if any, should be modified (see Table 7.12.2). For this example, use the default values.
4. Click Run to start the analysis.
GenePattern displays a status page. When the analysis completes, the status page lists the analysis result files: the all_aml_train.preprocessed.gct file contains the preprocessed gene expression data; the gp_task_execution_log.txt file lists the parameters used for the analysis.
5. Click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
BASIC PROTOCOL 4
DIFFERENTIAL ANALYSIS: IDENTIFYING DIFFERENTIALLY EXPRESSED GENES
This protocol focuses on differential expression analysis, where the aim is to identify genes (if any) that are differentially expressed between distinct classes or phenotypes. GenePattern uses the ComparativeMarkerSelection module for this purpose (Gould et al., 2006).
For each gene, the ComparativeMarkerSelection module uses a test statistic to calculate the difference in gene expression between the two classes and then estimates the significance (p-value) of the test statistic score. Because testing tens of thousands of genes simultaneously increases the possibility of mistakenly identifying a non-marker gene as a marker gene (a false positive), ComparativeMarkerSelection corrects for multiple hypothesis testing by computing both the false discovery rate (FDR) and the family-wise error rate (FWER). The FDR represents the expected proportion of non-marker genes (false positives) within the set of genes declared to be differentially expressed. The FWER represents the probability of having any false positives. It is in general stricter or more conservative than the FDR. Thus, the FWER may frequently fail to find marker genes due to the noisy nature of microarray data and the large number of hypotheses being tested. Researchers generally identify marker genes based on the FDR rather than the more conservative FWER. Measures such as FDR and FWER control for multiple hypothesis testing by “inflating” the nominal p-values of the single hypotheses (genes). This allows for controlling the number of false positives but at the cost of potentially increasing the number of false negatives (markers that are not identified as differentially expressed). We therefore recommend fully preprocessing the gene expression dataset as described in Basic Protocol 3 before running ComparativeMarkerSelection, to reduce the number of hypotheses (genes) to be tested. ComparativeMarkerSelection generates a structured text output file that includes the test statistic score, its p-value, two FDR statistics, and three FWER statistics for each gene. The ComparativeMarkerSelectionViewer module accepts this output file and displays the results interactively. 
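The two corrections named above are easy to state in code. The following is a generic sketch of the Bonferroni (FWER) and Benjamini-Hochberg (FDR) adjustments applied to a toy list of per-gene p-values; it is not GenePattern's implementation:

```python
# Sketch of two multiple-testing corrections on a toy p-value list:
# Bonferroni controls the FWER; Benjamini-Hochberg controls the FDR.

def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests (cap at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """FDR control: step-up adjustment of the rank-ordered p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.520]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Note how the BH-adjusted values stay well below the Bonferroni-adjusted ones for the middling p-values: exactly the "stricter FWER versus more permissive FDR" behavior described above.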
Use the viewer to sort and filter the results, retrieve gene annotations from various public databases, and create new gene expression data files from the original data. Optionally, use the HeatMapViewer module to generate a publication-quality heat map of the differentially expressed genes. Heat maps represent numeric values, such as intensity, as colors, making it easier to see patterns in the data.
Necessary Resources
Hardware: Computer running MS Windows, Mac OS X, or Linux
Software: GenePattern software, which is freely available at http://www.genepattern.org/, or browser to access GenePattern on line (the Support Protocol describes how to start GenePattern). Modules used in this protocol: ComparativeMarkerSelection (version 4), ComparativeMarkerSelectionViewer (version 4), and HeatMapViewer (version 8)
Files: The ComparativeMarkerSelection module requires two files as input: one for gene expression data and another that specifies the class of each sample. The classes usually represent phenotypes, such as tumor or normal. The expression data file is a tab-delimited text file (GCT file format, Fig. 7.12.1) that contains a column for each sample and a row for each gene. Classes are defined in another tab-delimited text file (CLS file format, Fig. 7.12.2). Basic Protocols 1 and 2 describe how to convert various gene expression data into these file formats. As an example, this protocol uses the ALL/AML leukemia training dataset (Golub et al., 1999), consisting of 38 bone marrow samples (27 ALL, 11 AML). Download the data files (all_aml_train.gct and all_aml_train.cls) from the GenePattern Web site at http://www.genepattern.org/datasets/. This protocol assumes that the expression data file, all_aml_train.gct, has been preprocessed according to Basic Protocol 3. The preprocessed expression data file, all_aml_train.preprocessed.gct, is used in this protocol.
Run ComparativeMarkerSelection analysis
1. Start ComparativeMarkerSelection by selecting it from the Modules & Pipelines list on the GenePattern start page (it is in the Gene List Selection category).
GenePattern displays the parameters for the ComparativeMarkerSelection module (Fig. 7.12.5). For information about the module and its parameters, click the Help link at the top of the form.
2. For the "input filename" parameter, select gene expression data in GCT file format. For example, to select the preprocessed data file, all_aml_train.preprocessed.gct: in the Recent Jobs list, locate the PreprocessDataset module and its all_aml_train.preprocessed.gct result file, click the icon next to the result file, and, from the menu that appears, select the Send to input filename command.
3. For the "cls filename" parameter, select a class descriptions file. This file should be in CLS format (see Basic Protocol 2). For example, use the Browse button to select the all_aml_train.cls file.
4. Review the remaining parameters to determine which values, if any, should be modified (see Table 7.12.3). For this example, use the default values.
Figure 7.12.5 ComparativeMarkerSelection parameters.
Table 7.12.3 describes the ComparativeMarkerSelection parameters.
Table 7.12.3 Parameters for the ComparativeMarkerSelection Analysis
input file: Gene expression data (GCT or RES file format)
cls file: Class file (CLS file format) that specifies the phenotype of each sample in the expression data
confounding variable cls filename: Class file (CLS file format) that specifies a second class, the confounding variable, for each sample in the expression data. Specify a confounding variable class file to have permutations shuffle the phenotype labels only within the subsets defined by that class file. For example, in Lu et al.
(2005), to select features that best distinguish tumors from normal samples across all tissue types, tissue type is treated as the confounding variable. In this case, the CLS file that defines the confounding variable lists each tissue type as a phenotype and associates each sample with its tissue type. Consequently, when ComparativeMarkerSelection performs permutations, it shuffles the tumor/normal labels only among samples with the same tissue type.
test direction: Determines how to measure differential expression. By default, ComparativeMarkerSelection performs a two-sided test: a differentially expressed gene might be up-regulated for either class. Alternatively, have ComparativeMarkerSelection perform a one-sided test: a differentially expressed gene is up-regulated for class 0 or up-regulated for class 1. A one-sided test is less reliable; therefore, if performing a one-sided test, also perform the two-sided test and consider both sets of results.
test statistic: Statistic to use for computing differential expression. t-test (the default) is the standardized mean difference in gene expression between the two classes:
(μa − μb) / sqrt(σa²/na + σb²/nb)
where μ is the sample mean, σ² is the variance, and n is the number of samples in each class. Signal-to-noise ratio is the ratio of the mean difference in gene expression to the sum of the standard deviations:
(μa − μb) / (σa + σb)
where μ is the sample mean and σ is the standard deviation. Either statistic can be modified by using median gene expression rather than mean, enforcing a minimum standard deviation, or both.
min std: When the selected test statistic computes differential expression using a minimum standard deviation, specify that minimum standard deviation.
number of permutations: Number of permutations used to estimate the p-value, which indicates the significance of the test statistic score for a gene.
If the dataset includes at least eight samples per phenotype, use the default value of 1000 permutations to estimate a p-value accurate to four significant digits. If the dataset includes fewer than eight samples in any class, a permutation test should not be used.
complete: Whether to perform all possible permutations. By default, complete is set to "no" and the number of permutations parameter determines the number of permutations performed. Because of the statistical considerations surrounding permutation tests on small numbers of samples, we recommend that only advanced users select this option.
balanced: Whether to perform balanced permutations. By default, balanced is set to "no" and phenotype labels are permuted without regard to the number of samples per phenotype (e.g., if the dataset has twenty samples in class 0 and ten samples in class 1, for each permutation the thirty labels are randomly assigned to the thirty samples). Set balanced to "yes" to permute phenotype labels after balancing the number of samples per phenotype (e.g., if the dataset has twenty samples in class 0 and ten in class 1, for each permutation ten samples are randomly selected from class 0 to balance the ten samples in class 1, and then the twenty labels are randomly assigned to the twenty samples). Balancing samples is important if samples are very unevenly distributed across classes.
random seed: The seed for the random number generator.
smooth p values: Whether to smooth p-values by using Laplace's Rule of Succession. By default, smooth p values is set to "yes", which means p-values are always <1.0 and >0.0.
phenotype test: Tests to perform when the class file (CLS file format) has more than two classes: "one versus all" or "all pairs".
The p-values obtained from the one-versus-all comparison are not fully corrected for multiple hypothesis testing.
output filename: Output file name
5. Click Run to start the analysis.
GenePattern displays a status page. When the analysis completes, the status page lists the analysis result files: the .odf file (all_aml_train.preprocessed.comp.marker.odf in this example) is a structured text file that contains the analysis results; the gp_task_execution_log.txt file lists the parameters used for the analysis.
6. Click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
The Recent Jobs list includes the ComparativeMarkerSelection module and its result files.
View analysis results using the ComparativeMarkerSelectionViewer
The analysis result file from ComparativeMarkerSelection includes the test statistic score, p-value, FDR, and FWER statistics for each gene. The ComparativeMarkerSelectionViewer module accepts this output file and displays the results in an interactive, graphical viewer to simplify review and interpretation of the data.
7. Start the ComparativeMarkerSelectionViewer by clicking the icon next to the ComparativeMarkerSelection analysis result file (in this example, all_aml_train.preprocessed.comp.marker.odf); from the menu that appears, select ComparativeMarkerSelectionViewer.
GenePattern displays the parameters for the ComparativeMarkerSelectionViewer module. Because the module was selected from the file menu, GenePattern automatically uses the analysis result file as the value for the first input file parameter.
8. For the "dataset filename" parameter, select the gene expression data file used for the ComparativeMarkerSelection analysis. For this example, select all_aml_train.preprocessed.gct.
In the Recent Jobs list, locate the PreprocessDataset module and its analysis result files; click the icon next to the all_aml_train.preprocessed.gct result file, and, from the menu that appears, select the Send to dataset filename command.
Figure 7.12.6 ComparativeMarkerSelection Viewer.
9. Click the Help link at the top of the form to display documentation for the ComparativeMarkerSelectionViewer.
10. Click Run to start the viewer.
GenePattern displays the ComparativeMarkerSelectionViewer (Fig. 7.12.6). In the upper pane of the visualizer, the Upregulated Features graph plots the genes in the dataset according to score, the value of the test statistic used to calculate differential expression. Genes with a positive score are more highly expressed in the first class; genes with a negative score are more highly expressed in the second class; genes with a score close to zero are not significantly differentially expressed. In the lower pane, a table lists the ComparativeMarkerSelection analysis results for each gene, including the name, description, test statistic score, p-value, and the FDR and FWER statistics. The FDR controls the fraction of false positives that one can tolerate, while the more conservative FWER controls the probability of having any false positives. As discussed in Gould et al. (2006), the ComparativeMarkerSelection module computes the FWER using three methods: the Bonferroni correction (the most conservative method), the maxT method of Westfall and Young (1993), and the empirical FWER. It computes the FDR using two methods: the BH procedure developed by Benjamini and Hochberg (1995) and the less conservative q-value method of Storey and Tibshirani (2003).
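To make the scoring and permutation machinery concrete, here is an illustrative sketch (not GenePattern's code; gene values and class labels are invented) that scores a single gene with the signal-to-noise ratio and estimates its significance with a label-permutation test:

```python
# Illustrative marker-selection sketch for one gene: score with the
# signal-to-noise ratio, then estimate a p-value by permuting class labels.

import random
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """(mu_a - mu_b) / (sigma_a + sigma_b) for a two-class comparison."""
    a = [v for v, c in zip(values, labels) if c == 0]
    b = [v for v, c in zip(values, labels) if c == 1]
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Two-sided p-value: fraction of label shuffles whose score is at
    least as extreme as the observed score."""
    rng = random.Random(seed)
    observed = abs(signal_to_noise(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(signal_to_noise(values, shuffled)) >= observed:
            hits += 1
    return hits / n_perm

expr = [9.1, 8.7, 9.4, 9.0, 2.1, 2.4, 1.9, 2.2]   # one gene, 8 samples
cls  = [0, 0, 0, 0, 1, 1, 1, 1]                   # 4 samples per phenotype
print(signal_to_noise(expr, cls), permutation_p_value(expr, cls, n_perm=200))
```

In a real analysis this loop runs once per gene, which is exactly why the number of permutations bounds the achievable p-value precision and why the resulting p-values then need the FDR/FWER corrections discussed above.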
Apply a filter to view the differentially expressed genes
Due to the noisy nature of microarray data and the large number of hypotheses tested, the FWER often fails to identify any genes as significantly differentially expressed; therefore, researchers generally identify marker genes based on the false discovery rate (FDR). For this example, marker genes are identified based on an FDR cutoff value of 0.05. An FDR cutoff of 0.05 means that, among the genes declared to be markers, roughly 1 in 20 (5%) is expected to be a false positive.
In the ComparativeMarkerSelectionViewer, apply a filter with the criterion FDR <= 0.05 to view the marker genes. To further analyze those genes, create a new derived dataset that contains only the marker genes.
11. Select Edit>Filter Features>Custom Filter; the Filter Features dialog window appears. Specify a filter criterion by selecting a column from the drop-down list and entering the allowed values for that column. To add a second filter criterion, click Add Filter. After entering all of the criteria, click OK to apply the filter.
12. Enter the filter criterion FDR(BH) >= 0 and <= 0.05, and click OK to apply the filter.
This example identifies marker genes based on the FDR values computed using the more conservative BH procedure developed by Benjamini and Hochberg (1995). When the filter is applied, the ComparativeMarkerSelectionViewer updates the display to show only those genes that have an FDR(BH) value ≤0.05. Notice that the Upregulated Features graph now shows only genes identified as marker genes.
13. Review the filtered results.
In the ALL/AML leukemia dataset, >500 genes are identified as marker genes based on the FDR cutoff value of 0.05. Depending on the question being addressed, it might be helpful to explore only a subset of those genes.
For example, one way to select a subset would be to choose the most highly differentially expressed genes, as discussed below.
Create a derived dataset of the top 100 genes
By default, the ComparativeMarkerSelectionViewer sorts genes by differential expression based on the value of their test statistic scores. Genes in the first rows have the highest scores and are more highly expressed in the first class, ALL; genes in the last rows have the lowest scores and are more highly expressed in the second class, AML. To create a derived dataset of the top 100 genes, select the first 50 genes (rows 1 through 50) and the last 50 genes (rows 536 through 585).
14. Select the top 50 genes: Shift-click a value in row 1 and Shift-click a value in row 50.
15. Select the bottom 50 genes: Ctrl-click a value in row 585 and Ctrl-Shift-click a value in row 536. On the Macintosh, use the Command (cloverleaf) key instead of Ctrl.
16. Select File>Save Derived Dataset. The Save Derived Dataset window appears.
17. Select the Use Selected Features radio button.
Selecting Use Selected Features creates a dataset that contains only the selected genes. Selecting the Use Current Features radio button would create a dataset that contains the genes that meet the filter criteria. Selecting Use All Features would create a dataset that contains all of the genes in the dataset, essentially a copy of the existing dataset.
18. Click the Browse button to select a directory and specify the name of the file to hold the new dataset.
A Save dialog window appears. Navigate to the directory that will hold the new expression dataset file, enter a name for the file, and click Save. The Save dialog window closes and the name of the new dataset appears in the Save Derived Dataset window. For this example, use the file name all_aml_train_top100.gct. Note that the viewer uses the file extension of the specified file name to determine the format of the new file.
Thus, to create a GCT file, the file name must include the .gct file extension.
19. Click Create to create the dataset file and close the Save Derived Dataset window.
20. Select File>Exit to close the ComparativeMarkerSelectionViewer.
21. In the GenePattern Web Client, click Modules & Pipelines to return to the GenePattern start page.
View the new dataset in the HeatMapViewer
Use the HeatMapViewer (Fig. 7.12.7) to create a heat map of the differentially expressed genes. The heat map displays the highest expression values as red cells, the lowest expression values as blue cells, and intermediate values in shades of pink and blue.
22. Start the HeatMapViewer by selecting it from the Modules & Pipelines list on the GenePattern start page (it is in the Visualizer category).
GenePattern displays the parameters for the HeatMapViewer.
23. For the "input filename" parameter, use the Browse button to select the gene expression dataset file created in steps 16 through 19.
24. Click Run to open the HeatMapViewer.
In the HeatMapViewer, the columns are samples and the rows are genes. Each cell represents the expression level of a gene in a sample. Visual inspection of the heat map (Fig. 7.12.7) shows how well these top-ranked genes differentiate between the classes.
Figure 7.12.7 Heat map for the top 100 differentially expressed genes.
To save the heat map image for use in a publication, select File>Save Image. The HeatMapViewer supports several image formats, including bmp, eps, jpeg, png, and tiff.
25. Select File>Exit to close the HeatMapViewer.
26. Click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
BASIC PROTOCOL 5
CLASS DISCOVERY: CLUSTERING METHODS
One of the challenges in analyzing microarray expression data is the sheer volume of information: the expression levels of tens of thousands of genes for tens or hundreds of samples. Class discovery aims to produce a high-level overview of the data by creating groups based on shared patterns. Clustering, one method of class discovery, reduces the complexity of microarray data by grouping genes or samples based on their expression profiles (Slonim, 2002). GenePattern provides several clustering methods (described in Table 7.12.4).
In this protocol, the HierarchicalClustering module is first used to cluster the samples and genes in the ALL/AML training dataset. Then the HierarchicalClusteringViewer module is used to examine the results and identify two large clusters (groups) of samples, which correspond to the ALL and AML phenotypes.
Table 7.12.4 Clustering Methods
HierarchicalClustering: Hierarchical clustering recursively merges items with other items or with the result of previous merges. Items are merged according to their pairwise distance, with the closest pairs being merged first. The result is a tree structure, referred to as a dendrogram. To view clustering results, use the HierarchicalClusteringViewer.
KMeansClustering: K-means clustering (MacQueen, 1967) groups elements into a specified number (k) of clusters. A center data point for each cluster is randomly selected and each data point is assigned to the nearest cluster center. Each cluster center is then recalculated to be the mean value of its members, and all data points are reassigned to the cluster with the closest cluster center. This process is repeated until the distance between consecutive cluster centers converges. The result is k stable clusters. Each cluster is a subset of the original gene expression data (GCT file format) and can be viewed using the HeatMapViewer.
SOMClustering: The self-organizing map algorithm (SOM; Tamayo et al., 1999) creates and iteratively adjusts a two-dimensional grid to reflect the global structure of the expression dataset. The result is a set of clusters organized in a two-dimensional grid, where similar clusters lie near each other and provide an "executive summary" of the dataset. To view clustering results, use the SOMClusterViewer.
NMFConsensus: Non-negative matrix factorization (NMF; Brunet et al., 2004) is an alternative method for class discovery that factors the expression data matrix. NMF extracts features that may more accurately correspond to biological processes.
ConsensusClustering: Consensus clustering (Monti et al., 2003) is a means of determining an optimal number of clusters. It runs a selected clustering algorithm and assesses the stability of the discovered clusters. The resulting consensus matrix is formatted as a GCT file (with the content being the matrix rather than gene expression data) and can be viewed using the HeatMapViewer.
Necessary Resources
Hardware: Computer running MS Windows, Mac OS X, or Linux
Software: GenePattern software, which is freely available at http://www.genepattern.org/, or browser to access GenePattern on line (the Support Protocol describes how to start GenePattern). Modules used in this protocol: HierarchicalClustering (version 3) and HierarchicalClusteringViewer (version 8)
Files: The HierarchicalClustering module requires gene expression data in a tab-delimited text file (GCT file format, Fig. 7.12.1) that contains a column for each sample and a row for each gene. Basic Protocol 1 describes how to convert various gene expression data into this file format. As an example, this protocol uses the ALL/AML leukemia training dataset (Golub et al., 1999), consisting of 38 bone marrow samples (27 ALL, 11 AML).
Table 7.12.5 Parameters for the HierarchicalClustering Analysis (parameter: setting used in this example; description)
input filename: all_aml_train.preprocessed.gct. Gene expression data (GCT or RES file format).
column distance measure: Pearson Correlation (the default). Method for computing the distance (similarity measure) between values when clustering samples. Pearson correlation, the default, determines similarity/dissimilarity between the shapes of the genes' expression profiles. For discussion of the different distance measures, see Wit and McClure (2004).
row distance measure: Pearson Correlation (the default). Method for computing the distance (similarity measure) between values when clustering genes.
clustering method: Pairwise-complete linkage (the default). Method for measuring the distance between clusters. Pairwise-complete linkage, the default, measures the distance between clusters as the maximum of all pairwise distances. For a discussion of the different clustering methods, see Wit and McClure (2004).
log transform: No (the default). Transforms each expression value by taking the log base 2 of its value. If the dataset contains absolute intensity values, using the log transform helps to ensure that differences between expression values (fold changes) have the same meaning across the full range of expression values (Wit and McClure, 2004).
row center: Subtract the mean of each row. Method for centering row data. When clustering genes, Getz et al. (2006) recommend centering the data by subtracting the mean of each row.
row normalize: Yes. Whether to normalize row data. When clustering genes, Getz et al. (2006) recommend normalizing the row data.
column center: Subtract the mean of each column. Method for centering column data. When clustering samples, Getz et al. (2006) recommend centering the data by subtracting the mean of each column.
output base name: <input.filename basename> (the default). Output file name.
column normalize: Yes. Whether to normalize column data. When clustering samples, Getz et al.
(2006) recommend normalizing the column data.
Download the data file (all_aml_train.gct) from the GenePattern Web site at http://genepattern.org/datasets/. This protocol assumes the expression data file, all_aml_train.gct, has been preprocessed according to Basic Protocol 3. The preprocessed expression data file, all_aml_train.preprocessed.gct, is used in this protocol.
Run the HierarchicalClustering analysis
1. Start HierarchicalClustering by looking in the Recent Jobs list and locating the PreprocessDataset module and its all_aml_train.preprocessed.gct result file; click the icon next to the result file; and, from the menu that appears, select HierarchicalClustering.
GenePattern displays the parameters for the HierarchicalClustering analysis. Because the module was selected from the file menu, GenePattern automatically uses the analysis result file as the value for the "input filename" parameter. For information about the module and its parameters, click the Help link at the top of the form. Note that a module can be started from the Modules & Pipelines list, as shown in the previous protocol, or from the Recent Jobs list, as shown in this protocol.
2. Use the remaining parameters to define the desired clustering analysis (see Table 7.12.5).
Clustering genes groups genes with similar expression patterns, which may indicate coregulation or membership in a shared biological process. Clustering samples groups samples with similar gene expression patterns, which may indicate a shared biological state or phenotypic subtype among the clustered samples. Clustering both genes and samples may be useful for identifying genes that are coexpressed in a phenotypic context, or for finding alternative sample classifications. For this example, use the parameter settings shown in Table 7.12.5 to cluster both genes (rows) and samples (columns). Figure 7.12.8 shows the HierarchicalClustering parameters set to these values.
Figure 7.12.8 HierarchicalClustering parameters.
Table 7.12.5 describes the HierarchicalClustering parameters.
3. Click Run to start the analysis.
GenePattern displays a status page. When the analysis is complete (3 to 4 min), the status page lists the analysis result files: the Clustered Data Table (.cdt) file contains the original data ordered to reflect the clustering; the Array Tree (.atr) file contains the dendrogram for the clustered columns (samples); the Gene Tree (.gtr) file contains the dendrogram for the clustered rows (genes); and the gp_task_execution_log.txt file lists the parameters used for the analysis.
4. Click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
The Recent Jobs list includes the HierarchicalClustering module and its result files.
View analysis results using the HierarchicalClusteringViewer
The HierarchicalClusteringViewer provides an interactive, graphical viewer for displaying the analysis results. For a graphical summary of the results, save the content of the viewer to an image file.
Figure 7.12.9 HierarchicalClustering Viewer.
5. Start the HierarchicalClusteringViewer by looking in the Recent Jobs list and clicking the icon next to a HierarchicalClustering result file (all_aml_train.preprocessed .atr, .cdt, or .gtr); from the menu that appears, select HierarchicalClusteringViewer.
GenePattern displays the parameters for the HierarchicalClusteringViewer. Because the module was selected from the file menu, GenePattern automatically uses the analysis result files as the values for the input file parameters.
6. Click Run to start the viewer.
GenePattern displays the HierarchicalClusteringViewer (Fig. 7.12.9).
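To make the clustering defaults concrete, here is a toy, pure-Python sketch of pairwise-complete linkage with 1 − Pearson correlation as the distance. Sample names and expression values are invented; a real analysis would rely on the HierarchicalClustering module itself or on a library such as SciPy:

```python
# Toy agglomerative clustering: distance = 1 - Pearson correlation,
# cluster-to-cluster distance = maximum pairwise distance (complete linkage).

from statistics import mean

def pearson_distance(x, y):
    """1 - Pearson correlation: small for similarly shaped profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return 1.0 - num / den

def complete_linkage(profiles):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [[name] for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(pearson_distance(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((sorted(merged), round(d, 3)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

samples = {
    "ALL_1": [5, 6, 7, 8], "ALL_2": [5, 6, 8, 9],   # similar rising profiles
    "AML_1": [9, 7, 5, 3], "AML_2": [8, 7, 4, 2],   # similar falling profiles
}
for members, dist in complete_linkage(samples):
    print(members, dist)
```

The merge order itself is the dendrogram: the two ALL-like samples join first, then the two AML-like samples, and the two groups join last at a much larger distance, which is the pattern to look for in Figure 7.12.9.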
Visual inspection of the dendrogram shows the hierarchical clustering of the AML and ALL samples.
7. Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
BASIC PROTOCOL 6
CLASS PREDICTION: CLASSIFICATION METHODS
This protocol focuses on the class prediction analysis of a microarray experiment, where the aim is to build a class predictor: a subset of key marker genes whose transcription profiles will correctly classify samples. A typical class prediction method "learns" how to distinguish between members of different classes by "training" itself on samples whose classes are already known. Using the known data, the method creates a model (also known as a classifier or class predictor), which can then be used to predict the class of a previously unseen sample. GenePattern provides several class prediction methods (described in Table 7.12.6).
For most class prediction methods, GenePattern provides two approaches for training and testing class predictors: train/test and cross-validation. Both approaches begin with an expression dataset that has known classes. In the train/test approach, the predictor is first trained on one dataset (the training set) and then tested on an independent dataset (the test set). Cross-validation is often used to set the parameters of a predictor, or to evaluate a predictor when no independent test set is available. It repeatedly leaves one sample out, builds the predictor using the remaining samples, and then tests the predictor on the sample that was left out; the accuracy of the predictor is determined by averaging the results over all iterations. GenePattern provides pairs of modules for most class prediction methods: one for train/test and one for cross-validation.
This protocol applies the k-nearest neighbors (KNN) class prediction method to the ALL/AML data.
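The leave-one-out scheme just described fits in a few lines of code. The sketch below is an illustrative stand-in for KNNXValidation, not the module itself: a plain NumPy KNN is trained on all samples but one, tested on the held-out sample, and the results are averaged; repeating this for several values of k mimics using cross-validation to choose a parameter. The data are synthetic.

```python
# Leave-one-out cross-validation (LOOCV) with a minimal KNN classifier.
import numpy as np

def knn_predict(X_train, y_train, x, k):
    # Majority vote among the k nearest training samples (Euclidean distance).
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return np.bincount(y_train[nearest]).argmax()

def loocv_accuracy(X, y, k):
    correct = 0
    for i in range(len(y)):                      # leave sample i out
        mask = np.arange(len(y)) != i
        correct += int(knn_predict(X[mask], y[mask], X[i], k) == y[i])
    return correct / len(y)                      # averaged over all iterations

# Two synthetic classes of 10 samples each, 5 features per sample.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(3, 1, (10, 5))])
y = np.array([0] * 10 + [1] * 10)

# Use cross-validation to pick k, as the text describes.
best_k = max((1, 3, 5), key=lambda k: loocv_accuracy(X, y, k))
```

The chosen `best_k` would then be used to build the final predictor on the full training set and evaluate it on an independent test set, exactly as the train/test approach does.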
First introduced by Fix and Hodges in 1951, KNN is one of the simplest classification methods and is often recommended when there is little or no prior knowledge about the distribution of the data (Cover and Hart, 1967). The KNN method stores the training instances and uses a distance function to determine which k members of the training set are closest to an unknown test instance. Once the k nearest training instances have been found, their class assignments are used to predict the class of the test instance by a majority vote.
GenePattern provides a pair of modules for the KNN class prediction method: one for the train/test approach and one for the cross-validation approach. Both modules use the same input parameters (Table 7.12.7). This protocol first uses the cross-validation approach (KNNXValidation module) and a training dataset to determine the best parameter settings for the KNN prediction method. It then uses the train/test KNN module with the best parameters identified by the KNNXValidation module to build a classifier on the training dataset and to test that classifier on a test dataset.
Table 7.12.6 Class Prediction Methods
CART: CART (Breiman et al., 1984) builds classification and regression trees for predicting categorical dependent variables (classification) and continuous dependent variables (regression). It works by recursively splitting the feature space into a set of non-overlapping regions and then predicting the most likely value of the dependent variable within each region. A classification tree represents a set of nested logical if-then conditions on the values of the feature variables that allows prediction of the value of a categorical dependent variable based on the observed values of the feature variables.
A regression tree is similar, but instead allows prediction of the value of a continuous dependent variable.
KNN: k-nearest neighbors (KNN) classifies an unknown sample by assigning it the phenotype label most frequently represented among the k nearest known samples (Cover and Hart, 1967). In GenePattern, the user selects a weighting factor for the "votes" of the nearest neighbors: unweighted (all votes are equal); weighted by the reciprocal of the rank of the neighbor's distance (the closest neighbor is given weight 1/1, the next closest 1/2, and so on); or weighted by the reciprocal of the neighbor's distance.
PNN: Probabilistic neural network (PNN) calculates the probability that an unknown sample belongs to each of a given set of known phenotype classes (Specht, 1990; Lu et al., 2005). The contribution of each known sample to the predicted class of the unknown sample follows a Gaussian distribution. PNN can be viewed as a Gaussian-weighted KNN classifier: known samples close to the unknown sample have a greater influence on its predicted class.
SVM: Support vector machines (SVM; Vapnik, 1998) are applied here to multiclass classification. The algorithm creates a binary SVM classifier for each class by computing a maximal-margin hyperplane that separates the given class from all other classes; that is, the hyperplane with maximal distance to the nearest data point. The binary classifiers are then combined into a multiclass classifier. An unknown sample is assigned to the class whose classifier yields the largest margin.
Weighted Voting: Weighted Voting (Slonim et al., 2000) classifies an unknown sample using a simple weighted voting scheme. Each gene in the classifier "votes" for the phenotype class of the unknown sample. A gene's vote is weighted by how closely its expression correlates with the differentiation between phenotype classes in the training dataset.
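The three vote-weighting options described in the KNN entry above can be made concrete. The following is an illustrative NumPy sketch on toy two-dimensional points, not GenePattern's implementation:

```python
# KNN with the three vote-weighting schemes: none, one-over-k, distance.
import numpy as np

def knn_vote(X_train, y_train, x, k=3, weighting="none"):
    d = np.linalg.norm(X_train - x, axis=1)
    order = np.argsort(d)[:k]                    # k nearest training samples
    if weighting == "none":
        w = np.ones(k)                           # every vote counts equally
    elif weighting == "one-over-k":
        w = 1.0 / np.arange(1, k + 1)            # 1/1, 1/2, ... by distance rank
    elif weighting == "distance":
        w = 1.0 / np.maximum(d[order], 1e-12)    # reciprocal of the distance
    classes = np.unique(y_train)
    scores = [w[y_train[order] == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

# Toy training set: two class-0 points near the origin, two class-1 points near (2, 2).
Xtr = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]])
ytr = np.array([0, 0, 1, 1])
pred = knn_vote(Xtr, ytr, np.array([0.2, 0.1]), k=3, weighting="distance")
```

With distance weighting, the two nearby class-0 neighbors dominate the single, far-away class-1 neighbor, so the query point near the origin is assigned class 0.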
Basic Protocol 3 describes how to preprocess the training dataset to remove platform noise and genes that have little variation. Preprocessing the test dataset could leave it with a different set of genes than the training dataset; therefore, do not preprocess the test dataset.
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org/, or a browser to access GenePattern online (the Support Protocol describes how to start GenePattern)
Table 7.12.7 Parameters for k-Nearest Neighbors Prediction Modules
num features: Number of features (genes or probes) to use in the classifier. For KNN, choose the number of features or use the "feature list filename" parameter to specify which features to use. For KNNXValidation, the algorithm chooses the feature list for each leave-one-out cycle.
feature selection statistic: Statistic to use for computing differential expression. The genes most differentially expressed between the classes are used in the classifier to predict the phenotype of unknown samples. For a description of the statistics, see the "test statistic" parameter in Table 7.12.3.
min std: When the selected feature selection statistic computes differential expression using a minimum standard deviation, specify that minimum standard deviation.
num neighbors: Number (k) of nearest neighbors to consult when classifying a sample.
weighting type: Weight to give the "votes" of the k neighbors. None: gives each vote the same weight. One-over-k: weights each vote by the reciprocal of the rank of the neighbor's distance; that is, the closest neighbor is given weight 1/1, the next closest 1/2, and so on. Distance: weights each vote by the reciprocal of the neighbor's distance.
distance measure: Method for computing the distance (dissimilarity measure) between neighbors (Wit and McClure, 2004).
Modules used in this protocol: KNNXValidation (version 5), PredictionResultsViewer (version 4), FeatureSummaryViewer (version 3), and KNN (version 3)
Files
Class prediction requires two input files: one for the gene expression data and another that specifies the class of each sample. The classes usually represent phenotypes, such as tumor or normal. The expression data file is a tab-delimited text file (GCT file format; Fig. 7.12.1) that contains a column for each sample and a row for each gene. Classes are defined in another tab-delimited text file (CLS file format; Fig. 7.12.2). Basic Protocols 1 and 2 describe how to convert various gene expression data into these file formats.
As an example, this protocol uses two ALL/AML leukemia datasets (Golub et al., 1999): a training set consisting of 38 bone marrow samples (all_aml_train.gct, all_aml_train.cls) and a test set consisting of 35 bone marrow and peripheral blood samples (all_aml_test.gct, all_aml_test.cls). Download the data files from the GenePattern Web site at http://genepattern.org/datasets/. This protocol assumes the training set all_aml_train.gct has been preprocessed according to Basic Protocol 3. The preprocessed expression data file, all_aml_train.preprocessed.gct, is used in this protocol.
Run the KNNXValidation analysis
The KNNXValidation module builds and tests multiple classifiers, one for each iteration of the leave-one-out train-and-test cycle. The module generates two result files. The feature result file (*.feat.odf) lists every gene used in any classifier and the number of times each gene was used. The prediction result file (*.pred.odf) averages the accuracy and error rates over all classifiers. Use the FeatureSummaryViewer module to display the feature result file and the PredictionResultsViewer to display the prediction result file.
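For readers handling these files programmatically, both formats can be parsed with a few lines of Python. The sketch below follows the layouts described above (GCT: a version line, a dimensions line, then a header with Name and Description columns; CLS: a counts line, a "#"-prefixed class-name line, and a label line). It is an informal reader written for illustration, not GenePattern code; check real files against the file-format documentation on the GenePattern site.

```python
# Minimal readers for the GCT and CLS formats described in the text.
import io

def read_gct(fh):
    assert fh.readline().startswith("#1.2")           # version line
    n_rows, n_cols = map(int, fh.readline().split())  # dimensions line
    samples = fh.readline().rstrip("\n").split("\t")[2:]  # skip Name, Description
    names, data = [], []
    for line in fh:
        parts = line.rstrip("\n").split("\t")
        names.append(parts[0])
        data.append([float(v) for v in parts[2:]])
    assert len(data) == n_rows and len(samples) == n_cols
    return names, samples, data

def read_cls(fh):
    n_samples, n_classes, _ = fh.readline().split()   # e.g., "38 2 1"
    class_names = fh.readline().split()[1:]           # line starts with '#'
    labels = fh.readline().split()                    # one label per sample
    return class_names, labels

# Tiny in-memory example standing in for a real GCT file.
gct = io.StringIO("#1.2\n2\t2\nName\tDescription\tS1\tS2\n"
                  "g1\tna\t1.0\t2.0\ng2\tna\t3.0\t4.0\n")
names, samples, data = read_gct(gct)
```

A matching CLS file for two samples in two classes would read, for example, `2 2 1`, then `# ALL AML`, then `0 1`.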
Figure 7.12.10 KNNXValidation parameters. Table 7.12.7 describes the parameters for the k-nearest neighbors (KNN) class prediction method.
1. Start KNNXValidation by selecting it from the Modules & Pipelines list on the GenePattern start page (it is in the Prediction category).
GenePattern displays the parameters for the KNNXValidation analysis (Fig. 7.12.10). For information about the module and its parameters, click the Help link at the top of the form.
2. For the "data filename" parameter, select gene expression data in the GCT file format.
For this example, select the preprocessed data file, all_aml_train.preprocessed.gct: in the Recent Jobs list, locate the PreprocessDataset module and its all_aml_train.preprocessed.gct result file; click the icon next to the result file; and from the menu that appears, select the Send to data filename command.
3. For the "class filename" parameter, select the class data (CLS file format) file.
For this example, use the Browse button to select the all_aml_train.cls file.
4. Review the remaining parameters to determine which values, if any, should be modified (see Table 7.12.7). For this example, use the default values.
5. Click Run to start the analysis.
GenePattern displays a status page. When the analysis is complete, the status page lists the analysis result files: the feature result file (*.feat.odf) lists the genes used in the classifiers, and the prediction result file (*.pred.odf) averages the accuracy and error rates over all of the classifiers. Both result files are structured text files.
View KNNXValidation analysis results
GenePattern provides interactive, graphical viewers to simplify review and interpretation of the result files. To view the prediction results (*.pred.odf file), use the PredictionResultsViewer. To view the feature result file (*.feat.odf file), use the FeatureSummaryViewer.
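The accuracy and error rate summarized in the prediction result file are simple ratios. A toy illustration (hypothetical labels, with the error rate defined as the number of misclassified cases over the total):

```python
# Accuracy and error rate as reported in a prediction result file.
true_labels = ["ALL", "ALL", "AML", "AML", "ALL"]
predicted   = ["ALL", "AML", "AML", "AML", "ALL"]

errors = sum(t != p for t, p in zip(true_labels, predicted))
error_rate = errors / len(true_labels)   # fraction misclassified
accuracy = 1 - error_rate                # fraction correctly classified
```

In cross-validation, these quantities are computed per left-out sample and averaged over all iterations.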
6. Start the PredictionResultsViewer by looking in the Recent Jobs list, clicking the icon next to the prediction result file, all_aml_train.preprocessed.pred.odf; and from the menu that appears, selecting PredictionResultsViewer.
GenePattern displays the parameters for the PredictionResultsViewer. Because the module was selected from the file menu, GenePattern automatically uses the analysis result file as the value for the input file parameter.
7. Click Run to start the viewer.
GenePattern displays the PredictionResultsViewer (Fig. 7.12.11). In this example, all samples in the dataset were correctly classified.
8. Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
Figure 7.12.11 PredictionResultsViewer. Each point represents a sample, with color indicating the predicted class. The absolute confidence value indicates the probability that the sample belongs to the predicted class.
9. Start the FeatureSummaryViewer by looking in the Recent Jobs list, clicking the icon next to the feature result file, all_aml_train.preprocessed.feat.odf; and from the menu that appears, selecting FeatureSummaryViewer.
GenePattern displays the parameters for the FeatureSummaryViewer. Because the module was selected from the file menu, GenePattern automatically uses the analysis result file as the value for the input file parameter.
10. Click Run to start the viewer.
GenePattern displays the FeatureSummaryViewer (Fig. 7.12.12). The viewer lists each gene used in any classifier created by any iteration and shows how many of the classifiers included that gene. Generally, the most interesting genes are those used by all (or most) of the classifiers.
11.
Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
Figure 7.12.12 FeatureSummaryViewer.
In this example, the default parameter values for the k-nearest neighbors (KNN) class prediction method create class predictors that successfully predict the class of unknown samples. In practice, however, the researcher runs the KNNXValidation module several times with different parameter values (e.g., "num features" values of 10, 20, and 30) to find the most effective parameter values for the KNN method.
Run the KNN analysis
After using the cross-validation approach (KNNXValidation module) to determine which parameter settings provide the best results, use the KNN module with those parameters to build a model using the training dataset and test it using an independent test dataset. The KNN module generates two result files: the model file (*.model.odf) describes the predictor, and the prediction result file (*.pred.odf) shows the accuracy and error rate of the predictor. Use a text editor to display the model file and the PredictionResultsViewer to display the prediction result file.
12. Start KNN by selecting it from the Modules & Pipelines list on the GenePattern start page (it is in the Prediction category).
GenePattern displays the parameters for the KNN analysis (Fig. 7.12.13). For information about the module and its parameters, click the Help link at the top of the form.
13. For the "train filename" and "test filename" parameters, select gene expression data in the GCT file format.
For this example, select all_aml_train.preprocessed.gct as the input file for the "train filename" parameter.
In the Recent Jobs list, locate the PreprocessDataset module and its all_aml_train.preprocessed.gct result file; click the icon next to the result file; and from the menu that appears, select the Send to train filename command. Next, use the Browse button to select all_aml_test.gct as the input file for the "test filename" parameter.
Figure 7.12.13 KNN parameters. Table 7.12.7 describes the parameters for the k-nearest neighbors (KNN) class prediction method.
14. For the "train class filename" and "test class filename" parameters, select the class data (CLS file format) for each expression data file.
For this example, use the Browse button to select all_aml_train.cls as the input file for the "train class filename" parameter. Similarly, select all_aml_test.cls as the input file for the "test class filename" parameter.
15. Review the remaining parameters to determine which values, if any, should be modified (see Table 7.12.7). For this example, use the default values.
16. Click Run to start the analysis.
GenePattern displays a status page. When the analysis is complete, the status page lists the analysis result files: the model file (*.model.odf) contains the classifier (or model) created from the training dataset, and the prediction result file (*.pred.odf) shows the accuracy and error rate of the classifier when it was run against the test data. Both result files are structured text files.
17. Click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
The Recent Jobs list includes the KNN module and its result files.
View KNN analysis results
GenePattern provides interactive, graphical viewers to simplify review and interpretation of the result files. To view the prediction results (*.pred.odf file), use the PredictionResultsViewer. To view the model file (*.model.odf), simply use a text editor.
18.
Display the model file (all_aml_train.preprocessed.model.odf): in the Recent Jobs list, click the model file.
GenePattern displays the model file in the browser. The classifier uses the genes in this model to predict the class of unknown samples. Retrieving annotations for these genes might provide insight into the underlying biology of the phenotype classes.
19. Click the Back button in the Web browser to return to the GenePattern start page.
20. Start the PredictionResultsViewer by looking in the Recent Jobs list and clicking the icon next to the prediction result file, all_aml_test.pred.odf; from the menu that appears, select PredictionResultsViewer.
GenePattern displays the parameters for the PredictionResultsViewer. Because the module was selected from the file menu, GenePattern automatically uses the analysis result file as the value for the input file parameter.
21. Click Run to start the viewer.
GenePattern displays the PredictionResultsViewer (similar to the one shown in Fig. 7.12.11). The classifier created by the KNN algorithm correctly predicts the class of 32 of the 35 samples in the test dataset. The classifier created by the Weighted Voting algorithm (Golub et al., 1999) correctly predicted the class of all samples in the test dataset. The error rate (the number of cases incorrectly classified divided by the total number of cases) is useful for comparing results when experimenting with different prediction methods.
22. Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
PIPELINES: REPRODUCIBLE ANALYSIS METHODS
Gene expression analysis is an iterative process. The researcher runs multiple analysis methods to explore the underlying biology of the gene expression data.
Often, an analysis must be repeated several times with different parameters to gain a deeper understanding of the analysis and its results. Without careful attention to detail, analyses and their results can be difficult to reproduce, and consequently difficult to share.
BASIC PROTOCOL 7
GenePattern records every analysis it runs, including the input files and parameter values that were used and the output files that were generated. This ensures that analysis results are always reproducible. GenePattern also makes it possible to click on an analysis result file to build a pipeline that contains the modules and parameter settings used to generate that file. Running the pipeline reproduces the analysis result file. In addition, one can easily modify the pipeline to run variations of the analysis protocol, share the pipeline with colleagues, or use the pipeline to describe an analysis methodology in a publication.
This protocol describes how to create a pipeline from an analysis result file, edit the pipeline, and run it. As an example, a pipeline is created based on the class prediction results from Basic Protocol 6.
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org/, or a browser to access GenePattern online (the Support Protocol describes how to start GenePattern)
Modules used in this protocol: PreprocessDataset (version 3), KNN (version 3), and PredictionResultsViewer (version 4)
Files
Input files for a pipeline depend on the modules called; for example, the input file for the PreprocessDataset module is a gene expression data file.
Create a pipeline from a result file
Creating a pipeline from a result file captures the analysis strategy used to generate the analysis results.
To create the pipeline, GenePattern records the modules used to generate the result file, including their input files and parameter values. Tracing the chain of modules back to the initial input files, GenePattern builds a pipeline that records the sequence of steps used to generate the result file. For this example, create a pipeline from the prediction result file, all_aml_test.pred.odf, generated by the KNN module in Basic Protocol 6.
1. Create the pipeline by looking in the Recent Jobs list, locating the KNN module and its all_aml_test.pred.odf result file, and clicking the icon next to the result file; from the menu that appears, select Create Pipeline.
GenePattern creates the pipeline that reproduces the result file and displays it in a form-based editor (Fig. 7.12.14). The pipeline includes the KNN analysis, its input files, and its parameter settings. The input file for the "train filename" parameter, all_aml_train.preprocessed.gct, is a result file from a previous PreprocessDataset analysis; therefore, the pipeline includes a PreprocessDataset step to generate the all_aml_train.preprocessed.gct file.
Figure 7.12.14 Create Pipeline for the KNN classification analysis. The Pipeline Designer form defines the steps that will replicate the KNN classification analysis. Click the arrow icon next to a step to collapse or expand that step. When the form opens, all steps are expanded; this figure shows the first step collapsed.
2. Scroll to the top of the form and edit the pipeline name.
Because the pipeline was created from an analysis result file, the default name of the pipeline is the job number of that analysis. Change the pipeline name to make it easier to find. For this example, change the pipeline name to KNNClassificationPipeline. (Pipeline names cannot include spaces or special characters.)
Add the PredictionResultsViewer to the pipeline
The PredictionResultsViewer module displays the KNN prediction results. Use the following steps to add this visualization module to the pipeline.
3. Scroll to the bottom of the form.
4. In the last step of the pipeline, click the Add Another Module button.
5. From the Category drop-down list, select Visualizer.
6. From the Modules list, select PredictionResultsViewer.
7. Rather than selecting a prediction result filename, use the prediction result file generated by the KNN analysis. Notice that GenePattern has selected this automatically: next to Use Output From, GenePattern has selected 2. KNN and Prediction Results.
8. Click Save to save the pipeline.
GenePattern displays a status page confirming pipeline creation.
9. Click the Continue to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
The pipeline appears in the Modules & Pipelines list in the Pipeline category.
Run the pipeline
GenePattern automatically selects the new pipeline as the next module to be run.
10. Click Run to run the pipeline.
GenePattern runs each module in the pipeline, preprocessing the all_aml_train.gct file, running the KNN class prediction analysis, and then displaying the prediction results.
11. Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start link at the bottom of the status page to return to the GenePattern start page.
USING THE GenePattern DESKTOP CLIENT
GenePattern provides two point-and-click graphical user interfaces (clients) for accessing the GenePattern server: the Web Client and the Desktop Client. The Web Client is automatically installed with the GenePattern server; the Desktop Client is installed separately.
Most GenePattern features are available from both clients; however, only the Desktop Client provides the following ease-of-use features: adding project directories for easy access to dataset files, running an analysis on every file in a directory by specifying that directory as an input parameter, and filtering the lists of modules and pipelines displayed in the interface.
ALTERNATE PROTOCOL 1
This protocol introduces the Desktop Client by running the PreprocessDataset and HeatMapViewer modules. The aim is not to discuss the analyses, but simply to demonstrate the Desktop Client interface.
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org. Installing the Desktop Client is optional. If it is not installed with the GenePattern software, the Desktop Client can be installed at any time from the GenePattern Web Client: click Downloads>Install Desktop Client and follow the on-screen instructions.
Modules used in this protocol: PreprocessDataset (version 3) and HeatMapViewer (version 8)
Files
The PreprocessDataset module requires gene expression data in a tab-delimited text file (GCT file format; Fig. 7.12.1) that contains a column for each sample and a row for each gene. Basic Protocol 1 describes how to convert various gene expression data into this file format. As an example, this protocol uses an ALL/AML leukemia dataset (Golub et al., 1999) consisting of 38 bone marrow samples (27 ALL, 11 AML). Download the data file (all_aml_train.gct) from the GenePattern Web site at http://genepattern.org/datasets/.
Start the GenePattern server
The GenePattern server must be started before the Desktop Client. Use the following steps to start a local GenePattern server. Alternatively, use the public GenePattern server hosted at http://genepattern.broad.mit.edu/gp/.
For more information, refer to the GenePattern Tutorial (http://www.genepattern.org/tutorial/gp_tutorial.html) or the GenePattern Desktop Client Guide (http://www.genepattern.org/tutorial/gp_java_client.html).
1. Double-click the Start GenePattern Server icon (the GenePattern installation places the icon on the desktop).
On Windows, while the server is starting, the cursor displays an hourglass. On Mac OS X, while the server is starting, the server icon bounces in the Dock.
Start the Desktop Client
2. Double-click the GenePattern Desktop Client icon (the GenePattern installation places the icon on the desktop).
The Desktop Client connects to the GenePattern server, retrieves the list of available modules, builds its menus, and displays a welcome message. The Projects pane provides access to selected project directories (directories that hold the genomic data to be analyzed). The Results pane lists analysis jobs run by the current GenePattern user.
Open a project directory
3. To open a project directory, select File>Open Project Directory.
GenePattern displays the Choose a Project Directory window.
4. Navigate to the directory that contains the data files and click Select Directory.
For example, select the directory that contains the example data file, all_aml_train.gct. GenePattern adds the directory to the Projects pane.
5. In the Projects pane, double-click the directory name to display the files in the directory.
Run an analysis
6. To start an analysis, select it from the Analysis menu. For example, select Analysis>Preprocess & Utilities>PreprocessDataset.
GenePattern displays the parameters for the PreprocessDataset module.
7. For the "input filename" parameter, select gene expression data in the GCT file format. For example, drag-and-drop the all_aml_train.gct file from the Projects pane to the "input filename" parameter box.
8.
Review the remaining parameters to determine which values, if any, should be modified (see Table 7.12.2). For this example, use the default values.
9. Click Run to start the analysis.
GenePattern displays the analysis in the Results pane with a status of Processing. When the analysis is complete, the output files are added to the Results pane and a dialog box appears showing the completed job. Close the dialog box. In the Results pane, double-click the name of the analysis to display the result files. This example generates two result files: all_aml_train.preprocessed.gct, which is the new, preprocessed gene expression data file, and gp_task_execution_log.txt, which lists the parameters used for the analysis.
Run an analysis from a result file
Research is an iterative process, and the input file for an analysis is often the output file of a previous analysis. GenePattern makes this easy. As an example, the following steps use the gene expression file created by the PreprocessDataset module (all_aml_train.preprocessed.gct) as the input file for the HeatMapViewer module, which displays the expression data graphically.
10. To start the analysis, in the Results pane, right-click the result file and, from the menu that appears, select the Modules submenu and then the name of the module to run. For example, in the Results pane, right-click the result file from the PreprocessDataset analysis, all_aml_train.preprocessed.gct. From the menu that appears, select Modules>HeatMapViewer.
GenePattern displays the parameters for the HeatMapViewer. Because the module was selected from the file menu, GenePattern automatically uses the analysis result file as the value of the first input filename parameter.
11. Click Run to start the viewer.
The first time a viewer runs on the desktop, a security warning message may appear. Click Run to continue.
GenePattern opens the HeatMapViewer.
12. Close the HeatMapViewer by selecting File>Exit.
Notice that the HeatMapViewer does not appear in the Results pane. The Results pane lists the analyses run on the GenePattern server. Visualizers, unlike analysis modules, run on the client rather than the server; therefore, they do not appear in the Results pane.
USING THE GenePattern PROGRAMMING ENVIRONMENT
ALTERNATE PROTOCOL 2
GenePattern libraries for the Java, MATLAB, and R programming environments allow applications to run GenePattern modules and retrieve analysis results. Each library supports arbitrary scripting and access to GenePattern modules via function calls, as well as development of new methodologies that combine modules in arbitrarily complex ways. Download the libraries from the GenePattern Web Client by clicking Downloads>Programming Libraries.
For more information about accessing GenePattern from a programming environment, see the GenePattern Programmer's Guide at http://www.genepattern.org/tutorial/gp_programmer.html.
SETTING USER PREFERENCES FOR THE GenePattern WEB CLIENT
GenePattern provides two point-and-click graphical user interfaces (clients) for accessing the GenePattern server: the Web Client and the Desktop Client. The Web Client is automatically installed with the GenePattern server. Most GenePattern features are available from both clients; however, only the Web Client provides access to GenePattern administrative features, such as configuring the GenePattern server and installing modules from the GenePattern repository.
SUPPORT PROTOCOL

Necessary Resources

Hardware
Computer running MS Windows, Mac OS X, or Linux

Software
GenePattern software, which is freely available at http://www.genepattern.org/, or a browser to access GenePattern online

Files
Input files for the Web Client depend on the module called

Analyzing Expression Patterns 7.12.31 Current Protocols in Bioinformatics Supplement 22

Table 7.12.8 GenePattern Account Settings

Change Email: Change the e-mail address for your GenePattern account on this server.
Change Password: Change the password for your GenePattern account on this server; by default, GenePattern servers are installed without password protection.
History: Specify the number of recent analyses listed in the Recent Jobs pane on the Web Client start page.
Visualizer Memory: Specify the Java virtual machine configuration parameters (such as VM memory settings) to be used when running visualization modules; by default, this option is used to specify the amount of memory to allocate when running visualization modules (-Xmx512M).

Start the GenePattern server

The GenePattern server must be started before the Web Client. Use the following steps to start a local GenePattern server. Alternatively, use the public GenePattern server hosted at http://genepattern.broad.mit.edu/gp/. For more information, refer to the GenePattern Tutorial (http://www.genepattern.org/tutorial/gp_tutorial.html) or the GenePattern Web Client Guide (http://www.genepattern.org/tutorial/gp_web_client.html).

1. Double-click the Start GenePattern Server icon (GenePattern installation places the icon on the desktop).

On Windows, while the server is starting, the cursor displays an hourglass. On Mac OS X, while the server is starting, the server icon bounces in the Dock.

Start the Web Client

2. Double-click the GenePattern Web Client icon (GenePattern installation places the icon on the desktop).

GenePattern displays the Web Client start page (Fig. 7.12.3).
Modules & Pipelines, at the left of the start page, lists all available analyses. By default, analyses are organized by category. Use the radio buttons at the top of the Modules & Pipelines list to organize analyses by suite or to list them alphabetically. A suite is a user-defined collection of pipelines and/or modules. Suites can be used to organize pipelines and modules in GenePattern in much the same way "playlists" can be used to organize an online music collection. Recent Jobs, at the right of the start page, lists analysis jobs recently run by the current GenePattern user.

Set personal preferences

3. Click My Settings (top right corner) to display your GenePattern account settings. Table 7.12.8 lists the available settings.

4. Click History to modify the number of jobs displayed in the Recent Jobs list.

The Recent Jobs list provides easy access to analysis result files. Increasing the number of jobs simplifies access to the files used in the basic protocols.

5. Increase the value (e.g., enter 10) and click Save.

6. Click the GenePattern icon in the title bar to return to the start page.

GUIDELINES FOR UNDERSTANDING RESULTS

This unit describes how to use GenePattern to analyze the results of a transcription profiling experiment done with DNA microarrays. Typically, such results are represented as a gene-by-sample table, with a measurement of intensity for each gene element on the array for each biological sample assayed in the microarray experiment. Analysis of microarray data relies on the fundamental assumption that "the measured intensities for each arrayed gene represent its relative expression level" (Quackenbush, 2002). Depending on the specific objectives of a microarray experiment, analysis can include some or all of the following steps: data preprocessing and normalization, differential expression analysis, class discovery, and class prediction.
Preprocessing and normalization form the first critical step of microarray data analysis. Their purpose is to eliminate missing and low-quality measurements and to adjust the intensities to facilitate comparisons across samples.

Differential expression analysis is the next standard step and refers to the process of identifying marker genes: genes that are expressed differently between distinct classes of samples. GenePattern identifies marker genes using the following procedure. For each gene, it first calculates a test statistic to measure the difference in gene expression between two classes of samples, and then estimates the significance (p-value) of this statistic. With thousands of genes assayed in a typical microarray experiment, standard significance thresholds can lead to a substantial number of false positives. This is referred to as the multiple hypothesis testing problem and is addressed by adjusting the p-values accordingly. GenePattern provides several methods for such adjustments, as discussed in Basic Protocol 4.

The objective of class discovery is to reduce the complexity of microarray data by grouping genes or samples based on the similarity of their expression profiles. The general assumptions are that genes with similar expression profiles participate in a common biological process and that samples with similar expression profiles reflect a similar cellular state. For class discovery, GenePattern provides a variety of clustering methods (Table 7.12.4), as well as principal component analysis (PCA). The method of choice depends on the data, personal preference, and the specific question being addressed (D'haeseleer, 2005). Typically, researchers use a variety of class discovery techniques and then compare the results.

The aim of class prediction is to determine the membership of unlabeled samples in known classes based on their expression profiles.
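As a concrete illustration of the marker-selection procedure just described, the sketch below computes a per-gene Welch t statistic and applies the Benjamini-Hochberg false discovery rate adjustment (Benjamini and Hochberg, 1995) to a list of p-values. This is a minimal Python sketch for intuition only, not GenePattern's implementation: the ComparativeMarkerSelection module supports several test statistics and estimates significance by permutation, which this sketch does not reproduce.

```python
import math

def welch_t(xs, ys):
    """Two-sample Welch t statistic for one gene's expression
    values in two classes of samples (e.g., tumor vs. normal)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((v - mx) ** 2 for v in xs) / (nx - 1)
    vy = sum((v - my) ** 2 for v in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR adjustment of a list of p-values.
    Walks the sorted p-values from largest to smallest, enforcing
    monotonicity of the adjusted values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end           # 1-based rank of this p-value
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted
```

For example, `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` inflates the small raw p-values toward their FDR-controlled counterparts, exactly the kind of correction Basic Protocol 4 applies across thousands of genes at once.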
The assumption is that the expression profile of a reasonable number of differentially expressed marker genes represents a molecular "signature" that captures the essential features of a particular class or phenotype. As discussed in Golub et al. (1999), such a signature could form the basis of a valuable diagnostic or prognostic tool in a clinical setting. For gene expression analysis, determining whether such a gene expression signature exists can help refine or validate putative classes defined during class discovery. In addition, a deeper understanding of the genes included in the signature may provide new insights into the biology of the phenotype classes. GenePattern provides several class prediction methods (Table 7.12.6). As with class discovery, it is generally a good idea to try several different class prediction methods and to compare the results.

COMMENTARY

Background Information

Analysis of microarray data is an iterative process that starts with data preprocessing and then cycles between computational analysis, hypothesis generation, and further analysis to validate and/or refine hypotheses. The GenePattern software package and its repository of analysis and visualization modules support this iterative workflow. Two graphical user interfaces, the Web Client and the Desktop Client, and a programming environment give users at any level of computational skill easy access to the diverse collection of analysis and visualization methods in the GenePattern module repository. By packaging methods as individual modules, GenePattern facilitates the rapid integration of new techniques and the growth of the module repository. In addition, researchers can easily integrate external tools into GenePattern by using a simple form-based interface to create modules from any computational tool that can be run from the command line.
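One of the simplest class prediction methods discussed above, k-nearest neighbors (the KNN module of Table 7.12.6), can be sketched in a few lines: an unlabeled sample is assigned to the majority class among its k closest training profiles over the marker genes. This is an illustrative sketch under our own naming, not GenePattern's implementation, which additionally supports options such as distance weighting.

```python
from collections import Counter
import math

def knn_predict(train_profiles, train_labels, sample, k=3):
    """Assign `sample` to the majority class among its k nearest
    training profiles (Euclidean distance over the marker genes)."""
    ranked = sorted(
        (math.dist(profile, sample), label)
        for profile, label in zip(train_profiles, train_labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For instance, with two "ALL" and two "AML" training profiles over two hypothetical marker genes, `knn_predict([[0, 0], [0, 1], [5, 5], [6, 5]], ["ALL", "ALL", "AML", "AML"], [5, 4])` votes 2-to-1 for "AML".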
Modules are easily combined into workflows by creating GenePattern pipelines through a form-based interface or automatically from a result file. Using pipelines, researchers can reproduce and share analysis strategies. By providing a simple user interface and a diverse collection of computational methods, GenePattern encourages researchers to run multiple analyses, compare results, generate hypotheses, and validate/revise those hypotheses in a naturally iterative process. Running multiple analyses often provides a richer understanding of the data; however, without careful attention to detail, critical results can be difficult to reproduce or to share with colleagues. To address this issue, GenePattern provides extensive support for reproducible research. It preserves each version of each module and pipeline; records each analysis that is run, including its input files and parameter values; provides a method of building a pipeline from an analysis result file, which captures the steps required to generate that file; and allows pipelines to be exported to files and shared with colleagues.

Critical Parameters

Gene expression data files

GenePattern accepts expression data in tab-delimited text files (the GCT file format) that contain a column for each sample, a row for each gene, and an expression measurement for each gene in each sample. As discussed in Basic Protocol 1, how the expression data is acquired determines the best way to translate it into the GCT file format. GenePattern provides modules to convert expression data from Affymetrix CEL files, to convert MAGE-ML format data, and to extract data from the GEO or caArray microarray expression data repositories. Expression data stored in other formats can be converted into a tab-delimited text file that contains expression measurements with genes as rows and samples as columns, formatted to comply with the GCT file format.
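The gene-by-sample layout described above is simple enough to generate with a short script when converting data from another format. The sketch below writes and re-reads a minimal GCT file (a "#1.2" version line, a row/column count line, a header line, then one row per gene); the helper names are ours, and the Description column is filled with a placeholder ("na") for brevity. Consult the GenePattern file formats guide for the authoritative specification.

```python
def write_gct(path, genes, samples, values):
    """Write a minimal GCT (version 1.2) expression file:
    #1.2 / <n_rows> <n_cols> / header / one tab-delimited row per gene."""
    with open(path, "w") as f:
        f.write("#1.2\n")
        f.write(f"{len(genes)}\t{len(samples)}\n")
        f.write("Name\tDescription\t" + "\t".join(samples) + "\n")
        for gene, row in zip(genes, values):
            f.write(gene + "\tna\t" + "\t".join(str(v) for v in row) + "\n")

def read_gct(path):
    """Parse a GCT file back into (genes, samples, values)."""
    with open(path) as f:
        assert f.readline().startswith("#1.2")
        f.readline()                        # row/column count line
        samples = f.readline().rstrip("\n").split("\t")[2:]
        genes, values = [], []
        for line in f:
            fields = line.rstrip("\n").split("\t")
            genes.append(fields[0])
            values.append([float(v) for v in fields[2:]])
    return genes, samples, values
```

A round trip (`write_gct` followed by `read_gct`) recovers the same gene names, sample names, and expression values, which is a quick sanity check when converting data from another format.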
When working with cDNA microarray data, do not blindly accept the default values provided for the GenePattern modules; most default values are optimized for Affymetrix data. Many GenePattern analysis modules do not allow missing values, which are common in cDNA two-color ratio data. One way to address this issue is to remove the genes with missing values. An alternative approach is to use the ImputeMissingValues.KNN module to impute missing values, assigning gene expression values based on the nearest neighbors of the gene.

Class files

A class file is a tab-delimited text file (the CLS format) that provides class information for each sample. Typically, classes represent phenotypes, such as tumor or normal. Basic Protocol 2 describes how to create class files. Microarray experiments often include technical replicates. Analyze the replicates as separate samples, or combine them by averaging or another data reduction technique. For example, if an experiment includes five tumor samples and five control samples, each run three times (three replicate columns) for a total of 30 data columns, one might combine the three replicate columns for each sample (by averaging or some other data reduction technique) to create a dataset containing 10 data columns (five tumor and five control).

Analysis methods

Table 7.12.9 lists the GenePattern modules as of this writing; new modules are continuously released. For a current list of modules and their documentation, see the Modules page on the GenePattern Web site at http://www.genepattern.org. Categories group the modules by function and are a convenient way of finding or reviewing available modules. To ensure reproducibility of analysis results, each module is given a version number. When modules are updated, both the old and new versions remain in the module repository.
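The neighbor-based imputation idea behind ImputeMissingValues.KNN can be sketched as follows: for each gene with missing measurements, find the k most similar genes that have complete profiles, and fill each gap with the neighbors' average at that sample. This is a simplified illustration under assumed conventions (missing values encoded as None, neighbors drawn only from complete rows, plain averaging); the actual module's behavior is governed by its own documented parameters.

```python
import math

def impute_knn(matrix, k=2):
    """Fill missing values (None) in a gene x sample matrix.
    For each gene with gaps, average the k most similar complete genes,
    where similarity is Euclidean distance over the observed columns."""
    complete = [r for r in matrix if None not in r]
    result = []
    for row in matrix:
        if None not in row:
            result.append(list(row))
            continue
        def dist(other):
            pairs = [(a, b) for a, b in zip(row, other) if a is not None]
            return math.sqrt(sum((a - b) ** 2 for a, b in pairs))
        neighbors = sorted(complete, key=dist)[:k]
        result.append([
            v if v is not None else sum(n[j] for n in neighbors) / len(neighbors)
            for j, v in enumerate(row)
        ])
    return result
```

On a toy matrix where a gene measured as (1.5, missing, 1.5) sits between complete genes (1, 1, 1) and (2, 2, 2), the gap is filled with their average, 1.5.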
If a protocol in this unit does not work as documented, compare the version number in the protocol with the version number installed on the GenePattern server used to execute the protocol. If the server has a different version of a module, click Modules & Pipelines>Install from Repository to install the desired version of the module from the module repository.

Table 7.12.9 GenePattern Modules (as of April 18, 2008)

Annotation
GeneCruiser: Retrieve gene annotations for Affy probe IDs

Clustering
ConsensusClustering: Resampling-based clustering method
HierarchicalClustering: Hierarchical clustering
KMeansClustering: k-means clustering
NMFConsensus: Non-negative matrix factorization (NMF) consensus clustering
SOMClustering: Self-organizing maps algorithm
SubMap: Maps subclasses between two datasets

Gene list selection
ClassNeighbors: Select genes that most closely resemble a profile
ComparativeMarkerSelection: Computes significance values for features using several metrics
ExtractComparativeMarkerResults: Creates a dataset and feature list from ComparativeMarkerSelection output
GSEA: Gene set enrichment analysis
GeneNeighbors: Select the neighbors of a given gene according to similarity of their profiles
SelectFeaturesColumns: Takes a "column slice" from a .res, .gct, .odf, or .cls file
SelectFeaturesRows: Takes a "row slice" from a .res, .gct, or .odf file

Image creators
HeatMapImage: Creates a heat map graphic from a dataset
HierarchicalClusteringImage: Creates a dendrogram graphic from a dataset

Missing value imputation
ImputeMissingValues.KNN: Impute missing values using a k-nearest neighbor algorithm

Pathway analysis
ARACNE: Runs the ARACNE algorithm
MINDY: Runs the MINDY algorithm for inferring genes that modulate the activity of a transcription factor at post-transcriptional levels

Pipeline
Golub.Slonim.1999.Science.all.aml: ALL/AML methodology, from Golub et al. (1999)
Lu.Getz.Miska.Nature.June.2005.PDT.mRNA: Probabilistic neural network prediction using mRNA, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.PDT.miRNA: Probabilistic neural network prediction using miRNA, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.clustering.ALL: Hierarchical clustering of ALL samples with genetic alterations, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.clustering.ep.mRNA: Hierarchical clustering of 89 epithelial samples in mRNA space, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.clustering.ep.miRNA: Hierarchical clustering of 89 epithelial samples in miRNA space, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.clustering.miGCM218: Hierarchical clustering of 218 samples from various tissue types, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.mouse.lung: Normal/tumor classifier and KNN prediction of mouse lung samples, from Lu et al. (2005)

Prediction
CART: Classification and regression tree classification
CARTXValidation: Classification and regression tree classification with leave-one-out cross-validation
KNN: k-nearest neighbors classification
KNNXValidation: k-nearest neighbors classification with leave-one-out cross-validation
PNN: Probabilistic neural network (PNN)
PNNXValidationOptimization: PNN leave-one-out cross-validation optimization
SVM: Classifies samples using the support vector machines (SVM) algorithm
WeightedVoting: Weighted voting classification
WeightedVotingXValidation: Weighted voting classification with leave-one-out cross-validation

Preprocess and utilities
ConvertLineEndings: Converts line endings to the host operating system's format
ConvertToMAGEML: Converts a gct, res, or odf dataset file to a MAGE-ML file
DownloadURL: Downloads a file from a URL
ExpressionFileCreator: Creates a res or gct file from a set of Affymetrix CEL files
ExtractColumnNames: Lists the sample descriptors from a .res file
ExtractRowNames: Extracts the row names from a .res, .gct, or .odf file
GEOImporter: Imports data from the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo)
MapChipFeaturesGeneral: Maps the features of a dataset to user-specified values
MergeColumns: Merges datasets by column
MergeRows: Merges datasets by row
MultiplotPreprocess: Creates derived data from an expression dataset for use in the Multiplot and Multiplot Extractor visualizer modules
PreprocessDataset: Preprocessing options on a res, gct, or Dataset input file
ReorderByClass: Reorders the samples in an expression dataset and class file by class
SplitDatasetTrainTest: Splits a dataset (and cls files) into train and test subsets
TransposeDataset: Transposes a dataset (.gct, .odf)
UniquifyLabels: Makes row and column labels unique

Projection
NMF: Non-negative matrix factorization
PCA: Principal component analysis

Proteomics
AreaChange: Calculates the fraction of area under the spectrum that is attributable to signal
CompareSpectra: Compares two spectra to determine similarity
LandmarkMatch: A proteomics method to propagate identified peptides across multiple MS runs
LocatePeaks: Locates detected peaks in a spectrum
mzXMLToCSV: Converts an mzXML file to a zip of csv files
PeakMatch: Performs peak matching on LC-MS data
Peaks: Determines peaks in the spectrum using a series of digital filters
PlotPeaks: Plots peaks identified by PeakMatch
ProteoArray: LC-MS proteomic data processing module
ProteomicsAnalysis: Runs the proteomics analysis on the set of input spectra

Sequence analysis
GlobalAlignment: Smith-Waterman sequence alignment

SNP analysis
CopyNumberDivideByNormals: Divides tumor samples by normal samples to create a raw copy number value
GLAD: Runs the GLAD R package
LOHPaired: Computes LOH for paired samples
SNPFileCreator: Processes Affymetrix SNP probe-level data into an expression value
SNPFileSorter: Sorts a .snp file by chromosome and location
SNPMultipleSampleAnalysis: Determines regions of concordant copy number aberrations
XChromosomeCorrect: Corrects X chromosome SNPs for male samples

Statistical methods
KSscore: Kolmogorov-Smirnov score for a set of genes within an ordered list

Survival analysis
SurvivalCurve: Draws a survival curve based on a phenotype or class (.cls) file
SurvivalDifference: Tests for survival difference based on a phenotype or class (.cls) file

Visualizer
caArrayImportViewer: A visualizer to import data from caArray into GenePattern
ComparativeMarkerSelectionViewer: View the results from ComparativeMarkerSelection
CytoscapeViewer: View a gene network using Cytoscape (http://cytoscape.org)
FeatureSummaryViewer: View a summary of features from prediction
GeneListSignificanceViewer: View the results of marker analysis
GSEALeadingEdgeViewer: Leading edge viewer for GSEA results
HeatMapViewer: Display a heat map view of a dataset
HierarchicalClusteringViewer: View results of hierarchical clustering
JavaTreeView: Hierarchical clustering viewer that reads Eisen's cdt, atr, and gtr files
MAGEMLImportViewer: A visualizer to import data in MAGE-ML format into GenePattern
Multiplot: Creates two-parameter scatter plots from the output file of the MultiplotPreprocess module
MultiplotExtractor: Provides a user interface for saving the data created by the MultiplotPreprocess module
PCAViewer: Visualize principal component analysis results
PredictionResultsViewer: Visualize prediction results
SnpViewer: Displays a heat map of SNP data
SOMClusterViewer: Visualize clusters created with the SOM algorithm
VennDiagram: Displays a Venn diagram

Analysis result files

GenePattern is a client-server application. All modules are stored on the GenePattern server. A user interacts with the server through the GenePattern Web Client, Desktop Client, or a programming environment. When the user runs an analysis module, the GenePattern client sends a message to the server, which runs the analysis. When the analysis is complete, the user can review the analysis result files, which are stored on the GenePattern server. The term "job" refers to an analysis run on the server. The term "job results" refers to the analysis result files.

Analysis result files are typically formatted text files. GenePattern provides corresponding visualization modules to display the analysis results in a concise and meaningful way. Visualization tools provide support for exploring the underlying biology. Visualization modules run on the GenePattern client, not the server, and do not generate analysis result files.

Most GenePattern modules include an output file parameter, which provides a default name for the analysis result file. On the GenePattern server, the output files for an analysis are placed in a directory associated with its job number. The default file name can be reused because the server creates a new directory for each job. However, changing the file name to distinguish between different iterations of the same analysis is recommended.
For example, HierarchicalClustering can be run using several different clustering methods (complete-linkage, single-linkage, centroid-linkage, or average-linkage). Including the method name in the output file name makes it easier to compare the results of the different methods. By default, the output file name for HierarchicalClustering is <input.filename_basename>, which indicates that the module will use the input file name as the output file name. Alternative output file names might be <input.filename_basename>.complete, <input.filename_basename>.centroid, <input.filename_basename>.average, or <input.filename_basename>.single.

By default, the GenePattern server stores analysis result files for 7 days. After that time, they are automatically deleted from the server. To save an analysis result file, download the file from the GenePattern server to a local directory. In the Web Client, to save an analysis result file, click the icon next to the file and select Save. To save all result files for an analysis, click the icon next to the analysis and select Download. In the Desktop Client, in the Results pane, click the analysis result file and select Results>Save To.

Suggestions for Further Analysis

Table 7.12.9 lists the modules available in GenePattern as of this writing; new modules are continuously being released. The GenePattern Web site, http://www.genepattern.org, provides a current list of modules. To install the latest versions of all modules, from the GenePattern Web Client, select Modules>Install from Repository. When using GenePattern regularly, check the repository each month for new and updated modules.

Literature Cited

Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57:289-300.

Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. 1984. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, Calif.

Brunet, J., Tamayo, P., Golub, T.R., and Mesirov, J.P. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. U.S.A. 101:4164-4169.

Cover, T.M. and Hart, P.E. 1967. Nearest neighbor pattern classification. IEEE Trans. Info. Theory 13:21-27.

D'haeseleer, P. 2005. How does gene expression clustering work? Nat. Biotechnol. 23:1499-1501.

Getz, G., Monti, S., and Reich, M. 2006. Workshop: Analysis Methods for Microarray Data. October 18-20, 2006. Cambridge, MA.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression. Science 286:531-537.

Gould, J., Getz, G., Monti, S., Reich, M., and Mesirov, J.P. 2006. Comparative gene marker selection suite. Bioinformatics 22:1924-1925.

Lu, J., Getz, G., Miska, E.A., Alvarez-Saavedra, E., Lamb, J., Peck, D., Sweet-Cordero, A., Ebert, B.L., Mak, R.H., Ferrando, A.A., Downing, J.R., Jacks, T., Horvitz, H.R., and Golub, T.R. 2005. MicroRNA expression profiles classify human cancers. Nature 435:834-838.

MacQueen, J.B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1 (L. Le Cam and J. Neyman, eds.) pp. 281-297. University of California Press, Berkeley, California.

Monti, S., Tamayo, P., Mesirov, J.P., and Golub, T. 2003. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52:91-118 (Functional Genomics Special Issue).

Quackenbush, J. 2002. Microarray data normalization and transformation. Nat. Genet. 32:496-501.
Slonim, D.K. 2002. From patterns to pathways: Gene expression data analysis comes of age. Nat. Genet. 32:502-508.

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., and Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB) (R. Shamir, S. Miyano, S. Istrail, P. Pevzner, and M. Waterman, eds.) pp. 263-272. ACM Press, New York.

Specht, D.F. 1990. Probabilistic neural networks. Neural Netw. 3:109-118.

Storey, J.D. and Tibshirani, R. 2003. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 100:9440-9445.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 96:2907-2912.

Vapnik, V. 1998. Statistical Learning Theory. John Wiley & Sons, New York.

Westfall, P.H. and Young, S.S. 1993. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment (Wiley Series in Probability and Statistics). John Wiley & Sons, New York.

Wit, E. and McClure, J. 2004. Statistics for Microarrays. John Wiley & Sons, West Sussex, England.

Zeeberg, B.R., Riss, J., Kane, D.W., Bussey, K.J., Uchio, E., Linehan, W.M., Barrett, J.C., and Weinstein, J.N. 2004. Mistaken identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5:80.

Key References

Reich, M., Liefeld, T., Gould, J., Lerner, J., Tamayo, P., and Mesirov, J.P. 2006. GenePattern 2.0. Nat. Genet. 38:500-501.

Overview of GenePattern 2.0, including comparison with other tools.

Wit and McClure, 2004. See above.

Describes setting up a microarray experiment and analyzing the results.
Internet Resources

http://www.genepattern.org
Download GenePattern software and view GenePattern documentation.

http://www.genepattern.org/tutorial/gp_concepts.html
GenePattern concepts guide.

http://www.genepattern.org/tutorial/gp_web_client.html
GenePattern Web Client guide.

http://www.genepattern.org/tutorial/gp_java_client.html
GenePattern Desktop Client guide.

http://www.genepattern.org/tutorial/gp_programmer.html
GenePattern Programmer's guide.

http://www.genepattern.org/tutorial/gp_fileformats.html
GenePattern file formats.

UNIT 7.13
Data Storage and Analysis in ArrayExpress and Expression Profiler

Gabriella Rustici,1 Misha Kapushesky,1 Nikolay Kolesnikov,1 Helen Parkinson,1 Ugis Sarkans,1 and Alvis Brazma1
1European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom

ABSTRACT

ArrayExpress at the European Bioinformatics Institute is a public database for MIAME-compliant microarray and transcriptomics data. It consists of two parts: the ArrayExpress Repository, which is a public archive of microarray data, and the ArrayExpress Warehouse of Gene Expression Profiles, which contains additionally curated subsets of data from the Repository. Archived experiments can be queried by experimental attributes, such as keywords, species, array platform, publication details, or accession numbers. Gene expression profiles can be queried by gene names and properties, such as Gene Ontology terms, allowing visualization of expression profiles. The data can be exported and analyzed using the online data analysis tool named Expression Profiler. Data analysis components, such as data preprocessing, filtering, differentially expressed gene finding, clustering methods, and ordination-based techniques, as well as other statistical tools, are all available in Expression Profiler via integration with the statistical package R.
Curr. Protoc. Bioinform. 23:7.13.1-7.13.27. © 2008 by John Wiley & Sons, Inc.

Keywords: gene expression; microarrays; transcriptomics; public repository; data analysis

INTRODUCTION

The ArrayExpress (AE) resource consists of two databases: (1) the AE Repository of Microarray and Transcriptomics Data, which archives well-annotated microarray data typically supporting journal publications, and (2) the AE Warehouse of Gene Expression Profiles, which contains additionally curated subsets of data from the Repository and enables the user to query gene expression profiles by gene names, properties, and profile similarity (Brazma et al., 2003). In addition to the two databases, the resource includes an online data analysis tool named Expression Profiler (EP; Kapushesky et al., 2004), which allows the exploration, mining, analysis, and visualization of data exported from AE, as well as of datasets uploaded from any other source, such as user-generated data. Further, the AE resource includes the MIAMExpress and Tab2MAGE tools for data submission to the Repository.

AE supports standards and recommendations developed by the Microarray Gene Expression Data (MGED) society, including the Minimum Information About a Microarray Experiment (MIAME; Brazma et al., 2001) and a spreadsheet-based data exchange format, MAGE-TAB (Rayner et al., 2006). AE is one of three international databases recommended by the MGED society (Ball et al., 2004) for storing MIAME-compliant microarray data related to publications (the other two being Gene Expression Omnibus and CIBEX; Edgar et al., 2002; Ikeo et al., 2003).

As of January 2008, the AE Repository holds data from ∼100,000 microarray hybridizations, from over 3300 separate studies (experiments) related to over 200 different species. The amount of data in the Repository tends to double every 14 months.
Most of the data relates to transcription profiling experiments, although the proportion of array-based Comparative Genomics Hybridization (CGH) and chromatin immunoprecipitation (ChIP-on-chip) experiments is growing. The Repository also holds prepublication, password-protected data, mostly for reviewing purposes.

Current Protocols in Bioinformatics 7.13.1-7.13.27, September 2008. Published online September 2008 in Wiley Interscience (www.interscience.wiley.com). DOI: 10.1002/0471250953.bi0713s23. Copyright © 2008 John Wiley & Sons, Inc.

The Repository allows the user to browse or query the experiments via free-text search (e.g., experiment accession numbers, authors, laboratory, publication, and keywords), and to filter the data by species or array design. Once the desired experiment is identified, the user can find more information about the samples, protocols used, experimental design, etc., and, most importantly, can export either all or parts of the data from the experiment.

The AE Warehouse holds additionally curated gene expression data that can be queried and retrieved by gene names, identifiers (e.g., database accession numbers), or properties, such as Gene Ontology terms. The main source of data for the Warehouse is the Repository, although some in situ gene expression and protein expression data from external sources are also loaded into the Warehouse. The use of the AE Warehouse is straightforward: enter the name, ID, or a property of a gene or several genes, retrieve the list of experiments where the given gene has been studied, and zoom into its expression profile.

EP is a Web-based gene expression data analysis tool; several of its components are implemented via integration with the statistical package R (Ihaka and Gentleman, 1996). Users can upload their own data into EP, or data retrieved from AE. Users need only a Web browser to use EP from their local PCs.
Data analysis components for gene expression data preprocessing, missing value imputation, filtering, clustering, visualization, significant gene finding, between-group analysis, and other statistical tasks are available in EP. The Web-based design of EP supports data sharing and collaborative analysis in a secure environment. The tools are integrated with the microarray gene expression database AE and form the exploratory analytical front end to those data. In this unit we present six basic protocols: (1) how to query, retrieve, and interpret data from the AE Warehouse of Gene Expression Profiles; (2) how to query, retrieve, and interpret data and metadata from the AE Repository of Microarray and Transcriptomics Data; (3) how to upload, normalize, analyze, and visualize data in EP; (4) how to perform clustering analysis in EP; (5) how to calculate Gene Ontology term enrichment in EP; and (6) how to calculate chromosome co-localization probability in EP.

BASIC PROTOCOL 1: QUERYING GENE EXPRESSION PROFILES

This protocol describes how to query and analyze data from the AE Warehouse of Gene Expression Profiles.

Necessary Resources

Hardware: Suggested minimum requirements for a PC system: fast Internet connection (64K+); graphics card supporting at least 1024 × 768 resolution (optimal 1280 × 1024, 65K+ colors)

Software: Supported Internet browsers: Internet Explorer 6 and 7 (Windows 2000, XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+ (Mac OS X)

Query for expression profiles of a particular gene

1. Open the AE homepage at http://www.ebi.ac.uk/arrayexpress.

2. In the Expression Profiles box, on the right-hand side of the page, type a gene name (e.g., nfkbia) into the Gene(s) box, and type leukemia as a keyword in the experiment or sample annotation box (Fig. 7.13.1).
The Experiments box, on the left-hand side of the page, allows querying of the AE Repository of Microarray Experiments; this is the focus of Basic Protocol 2.

3. Select a species, e.g., Homo sapiens, in the drop-down menu and click the “query” button.

The interface returns the list of all experiments (studies) in the AE Warehouse in which the selected gene has been studied (Fig. 7.13.2). Experiments are ordered by “relevance,” with the most relevant experiment on top. The “relevance rank” is based on the correlation between experimental factor values and gene expression values and is calculated using several methods, including a linear model in the Bioconductor package limma (Smyth, 2004). For each experiment, a short description, a list of experimental factors (see step 2), and the experimental setup (type) are provided. In addition, a thumbnail image shows the behavior of the selected gene in each experiment retrieved, so that at a glance the user can decide which experiment might be interesting for further viewing.

Figure 7.13.1 The ArrayExpress query windows (http://www.ebi.ac.uk/arrayexpress).

Figure 7.13.2 Output window after querying the AE Warehouse for the expression profiles of a particular gene (e.g., nfkbia).

Figure 7.13.3 Zoomed-in view of a particular experiment. The main graph shows the expression profile of the selected gene (e.g., nfkbia) for all experimental samples, grouped by the selected experimental factor.

Choose the experiment of interest and explore the expression profile of the chosen gene

4. Click on the thumbnail image of the expression profile in one of the experiments, e.g., E-AFMX-5.

In the resulting graph (Fig. 7.13.3), the X axis represents all samples in the study, grouped by experimental factor, while the Y axis represents the expression level of nfkbia in each sample. Explore the dependency of the expression levels on different experimental factors.
Experimental factors are the main experimental variables of interest in a particular study. For instance, experiment E-AFMX-5 has three experimental factors: cell type, disease state, and organism part. Select “cell type.” Observe that nfkbia has notably higher expression values in the CD33+ myeloid cell type than, for instance, in CD4+ T cells (Fig. 7.13.3). The black line represents the expression values for the Affymetrix probe set 201502_s_at, which targets nfkbia; the dotted lines represent the mean expression values. Scroll down the page for more information about the sample properties. In the table provided, the sample number in the first column corresponds to the sample number on the X axis of the graph. The expression values are measured in the units supplied by the submitter; for instance, E-AFMX-5 uses the Affymetrix platform and the MAS5 normalization method. For more information about the normalization protocol used in an individual experiment, click on the experiment accession number (e.g., E-AFMX-5). This opens the corresponding dataset entry in the AE Repository, which contains all the information related to the selected study (described in Basic Protocol 2).

Select other genes with expression profiles most similar to the chosen one

5. On the top right-hand side of the same page (Fig. 7.13.3), from the “similarity search” drop-down menu, select the “find 3 closest genes” option.

This selects the three most similarly expressed genes and adds their expression profiles to the current view, next to the nfkbia profiles (Fig. 7.13.4). You will find that the expression patterns of the genes IER2, FOS, and JUN closely resemble the behavior of nfkbia. Click on “expand” next to Gene Properties (upper right-hand side of the page) and follow the links from the expanded view to retrieve additional information about these genes in the ENSEMBL, UniProt, and 4DXpress databases. Finally, by clicking on the “download the gene expression data matrix” link located below the graph (Fig. 7.13.4), the user can obtain the numerical expression values of the selected genes for further analysis.

Figure 7.13.4 Similarity search output window. The expression profile of the selected gene (e.g., nfkbia) is plotted together with those of the three genes showing the closest similarity in expression pattern within the same experiment. The corresponding gene symbols are listed on the right (Ier2, Fos, and Jun).

Query for expression profiles of several genes

6. Repeat the search as in step 1, but instead of a single gene, enter two or more comma-separated gene names, e.g., Ephb3, Nfkbia; select the species Mus musculus; and click the “query” button.

If more than one gene name is entered, the query tries to match the gene names exactly (Fig. 7.13.5). The user is taken to an intermediate window listing the matching genes found, together with a list of matching experiments. Tick the genes of interest (in this case both of them) and then click “display” at the top of the page. On the thumbnail plot page that opens, click to zoom into the expression profile of experiment E-MEXP-774. Note how the responses of these two genes to dexamethasone hormone treatment are opposite.

Query for expression of genes of a particular Gene Ontology category

7. Repeat the search as in step 1, but instead of entering a gene name, enter a Gene Ontology category term or keyword, such as cell cycle, and select the species Schizosaccharomyces pombe.

As this is a wide category, ∼260 genes are returned. The user can select any number of them for further exploration by ticking the respective boxes.
For instance, one can select the rum1 gene and click on “display” at the top of the page. The familiar thumbnail plots will be returned, and the user can then zoom in as described above.

Figure 7.13.5 Gene selection page. When more than one gene matches the query, this window allows refining the search, querying for multiple genes, or restricting the search to perfect matches only.

BASIC PROTOCOL 2: QUERYING THE AE REPOSITORY OF MICROARRAY AND TRANSCRIPTOMICS DATA

This protocol describes how to browse, query, and retrieve information from the AE Repository of Microarray and Transcriptomics Data.

Necessary Resources

Hardware: Suggested minimum requirements for a PC system: fast Internet connection (64K+); graphics card supporting at least 1024 × 768 resolution (optimal 1280 × 1024, 65K+ colors)

Software: Supported Internet browsers: Internet Explorer 6 and 7 (Windows 2000, XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+ (Mac OS X)

Query the Repository

1. Open the AE homepage at http://www.ebi.ac.uk/arrayexpress (Fig. 7.13.1).

2. In the Experiments box on the left-hand side of the page, type in a word or phrase by which you want to retrieve experiments, e.g., cell cycle. Click the “query” button.

The user can query the Repository by experimental attributes, such as keywords, species, array platform, publication details, or experiment accession numbers. Alternatively, the user can first click on the Browse experiments link (in the Experiments box) to browse the entire Repository content and subsequently apply additional filters, such as “filter on species” and “filter on array,” to narrow down the search. Note that when using Internet Explorer the drop-down menu for the “filter on array” option is not displayed properly, so we strongly advise using Firefox 1.5+ to avoid this problem.
3. In the output window, filter on the species Schizosaccharomyces pombe using the “filter on species” drop-down menu at the top of the page.

This brings up a window listing the experiments in reverse order of their publication dates in the AE Repository (Fig. 7.13.6). The number of experiments per page can be increased, up to 500, by changing the default in the top-right corner. The information displayed for each experiment is described in Table 7.13.1.

4. Expand an experiment by clicking on the experiment title line. For instance, expand the experiment E-MEXP-54.

Additional information is provided in the new window, together with very useful links to experiment annotation and data retrieval (Fig. 7.13.7). Spreadsheets for Sample annotation and Detailed sample annotation can be viewed or downloaded by clicking on the Tab-delimited spreadsheet links. A graphical representation of the experimental setup is also available by clicking the PNG or SVG links under the Experiment design menu.

Figure 7.13.6 Output window after querying the AE Repository for a particular set of experiments, using a word or phrase (e.g., cell cycle) and selecting a species (e.g., Schizosaccharomyces pombe). The total number of experiments and corresponding samples retrieved appears at the bottom of the page.

Table 7.13.1 Information Displayed for Individual Experiments in the AE Repository

ID: The experiment accession number, a unique identifier assigned to each experiment by the AE curation staff; this ID can be used directly to query the Repository.
Title: A brief description of the experiment.
Hybs: Number of hybridizations associated with the experiment.
Species: List of the species studied in the selected experiment.
Date: The date when the experiment was loaded into the Repository.
Processed/Raw: Indicates whether the available data are processed and/or raw.
A yellow icon indicates data available for download; a gray icon indicates data that are unavailable. Affymetrix raw data has a dedicated Affymetrix icon. Data can be downloaded simply by clicking on the icons.
More: Link to the Advanced User Interface for experiment annotation and data retrieval.

Figure 7.13.7 Expanded view of a single experiment, with links to several experiment annotation files and to the data retrieval page.

Clicking on Experimental protocols takes the user to a detailed description of all protocols used, including array manufacturing, RNA extraction and labeling, hybridization, scanning, and data analysis. Clicking on the array accession number (in this case A-SNGR-8), aligned with the Array menu, opens a new page with additional links to the array design used. The array annotation file can be downloaded in Excel or tab-delimited format; these files define the annotation and the layout of the reporters on the array. Clicking on the FTP server direct link takes the user to the FTP directory from which all annotation and data files available for the selected experiment can be downloaded.

5. Go back to the expanded experiment view (Fig. 7.13.7) and click on the View detailed data retrieval page link (under the Downloads menu).

The new page header provides information on data availability for the selected experiment. Two data groups are available in this case: the Processed Data Group, which contains the normalized data, and the Measured Data Group, which contains the raw data. Take a look at Processed Data Group 1 (Fig. 7.13.8, top). A list of all hybridizations (or experimental conditions) is displayed together with the corresponding experimental factors, in this case time intervals after cell synchronization. Each hybridization corresponds to a data file that contains expression levels for all genes in that experimental condition.
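Each of these per-condition files pairs reporter identifiers with expression values, and the export step joins such tables, reporter by reporter, into one genes × conditions matrix. The join can be sketched in pure Python; the file names, the two-column layout, and the tab delimiter below are illustrative assumptions, not the actual AE file format:

```python
import csv
import io

def build_data_matrix(condition_files):
    """Merge per-condition (reporter, value) tables into one matrix.

    condition_files: dict mapping condition name -> file-like object with
    tab-delimited (reporter, value) rows. Returns (header, rows) for a
    genes x conditions matrix; reporters absent from a condition get "".
    """
    conditions = sorted(condition_files)
    per_condition = {}
    reporters = set()
    for cond, fh in condition_files.items():
        table = {reporter: value
                 for reporter, value in csv.reader(fh, delimiter="\t")}
        per_condition[cond] = table
        reporters.update(table)
    header = ["Reporter"] + conditions
    rows = [[r] + [per_condition[c].get(r, "") for c in conditions]
            for r in sorted(reporters)]
    return header, rows

# Toy example: two hypothetical conditions held in in-memory "files"
files = {
    "t0":  io.StringIO("geneA\t1.2\ngeneB\t0.8\n"),
    "t30": io.StringIO("geneA\t2.4\ngeneB\t0.4\n"),
}
header, rows = build_data_matrix(files)
```

The resulting rows can then be written out as a single tab-delimited .TXT file, which is essentially the shape of the data matrix the retrieval page produces.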
The Detailed data retrieval page allows the user to generate a gene expression data matrix from these data files. This matrix is a single .TXT file containing the expression levels of all genes in all experimental conditions within an experiment (see Commentary for a definition of a data matrix). To generate such a matrix, the user needs to select the experimental conditions to be included (all or a subset, as needed), the “quantitation type,” which represents the expression levels, and additional gene annotation columns, which can later be useful in interpreting the results.

6. Select all experimental conditions.

Next, the user needs to select the “quantitation type” and the gene annotation.

7. Scroll down to the Quantitation type and Array Annotation sections (Fig. 7.13.8, bottom). Select Software (Unknown):Sanger Rustici:normalized as the quantitation type, and select two array annotations: Database DB:genedb and Reporter name.

Figure 7.13.8 Top: Data retrieval page, Processed data group detail—Experimental conditions. This section of the page allows the user to select the experimental conditions to be included in the data matrix for further analysis. Bottom: Data retrieval page, Processed data group detail—Quantitation Types and Design Element Properties. This section allows the user to select the format of the normalized data and the type of annotation to be included in the data matrix.

The Quantitation type section lists all available data formats. For this experiment, only one quantitation type is given (the normalized signal provided by the submitter), but for other array platforms (e.g., Affymetrix) more types are available. The Array Annotation section lists the annotation information available for the array platform used.

8. Scroll down and take a look at Raw Data Group 1.
Skip the Experimental conditions and go to the Quantitation types. This experiment used two-channel microarrays, so the data extracted from each individual feature are provided for both Cy3 and Cy5, including foreground and background intensities (mean, median, and standard deviation), as well as ratio values and background-corrected intensities. Any combination of these parameters can be included in the final data matrix whenever raw data are needed.

9. Go back to Processed Data Group 1 and click on Export data.

A data matrix will be computed using all selected experimental conditions, the normalized signal from each condition, and the selected annotation for each identifier present on the array.

10. On the new page, click on See data matrix to view the generated file and on Download data matrix to save it to your computer as a .TXT file.

Once the data have been retrieved, they can be analyzed using the online data analysis tool Expression Profiler; this is the focus of Basic Protocol 3.

11. Open a new browser window, query the Repository for experiment E-AFMX-5, and go to its data retrieval page to view an example of the data associated with an Affymetrix experiment.

For Affymetrix arrays, the .CHP file contains the processed/normalized expression levels of each gene on the array, and the .CEL file contains the raw data for every feature on the chip.

BASIC PROTOCOL 3: HOW TO UPLOAD, NORMALIZE, ANALYZE, AND VISUALIZE DATA IN EXPRESSION PROFILER

This protocol describes how to upload, normalize, analyze, and visualize data in EP.
Necessary Resources

Hardware: Suggested minimum requirements for a PC system: fast Internet connection (64K+); graphics card supporting at least 1024 × 768 resolution (optimal 1280 × 1024, 65K+ colors)

Software: Supported Internet browsers: Internet Explorer 6 and 7 (Windows 2000, XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+ (Mac OS X)

Browse the Repository and upload data into EP

1. Go to the AE homepage at http://www.ebi.ac.uk/arrayexpress.

The user will now retrieve an experiment from AE and save the data, which will then be loaded into and analyzed with EP.

2. In the Experiments box, on the left-hand side of the page, type in the experiment accession number E-MEXP-886 and click on the “query” button.

3. Expand the experiment view for E-MEXP-886 by clicking on the experiment title line and explore its properties.

This experiment used transcription profiling of ataxin-null versus wild-type mice to investigate spinocerebellar ataxia type 1. A total of ten Affymetrix MOE430A arrays were used, five hybridized with wild-type and five with knock-out samples.

4. Download the raw data by clicking on the raw data icon and saving the E-MEXP-886.raw.zip file to your PC.

This is a quick way to export the dataset: instead of generating a data matrix, the entire raw dataset can be saved as a compressed archive for direct upload.

5. Go to the EP main page at http://www.ebi.ac.uk/expressionprofiler (Fig. 7.13.9).

6. Create a login by clicking on the Register new user link.

Fill in the “user registration” page with all required details and choose a personal user name and password, which you can use each time you log in. All loaded data and analysis history will be saved and stored under this user login. With a “guest login,” all data and analyses are lost at the end of each session.
Figure 7.13.9 The Expression Profiler main page (http://www.ebi.ac.uk/expressionprofiler/).

Figure 7.13.10 Upload/Expression data windows in EP. The user can directly upload data in a variety of tabular formats (top) or in Affymetrix format (bottom).

7. Once registered, click on the EP:NG Login Page link on the EP main page (Fig. 7.13.9), enter your username and password, and click Login.

The user is then taken to the data upload page (Fig. 7.13.10).

8. On the Upload/Expression data page, click on the Affymetrix tab (Fig. 7.13.10, bottom).

The Data Upload component accepts data in a number of formats, including basic tab-delimited files, such as those exported by Microsoft Excel (Tabular data option), and Affymetrix .CEL files (Affymetrix option). The .CEL files can be uploaded by placing them into an archive (e.g., a .ZIP file) and then uploading the archive; the .ZIP file should contain only .CEL files from the same type of Affymetrix array. Users can also select a published dataset from the AE database through the EP interface (ArrayExpress option). A particular dataset can also be uploaded directly from a specific URL, for both Affymetrix and tabular data. Except for .CEL files, uploaded expression datasets must be represented as data matrices, with rows and columns corresponding to genes and experimental conditions, respectively.

9. Browse to and select the E-MEXP-886.raw.zip file. Select the data species, e.g., Mus musculus, enter a name for the experiment, and click on the Execute button.

When uploading tabular data, the user also needs to select a more specific data format from among tab-delimited, single-space delimited, any-length white-space delimited, Microsoft Excel spreadsheet, or custom delimiter.
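A tabular matrix of this kind typically carries a header row and a few leading annotation columns before the numeric values begin, so it can help to check locally that a file is well formed before uploading it. A minimal parsing sketch in pure Python (the sample content, the single header row, and the two annotation columns are illustrative assumptions, not an EP requirement):

```python
def parse_matrix(text, first_data_row=1, first_data_col=2):
    """Split a tab-delimited expression matrix into annotation and numbers.

    Rows before first_data_row are treated as header rows; columns before
    first_data_col are treated as gene annotation (e.g., IDs, names).
    """
    lines = [ln.split("\t") for ln in text.strip().splitlines()]
    header = lines[:first_data_row]
    annotation, values = [], []
    for fields in lines[first_data_row:]:
        annotation.append(fields[:first_data_col])
        values.append([float(v) for v in fields[first_data_col:]])
    return header, annotation, values

# Hypothetical matrix: one header row, two annotation columns, two conditions
text = ("DB:genedb\tReporter\tcond1\tcond2\n"
        "SPAC1.01\trep1\t0.5\t1.5\n"
        "SPAC1.02\trep2\t2.0\t0.25\n")
header, ann, vals = parse_matrix(text)
```

A `ValueError` from the `float()` conversion is a quick sign that the first data row/column positions do not match the file's actual layout.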
Depending on the number of annotation columns included in the data matrix (as shown in Basic Protocol 2, which describes how to compute a data matrix from the Detailed data retrieval page), the user will also need to specify the position of the first data column and first data row in the matrix (Fig. 7.13.10, top). After a successful microarray data import, the EP Data Selection view is displayed (Fig. 7.13.11). This view has three sections, described in Table 7.13.2. The Subselection component provides several basic mechanisms for selecting genes and conditions with particular expression values. The Select rows and Select columns tabs allow sub-selecting a slice of the gene expression matrix by row or column names (partial word matching can be used for this filter). Other selection options are: Missing values, Value ranges, Select by similarity, and “eBayes (limma).” See Table 7.13.3 for more information on how to use these options.

Figure 7.13.11 Data selection view in EP. This window is divided into three main sections: current dataset (top), descriptive statistics (middle), and subselection menu (bottom).

Table 7.13.2 Sections in the EP Data Selection View

Current dataset: Displays the user’s folder structure, current dataset selection, and the ongoing analysis history (Fig. 7.13.11, top). EP stores all parameters, results, and graphics files for every performed analysis step; these can be retrieved at any stage of the analysis by clicking the View action output icon (the button with the yellow arrow/magnifying glass combination) next to the respective analysis step.
Additional icons allow the user to view the original data, as well as row and column headers, or to delete a previously loaded dataset.

Descriptive statistics: Provides basic data visualization graphics (Fig. 7.13.11, middle). Graphics may include a plot of perfect match (PM) probe intensities (log scale) for Affymetrix arrays, or distribution density histograms (one- and two-channel experiments, absolute and log-ratio data).

Subselection: A number of subsections are available, with various criteria for selecting data subsets (Fig. 7.13.11, bottom; see Table 7.13.3 for more information). This portion of the page changes according to the EP component selected by the user from the menu at the top left-hand side.

The menu on the top left-hand side (Fig. 7.13.11) provides links to all the EP components, which can be used for data transformation, analysis, and visualization. The following sections give an overview of how to use them.

Perform data normalization in EP

10. Click on Data Normalization, under the Transformations menu on the left-hand side of the page (Fig. 7.13.11).

EP provides a graphical interface to four commonly used Bioconductor data normalization routines for Affymetrix and other microarray data: GCRMA, RMA, Li and Wong, and VSN (Li and Wong, 2001; Huber et al., 2002; Irizarry et al., 2003; Wu et al., 2004). When dealing with raw scanner output data, as in the case of Affymetrix CEL files, it is important to normalize the data to minimize noise and to make the expression values comparable across arrays (Quackenbush, 2002). Of the four methods available in EP, GCRMA, RMA, and Li and Wong can only be applied to Affymetrix CEL file imports, while VSN can be applied to all types of data. An important difference between GCRMA, RMA, and VSN that influences subsequent analysis is that, as their final step, the former two algorithms take a base 2 logarithm of the data, while VSN takes the natural (base e) logarithm.

11.
Click on the RMA tab and then click Execute.

The results of the normalization are displayed in a new window as a Dataset heatmap and a post-normalization box plot of the PM log intensity distribution (Fig. 7.13.12, top). Explore the expression value distribution plots after normalization by going back to the previous window (Fig. 7.13.12, bottom).

12. Apply different normalization methods to the same dataset and compare the outputs.

Perform data transformation in EP

13. Click on Data Transformation, under the Transformations menu on the left-hand side of the page (Fig. 7.13.11).

The Data Transformation component is useful when the data need to be transformed to make them suitable for a specific analysis. For instance, if the starting data import was a set of Affymetrix CEL files, it may be desirable to look for genes whose expression varies relative to a reference sample, i.e., relative to one of the imported CEL files. The transformation options listed in Table 7.13.4 are available.

14. Click on the Absolute-to-Relative tab and use “gene’s average value” as the reference.

15. Select “No log: log 2 data (post RMA)” for this dataset, since the logarithm has already been taken by the RMA normalization algorithm, and click Execute.

Once again, the result of the data transformation is displayed in a new window as a Dataset heatmap.

16. Explore the expression value distribution plots after the transformation by going back to the previous window (Fig. 7.13.13).

Table 7.13.3 Subselection Menu Components

Value ranges: Selects genes that lie a specified number of standard deviations from the mean in at least a minimum percentage of the experiments. Alternatively, a slightly easier option is to sub-select the top N genes with the greatest standard deviations; an input box is provided to specify N. The value ranges option is similar to the commonly applied fold-change criterion (for example, filtering for genes whose expression in one condition is more than twice that in another), with one main difference: it takes into account the variability of each gene across multiple conditions. Moreover, the standard deviation criterion can easily be applied to single-channel data. Both the number of standard deviations and the percentage of conditions can be adjusted to obtain a reasonable number of candidate genes for follow-up analysis. We have found that 1.5 standard deviations in 20% of the conditions is a good starting point for this type of filtering.

Missing values: Filters out rows of the matrix with more than a specified percentage of values marked as NA (Not Available).

Select by similarity: Accepts a list of genes and, for each of them, selects a specified number of the most similarly expressed genes in the same dataset, merging the results into one list. The performance of this method depends both on the initial gene selection and on the choice of distance measure used to compute the similarities (see Critical Parameters and Troubleshooting for more on distance measures).

eBayes (limma): Provides a simple interface to the eBayes function from the limma Bioconductor package (Smyth, 2004). It allows specifying groups of samples and searching for genes differentially expressed between the defined groups. To specify the sample groups (factors), click on the Define factors button and use the dialog window that opens to define one or several factor groups. Applying the eBayes data selection method is then as simple as selecting which factor group to use and specifying how many genes to return.

Statistical analysis of microarray data can be significantly affected by the presence of missing values.
Therefore, it is important to estimate these values as accurately as possible before performing any analysis. For this purpose, three methods are available in the Missing Value Imputation section under the Transformations menu: “replace with zeros,” “replace with row averages,” and KNN imputation (Troyanskaya et al., 2001; Johansson and Hakkinen, 2006).

Figure 7.13.12 Data normalization output graphs. The results of data normalization can be viewed as a box plot of the perfect match (PM) log intensity distribution (top) or in the descriptive statistics view (bottom). Above the line graph, the post-normalization mean and standard deviation values are displayed.

Identification of differentially expressed genes

Two statistical approaches are available for the identification of differentially expressed genes: t-test analysis and standard multivariate analysis methods, such as Principal Component Analysis and Correspondence Analysis.

Via the t-test component

17. Go to Statistics and click on “t-test Analysis” on the left-hand side of the page (Fig. 7.13.11).

The t-test component under the Statistics menu applies this basic statistical test for comparing the means of two distributions in the following situations: looking for genes expressed significantly above a background/control, or looking for genes expressed differentially between two sets of conditions. In the first case (“one class” option), the user either specifies the background level to compare against or selects the genes in the dataset to be used as controls. In the second case (“two classes in one dataset” option), the user specifies which columns in the dataset represent the first group of conditions and which represent the second. The user will now try an example of the latter case.
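The two-classes comparison boils down to a per-gene t-test between the two groups of columns. A pure-Python sketch using Welch's unequal-variance variant (the expression values below are toy numbers, and EP itself delegates the actual computation to R; Welch's form is one common choice, not necessarily the exact statistic EP uses):

```python
import math

def welch_t(group1, group2):
    """Welch's two-sample t statistic and degrees of freedom for one gene."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    # Unbiased sample variances of each group
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2                      # squared standard error
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Toy log2 expression values for one gene: 5 wild-type vs 5 knock-out arrays
wild_type = [7.1, 7.3, 6.9, 7.2, 7.0]
knock_out = [8.4, 8.1, 8.6, 8.2, 8.5]
t, df = welch_t(wild_type, knock_out)
```

The t statistic would then be converted to a p-value against the t distribution with `df` degrees of freedom and the test repeated for every gene in the matrix, which is why a multiple testing correction is needed on real datasets.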
Table 7.13.4 Transformation Menu Components

Intensity → Log-Ratio: Takes a set of two-channel arrays, divides every channel 1 column by the respective channel 2 column, and then, optionally, takes the logarithm of the ratio.

Ratio → Log-Ratio: Log-transforms the selected dataset.

Average Row Identifiers: Replaces multiple rows containing the same identifier with a single row containing the column-wise averages.

K-Nearest Neighbor Imputation: Fills in the missing values in the data matrix, as in Troyanskaya et al. (2001).

Transpose Data: Switches the rows and columns of the matrix.

Absolute-to-Relative: Converts absolute expression values to relative ones, either relative to a specified column of the dataset or relative to the gene’s mean. This transformation is useful if there is no reference sample in the dataset but relative values are still desired for a specific type of analysis.

Mean-center: Rescales the rows and/or columns of the matrix to zero mean. This can be used before running ordination-based methods to standardize the data and avoid superfluous scale effects, in Principal Components Analysis for instance.

Figure 7.13.13 Data transformation output graph. The transformed data are shown in the descriptive statistics view. At the top of the graph, the post-transformation mean and standard deviation values are displayed.

18. Click on the Two classes in one dataset tab, type in 1-5 for Class 1 (wild-type mice) and 6-10 for Class 2 (knock-out mice), and click on Execute.
ArrayExpress and Expression Profiler

Upon execution, the t-test calculates, for each gene, the mean in both groups being tested (when testing against controls, the mean over all control genes is taken as the second group mean) and compares the difference between the two means to a theoretical t-statistic (Manly et al., 2004). The reliability of the test depends on the number of samples in each group (i.e., the number of biological replicates) and is reflected in the confidence intervals of the p-values that are produced (here, the p-value expresses how likely it is that the two means differ significantly, i.e., that the gene is differentially expressed). A table of p-values, confidence intervals, and gene names is output (Fig. 7.13.14, left-hand side), as well as a plot of the top 15 genes found (Fig. 7.13.14, right-hand side), as per the user-specified p-value cut-off, defaulting to 0.01.

Figure 7.13.14 t-test analysis output graphs. The t-test analysis results are summarized in a table, where the genes are ranked according to the p-value, with the most significant genes at the top (left). The top 15 genes are also plotted in a graph (right).

An issue that occurs when running the t-test on datasets with large numbers of genes is the multiple testing problem (Pounds, 2006). For example, when performing 10,000 t-tests on a dataset of 10,000 rows at a significance threshold of, e.g., 5%, roughly 500 genes can be expected to fall below the threshold by chance alone and thus be falsely identified as differentially expressed. A number of standard corrections are implemented, including the Bonferroni, Holm, and Hochberg corrections (Holm, 1979; Hochberg, 1988; Benjamini and Hochberg, 1995), which adjust the p-values to account for the possibly high numbers of false positives. The user can select any of them from the Multiple testing correction drop-down menu.

19. Go to Data selection and click on "select by row."

The user can now find out which gene corresponds to any of the top 15 Affymetrix probe IDs just identified by running the t-test analysis. Click on the small top table icon, type an Affymetrix ID in the text box, and click Search. The corresponding gene symbol, gene description, and chromosome location will be returned in the result window.

Via the Ordination Between Group Analysis component

20. Go to the Ordination-based menu and click on Between Group Analysis.

The Between Group Analysis component under the Ordination menu provides a statistically rigorous framework for a more comprehensive multigroup analysis of microarray data. Between Group Analysis (BGA) is a multiple discriminant approach that is carried out by ordinating the specified groups of samples and projecting individual sample locations onto the resulting axes (Culhane et al., 2002). The ordination step involved in BGA, as implemented in EP, can be either Principal Components Analysis (PCA) or Correspondence Analysis (COA), both standard statistical tools for reducing the dimensionality of the dataset being analyzed: they calculate an ordered set of values that correspond to the greatest sources of variation in the data and use these values to "reorder" the genes and samples of the matrix. BGA combined with COA is especially powerful, because it provides a simultaneous view of the grouped samples and the genes that most facilitate the discrimination between them. The BGA component's algorithms are provided through an interface to the Bioconductor package made4 (Culhane et al., 2005), which, in turn, builds on the R multivariate data analysis package ade4.

21. Click on the Define new factors icon. In the new window, click on the Add factor button.

In this example, we want to identify the genes that are differentially expressed between two conditions: wild-type and knock-out mice. The top 5 data files are wild type and the bottom 5 are knock-out.
Select a name for the new experimental factor (e.g., WT/KO), fill in the table as shown in Figure 7.13.15, and click Save factor. The newly created experimental factor will now be shown in the Factors box in the BGA window and can be selected as a parameter for the analysis (Fig. 7.13.16).

Figure 7.13.15 Define new factor window. When running an ordination-based technique, the user might need to create a new experimental factor in order to identify the genes differentially expressed between two conditions. In this example, the genotype is the discriminating factor (wild type versus knock-out) and the new factor can be created by filling in the table as shown.

Figure 7.13.16 Between Group Analysis window in EP. The user can select which factor determines the groups for the analysis, the type of transformation to use, and the output options.

22. Select the WT/KO factor. From the top drop-down menu select either COA or PCA (Fig. 7.13.16). The user can also decide to replace the missing values with row averages or leave them in place. Different output graphics can also be added, if needed. For this example, we will keep the default parameters. Click Execute.

The "overall plot" provides a graphical representation of the most discriminating arrays and/or genes. In addition to the plot, BGA produces two numerical tables, the table of gene coordinates and the table of array coordinates. The gene coordinates table is of special interest, because it provides, for each gene, a measure of how much that gene contributes to each of the identified strong sources of variation. The sources of variation (principal axes/components) are ordered from left to right. In this example we only have one main source of variation (component 1).
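The ordination-plus-projection idea behind BGA can be sketched as follows (a simplified illustration of the concept from Culhane et al., 2002, not the made4/ade4 implementation; the function name and toy data are invented):

```python
import numpy as np

def bga_scores(X, groups):
    """Sketch of the ordination step behind Between Group Analysis:
    PCA (via SVD) on the matrix of per-group mean profiles, followed by
    projection of the individual samples onto the resulting axes.
    Rows of X are samples (arrays); columns are genes."""
    X = np.asarray(X, float)
    labels = sorted(set(groups))
    means = np.vstack([X[np.array(groups) == lab].mean(axis=0)
                       for lab in labels])
    centred = means - means.mean(axis=0)
    # principal axes of the group means via singular value decomposition
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    axes = vt.T                    # columns: between-group principal axes
    return (X - X.mean(axis=0)) @ axes

# Toy data: 3 wild-type and 3 knock-out arrays over 2 genes, differing
# strongly in the first gene only.
X = [[0.0, 0.0], [0.2, 0.0], [-0.2, 0.0], [5.0, 0.0], [5.2, 0.0], [4.8, 0.0]]
scores = bga_scores(X, ["WT", "WT", "WT", "KO", "KO", "KO"])
```

On the first between-group axis, the wild-type and knock-out samples land on opposite sides of zero, which is exactly the group separation the overall plot visualizes.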
Thus, genes that have the highest or lowest values 7.13.18 Supplement 23 Current Protocols in Bioinformatics in the first column of the gene coordinates table make up the likeliest candidates for differential expression. The user can also try using the “column ID” as a discriminating factor (Fig. 7.13.16) and run BGA as just described. The results page will now include some additional graphs, including the Eigenvalues histogram and a scatter plot showing how the data is separated in the tridimensional space. PCA and COA can also be run independently of BGA, selecting the Ordination option under the Ordination-based menu. HOW TO PERFORM CLUSTERING ANALYSIS IN EXPRESSION PROFILER This protocol describes all the clustering options available in EP. BASIC PROTOCOL 4 Necessary Resources Hardware Suggested minimum requirements for a PC system: fast Internet connection (64K+), graphics card supporting at least 1024 × 768, optimal resolution 1280 × 1024 (65K+ colors) Software Internet browser supported: Internet Explorer 6 and 7 (platforms: Windows 2000, XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+ (Mac OS X) 1. Explore the options available under the Clustering menu on the left-hand side of the main EP page (Fig. 7.13.11). Clustering analysis is an unsupervised/semi-supervised approach to looking for trends and structures in large, multi-dimensional datasets, such as microarray gene expression data. EP provides fast implementations of two classes of clustering algorithms: hierarchical clustering and flat partitioning, in the Hierarchical and K-means/K-medoids clustering components, respectively, as well as a novel method for comparing the results of such clustering algorithms in the Clustering Comparison component. The Signature Algorithm component is an alternative approach to clustering-like analysis, based on the method by Ihmels et al. (2002). 
All clustering algorithms are essentially aimed at grouping objects, such as genes, according to some measure of similarity, so that objects within one group or cluster are more similar to each other than to objects in other groups. Clustering analysis therefore rests on one essential concept: the definition of similarity between objects, also known as a distance measure. EP implements a wide variety of distance measures for clustering analysis (all of them can be found in the Distance measure drop-down menu). The Euclidean distance and the correlation-based distance are the two most commonly applied measures of similarity. The Euclidean metric measures absolute differences in expression levels, while the correlation-based distance captures similar relative trends in expression profiles. In time series data, for instance, where one is interested in finding clusters of genes that follow a similar pattern of expression over a period of time, the correlation distance often produces the most informative clusters, while in treatment comparison experiments, where one seeks genes that change significantly between treatments, the Euclidean distance may perform better. Note that the data normalization method used may also influence the outcome. The user can practice combining different clustering methods and distance measures to find the optimal combination for each dataset.

2a. Click Hierarchical Clustering.

Hierarchical clustering is an agglomerative approach in which single expression profiles are joined to form groups, which are further joined until the process has been completed, forming a single hierarchical tree (Fig. 7.13.17, left-hand side). The user can perform hierarchical clustering by choosing a dataset of interest from the project tree and then choosing the Hierarchical option from the Clustering menu.
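The contrast between the two distance measures described above is easy to demonstrate numerically (an illustrative sketch; the function names are invented for the example):

```python
import numpy as np

def euclidean_distance(x, y):
    """Measures absolute differences in expression level."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(((x - y) ** 2).sum()))

def correlation_distance(x, y):
    """1 - Pearson correlation: near 0 for profiles that rise and fall
    together, regardless of their absolute expression levels."""
    return float(1.0 - np.corrcoef(x, y)[0, 1])

# Two genes with the same trend but very different absolute levels:
a = [1.0, 2.0, 3.0]
b = [11.0, 12.0, 13.0]
```

Here the correlation distance is essentially zero (identical trend), while the Euclidean distance is large (sqrt(300) ≈ 17.3), illustrating why the choice of measure changes which genes cluster together.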
The user then needs to specify which distance measure and clustering algorithm to use to calculate the tree. Different algorithms can be used to calculate the distance between clusters: single, complete, average, or average group linkage (Quackenbush, 2001). Additionally, the user can choose whether to cluster only rows (genes), only columns (experimental conditions), or both. The output provides a visual display of the generated hierarchy in the form of a dendrogram or tree, attached to a heatmap representation of the clustered matrix (Fig. 7.13.17, left-hand side). Individual branches of the tree, corresponding to clusters of genes or conditions, can be sub-selected (by clicking on a node) and saved for further analysis. However, the hierarchical clustering tree produced for large datasets can be difficult to interpret.

2b. Click Flat Partitioning.

The K-means/K-medoids clustering component provides two flat partitioning methods, similar in their design. Both approaches are based on the idea that, for a specified number K, K initial objects are chosen as cluster centers, the remaining objects in the dataset are iteratively reshuffled around these centers, and new centers are chosen to maximize the similarity within each cluster while at the same time maximizing the dissimilarity between clusters. The main practical difference between the two methods implemented in this component is that K-medoids allows efficient computation of any distance measure available in EP, while K-means is limited to the Euclidean and correlation-based measures. Once again, the user needs to select the distance measure to be used, the number of clusters K, and the initialization method, choosing between initializing by most distant (average) genes, by most distant (minimum) genes, or by random genes. In the output, each cluster is visualized by a heatmap and a multi-gene lineplot (Fig. 7.13.17, right-hand side).

2c. Click Clustering Comparison.

A commonly encountered problem with hierarchical clustering is that it is difficult to identify branches within the hierarchy that in some way form optimally tight clusters. Indeed, with real-world data it is rare that one can clearly identify a definite number of distinct clusters from the dendrogram. Similarly, in the case of flat partitioning, the determination of the number of desired clusters is often arbitrary and unguided. The Clustering Comparison component aims to alleviate these difficulties by providing an algorithm and a visual depiction of a mapping between a dendrogram and a set of flat clusters (Torrente et al., 2005; Fig. 7.13.17). The Clustering Comparison component not only provides an informative insight into the structure of the tree by highlighting the branches that best correspond to one or more flat clusters from the partitioning, but can also be useful when comparing the hierarchical clustering to a predefined, functionally meaningful grouping of the genes.

Figure 7.13.17 A comparison between hierarchical clustering (correlation-based distance, average linkage) and k-means clustering (correlation-based distance, k = 5) in the S. pombe stress response dataset E-MEXP-29. The normalized data was retrieved from ArrayExpress as described in Basic Protocol 2, step 2, and loaded into EP as a tab-delimited file. Data were log transformed and the 140 most varying genes (>0.9 SD in 60% of the hybridizations) selected for clustering comparison. For additional information refer to Torrente et al. (2005). Line thickness is proportional to the number of elements common to both sets. By placing the mouse cursor over a line, a Venn diagram is displayed showing the number of elements in the 2 clusters and the overlap. For color version of this figure see http://www.currentprotocols.com.
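The two clustering approaches from steps 2a and 2b can be sketched with scipy (an illustrative sketch with invented toy data, not EP's implementation; the initial K-means centroids are fixed here purely for reproducibility):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

# Toy matrix: rows = genes, columns = conditions; genes 0-1 share one
# trend, genes 2-3 the opposite trend.
genes = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [3.0, 2.0, 1.0],
    [3.1, 2.1, 0.9],
])

# Hierarchical clustering (correlation distance, average linkage),
# then cut the dendrogram into two flat clusters.
tree = linkage(genes, method="average", metric="correlation")
hier = fcluster(tree, t=2, criterion="maxclust")

# Flat partitioning with K-means (Euclidean), K = 2.
init = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
centroids, flat = kmeans2(genes, init, minit="matrix")
```

Both methods recover the same two groups on this toy data; on real datasets they can disagree, which is exactly what the Clustering Comparison component is designed to visualize.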
When comparing a pair of flat partitionings, it can also be used to establish the optimal parameter K, by starting with a high number of clusters and letting the comparison algorithm identify the appropriate number of superclusters. Adjustable clustering comparison parameters are also provided. The user can specify the "number of steps to look ahead in the tree search," the type of "scoring function" to use, and select an "overlapping index computation method" (Torrente et al., 2005).

2d. Click Signature algorithm.

The Signature Algorithm is an R implementation of a previously described method (Ihmels et al., 2002). It identifies a co-expressed subset in a user-submitted set of genes, removes unrelated genes from the input, and identifies additional genes in the same dataset that follow a similar pattern of expression. Co-expression is identified with respect to a subset of conditions, which is also provided as output of the algorithm. It is a fast algorithm useful for exploring the modular structure of expression data matrices.

BASIC PROTOCOL 5
HOW TO CALCULATE GENE ONTOLOGY TERM ENRICHMENT IN EXPRESSION PROFILER
This protocol describes how to calculate and visualize Gene Ontology term enrichment in EP.

Necessary Resources
Hardware: Suggested minimum requirements for a PC system: fast Internet connection (64K+), graphics card supporting at least 1024 × 768, optimal resolution 1280 × 1024 (65K+ colors)
Software: Supported Internet browsers: Internet Explorer 6 and 7 (Windows 2000, XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+ (Mac OS X)

1. Click on Gene Ontology, under the Annotation menu on the left-hand side of the main EP page (Fig. 7.13.11).

The Gene Ontology (GO) is a controlled vocabulary used to describe the biology of a gene product in any organism (Ashburner et al., 2000).
There are three independent ontologies, which describe the molecular function of a gene product, the biological process in which the gene product participates, and the cellular component where the gene product can be found. Once a subset of genes of interest has been identified through one or several of the approaches described so far, the user can look for GO terms enriched in the given gene list (e.g., a particular cluster obtained with flat partitioning).

2. Enter a list of gene identifiers in the Gene IDs box, select a GO category of interest (such as biological process), and enter a p-value cutoff (default is 0.05). A multiple testing correction can be selected from the drop-down menu to reduce the number of false positives. Click Execute.

Results will be displayed as a tree view of GO terms and the genes associated with each term. In addition, a table will summarize the results, showing for each GO term the observed and genomic frequencies of enrichment, the p-value associated with this enrichment, and the genes related to each category (Fig. 7.13.18).

Figure 7.13.18 Gene ontology annotation output. The results of GO term enrichment for a given gene list are summarized in this table.

BASIC PROTOCOL 6
HOW TO CALCULATE CHROMOSOME CO-LOCALIZATION PROBABILITY IN EXPRESSION PROFILER
This protocol describes how to use the ChroCoLoc application in EP for calculating the probability of chromosome co-localization of a set of co-expressed genes.

Necessary Resources
Hardware: Suggested minimum requirements for a PC system: fast Internet connection (64K+), graphics card supporting at least 1024 × 768, optimal resolution 1280 × 1024 (65K+ colors)
Software: Supported Internet browsers: Internet Explorer 6 and 7 (Windows 2000, XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+ (Mac OS X)

1. Click on ChroCoLoc, under the Annotation menu on the left-hand side of the main EP page (Fig. 7.13.11).

The ChroCoLoc feature allows the user to calculate the probability of groups of co-expressed genes co-localizing on chromosomes (Blake et al., 2006). This application uses a hypergeometric distribution to describe the probability of x co-expressed genes being located in the same region, when there are y genes in that region, out of the total population of genes in the study. Karyotypes for human, mouse, rat, and Drosophila are currently provided.

2. Select a group of co-expressed genes previously identified and use any of the additional filters to remove chromosome regions that might not contain statistically significant features.

Filters can be set for the minimum number of features per region or the minimum percentage of features regulated in the region. The calculated probabilities can be adjusted for multiple testing using the Bonferroni correction. Regions that pass the filter criteria are plotted on the selected karyotype, colored according to the probability of the observed co-localization occurring (Fig. 7.13.19). A table shows the calculated probabilities, which can also be plotted as the percentage of co-expressed genes in a particular region or as the absolute number of co-expressed genes per region.

Figure 7.13.19 Chromosome co-localization output. Probabilities of co-localization of regulated genes are plotted onto a human karyogram. Chromosomal regions are colored according to decreasing probability of co-localization occurring by chance, with red ≤ 0.01, orange ≤ 0.02, yellow ≤ 0.03, light blue ≤ 0.04, and green ≤ 0.05. For color version of this figure see http://www.currentprotocols.com.
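Both GO term enrichment (Basic Protocol 5) and ChroCoLoc-style co-localization (Basic Protocol 6) rest on the same hypergeometric logic, which can be sketched with scipy (an illustrative sketch; the function name and numbers are invented, and the tools' exact calculations may differ):

```python
from scipy.stats import hypergeom

def enrichment_p(population, annotated, selected, overlap):
    """Probability of seeing at least `overlap` annotated genes when
    `selected` genes are drawn from a `population` of which `annotated`
    carry the annotation (a GO term, or location in one chromosome
    region). Upper-tail hypergeometric test."""
    return float(hypergeom.sf(overlap - 1, population, annotated, selected))
```

For instance, if 10 of 50 clustered genes carry a GO term annotated on only 100 of 10,000 genes (0.5 expected by chance), the p-value is vanishingly small, indicating strong enrichment.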
GUIDELINES FOR UNDERSTANDING RESULTS

The basic protocols in this unit describe: (1) how to query and interpret data from the AE Warehouse of Gene Expression Profiles (Basic Protocol 1); (2) how to query, interpret, and retrieve data from the AE Repository of Microarray and Transcriptomics data (Basic Protocol 2); and (3) how to upload and analyze gene expression data in EP (Basic Protocols 3, 4, 5, and 6). The aim of these protocols is to familiarize the user with the AE database content, showing that the database can be queried in different ways (e.g., at the experiment or at the gene level) and how to interpret the results of different queries. In addition, we have shown how microarray data can be exported from AE and subsequently imported into an external data analysis tool, such as EP. Different normalization, transformation, hypothesis testing, and clustering methods have been illustrated with the aim of giving the reader an overview of some of the most important steps in the analysis of gene expression data. It is extremely important for the user to have at least a basic understanding of these methods before drawing conclusions from the results. A few additional concepts have not been covered above but are worth mentioning in the context of microarray research and data analysis.

1. The most crucial aspect of any microarray experiment is the experimental design. If the experiment is not designed properly, no analysis method will be able to obtain valid conclusions from the data. When designing an experiment, one should always take into account: (i) which samples are compared; (ii) how many experimental factors are involved and what they are; (iii) how many biological/technical replicates will be used; and (iv) which reference would be most appropriate.

2. Array quality assessment is an aspect that should always be included among the goals of data analysis.
Such quality measures allow the user to discard data from substandard arrays, as well as to identify possible causes of failure in the microarray process.

3. Any hypothesis generated from a microarray experiment needs to be validated independently in order to establish the "robustness" or "reliability" of a microarray finding. Experimental validation is essential, especially considering the inherently noisy nature of microarray data.

COMMENTARY

Background Information

Central concepts: Experiment and dataset

The highest level of organization in the AE Repository is the experiment, which consists of one or more hybridizations, usually linked to a publication. The query interface provides the ability to query the experiments via free-text search (e.g., experiment accession numbers, authors, and publication details) and to filter the experiments by species or array design. Once an experiment has been selected, the user can examine the description of the samples and protocols by navigating through the experiment, or can download the dataset associated with the experiment and analyze it locally (see Basic Protocol 2). Once downloaded, the dataset can be visualized and analyzed online using Expression Profiler (see Basic Protocols 3, 4, 5, and 6) or other analysis tools. The dataset is the central object that the user provides as input for analyses in EP. A dataset can be one of two types: raw or normalized. A raw dataset contains the spot-intensity data and can be input to a normalization procedure, which yields a gene expression data matrix (see Basic Protocol 3). A normalized dataset is already in the format of a gene expression data matrix. This can be input to a selection of analysis components, such as the t-test and clustering.
Gene expression data matrix

In a gene expression data matrix, each row represents a gene and each column represents an experimental condition or array. An entry in the data matrix usually represents the expression level or expression ratio of a gene under a given experimental condition. In addition to numerical values, the matrix can also contain additional columns for gene annotation or additional rows for sample annotation, as textual information. Gene annotation may include gene names, sequence information, chromosome location, a description of the functional role of known genes, and links to the respective entries in sequence databases. Sample annotation may provide information about the organism part from which the sample was taken, the cell type used, or whether the sample was treated and, if so, what the treatment was (e.g., compound used and concentration). An example of how to generate a gene expression data matrix from an experiment deposited in the AE Repository is given in Basic Protocol 2.

Gene expression profile

Another important concept is that of the gene expression profile. An expression profile describes the (relative) expression levels of a gene across a set of experimental conditions (e.g., a row in the data matrix) or the expression levels of a set of genes in one experimental condition (e.g., a column in the data matrix). The AE Data Warehouse supports queries on gene expression profiles using (1) gene names, identifiers, or properties such as GO terms; (2) information on which family a gene belongs to or the motifs and domains it contains (InterPro terms); and (3) sample properties. The user can retrieve and visualize the gene expression values for multiple experiments. The expression profiles are visualized using line plots, and a similarity search can be run to find genes with similar expression levels within the same experiment (see Basic Protocol 1).
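The data-matrix and profile concepts map directly onto a simple array structure (a toy illustration; the gene and condition names are hypothetical):

```python
import numpy as np

# A toy gene expression data matrix: rows = genes, columns = conditions.
values = np.array([
    [2.1, 0.3, -1.2],
    [0.0, 1.5,  0.8],
])
genes = ["geneA", "geneB"]                  # row annotation (hypothetical)
conditions = ["ctrl", "treat1", "treat2"]   # column annotation

# A gene's expression profile is a row of the matrix...
profile_geneA = values[genes.index("geneA"), :]
# ...while the profile of one condition is a column.
profile_treat1 = values[:, conditions.index("treat1")]
```

Keeping the annotation alongside the numerical matrix in this way is what lets a tool query by gene name or by sample property and return the corresponding row or column.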
Critical Parameters and Troubleshooting

Many issues arise in microarray data analysis, and it is beyond the scope of this unit to address them all. Here we only draw the reader's attention to some of these issues and to how they can be addressed within Expression Profiler. The reader should also refer to the specific discussion provided in individual protocol steps.

Data normalization and transformation

Microarrays, like any other biological experiment, are characterized by systematic variation between experimental conditions unrelated to biological differences. For example, when dealing with gene expression data, true differential expression between two conditions (e.g., normal versus disease) might be masked by several biases introduced throughout the experiment (e.g., different amounts of starting RNA, unequal labeling efficiency, and uneven hybridization). Normalization aims to compensate for systematic technical differences between arrays, in order to see more clearly the systematic biological differences between samples. Without normalization, data from different arrays cannot be compared. Although the aim of normalization stays the same, the algorithms used for normalizing two-color arrays differ from those used for normalizing one-color arrays (e.g., Affymetrix). Four normalization algorithms are implemented in EP, three specific to Affymetrix data (RMA, GCRMA, and Li and Wong) and one (VSN) that can be applied to all data types. It is advisable to apply more than one normalization method to each dataset and compare the outputs in order to identify which method is more appropriate for a given dataset (see Basic Protocol 3, step 2). In the context of microarray data analysis, expression ratio values are transformed to log-ratio values. This transformation is sometimes carried out by the normalization algorithm or might need to be carried out separately.
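Numerically, the ratio-to-log-ratio transformation looks like this (a minimal illustration):

```python
import numpy as np

# In linear ratio space, 4-fold induction (4.0) and 4-fold repression
# (0.25) sit at very different distances from 1; in log2 space they
# become symmetric around 0.
ratios = np.array([4.0, 2.0, 1.0, 0.5, 0.25])
log_ratios = np.log2(ratios)   # -> [ 2.  1.  0. -1. -2.]
```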
Log transforming the data is of great importance. In linear ratio space, the ratio values are centered around a value of 1. All genes that are up-regulated have values >1, with no upper bound, while all genes that are down-regulated have ratio values compressed between 0 and 1. The distribution of linear ratio values is therefore clearly asymmetric. By transforming all ratio values to log space, the data become distributed symmetrically around 0: genes up-regulated by 2-fold are now at a value of +1, while 2-fold repression is quantified by a value of -1. Several data transformation options are provided in EP and can be used to convert the data to the desired format. Following normalization, the user should always check the shape of the data distribution and apply the appropriate transformation, if needed (see Basic Protocol 3, step 3).

Identification of differentially expressed genes

In many cases, the purpose of a microarray experiment is to compare gene expression levels in two different conditions (e.g., normal versus disease). A wide range of methods is available for the selection of differentially expressed genes, but some (e.g., a plain fold-change threshold) are arbitrary and inadequate. A more robust method for selecting differentially regulated genes is a classical hypothesis testing approach, such as the t-test. A t-test assesses whether the means of two groups are statistically different from each other, relative to the variability of the distributions. It essentially measures the signal-to-noise ratio and calculates a p-value for each gene. Consequently, those genes whose p-values fall below a certain cut-off are considered significantly different between the two conditions and are selected for further analysis. As explained in Basic Protocol 3, an issue that occurs when running the t-test on datasets with large numbers of genes is the multiple testing problem (Pounds, 2006).
For example, when performing 10,000 t-tests on a dataset of 10,000 rows at a significance threshold of, e.g., 5%, roughly 500 genes can be expected to fall below the threshold by chance alone and thus be falsely identified as differentially expressed. Therefore, the user should always apply a multiple testing correction method to account for the possibly high numbers of false positives. A number of multiple testing corrections are implemented in EP, including Bonferroni, Holm, and Hochberg (Holm, 1979; Hochberg, 1988; Benjamini and Hochberg, 1995). Once again, the user should apply different correction methods and observe the changes in the final result. In addition, this approach requires each experimental condition to be represented by multiple biological replicates in order to achieve statistical significance. There is no ideal number of biological replicates per experiment, but the user should bear in mind that biological replication substantially increases the reliability of microarray results. Another issue is the choice of the most appropriate p-value cut-off to be used in the t-test analysis. Once again, the most appropriate cut-off must be decided by inspecting the data. The user should apply different cut-off values and observe how changes in the cut-off affect the final number of differentially expressed genes. Only by moving the cut-off up and down will the user be able to decide on the most appropriate value to use.

Clustering analysis

The aim of clustering is to discover patterns of gene expression in the data by grouping genes together, according to a similarity measure, so that objects within one group are more similar to each other than to objects in other groups. For this, the user needs to quantify to what degree two expression profiles are similar. Such a measure of similarity is called a distance: the more distant two expression profiles are in multidimensional space, the more dissimilar they are.
One can measure this distance in a number of different ways. The simplest measure is the Euclidean distance, which is simply the length of the straight line connecting two points in multidimensional space. Another measure is the Manhattan distance, which represents the distance that one needs to travel in an environment in which one can move only along directions parallel to the x or y axes. In comparison to the Euclidean distance, the Manhattan distance tends to yield a larger numerical value for the same relative position of the points. Other measures quantify the similarity in expression profile shape (e.g., whether the genes go up and down in a coordinated fashion across the experimental conditions) and are based on measures of correlation (e.g., Pearson correlation). With so many distance measures, a natural question is when to use which. It is a good idea for the user to explore the different similarity measures separately to become familiar with the properties of each measure. Also, when clustering genes, the user needs to choose which clustering algorithm to use. Two algorithms are available in EP: an agglomerative method (hierarchical clustering) and a partitioning method (K-means clustering). Hierarchical clustering starts by assigning each gene to its own cluster, so that if there are N genes, there are initially N clusters, each containing just one gene. It then finds the closest (most similar) pair of genes, using the similarity measure chosen by the user, and merges them into a single cluster, leaving one cluster fewer. The distances between the new cluster and each of the old clusters are then computed and the process is repeated until all genes are clustered. Four different algorithms can be used to calculate the inter-cluster distances in EP (i.e., single, complete, centroid, or average linkage; see Basic Protocol 4).
The result is a hierarchy of clusters in the form of a tree or dendrogram (see Fig. 7.13.17, left-hand side). When using K-means clustering, the user needs to specify the number of clusters, K, into which the data will be split. Once K has been specified, the method assigns each gene to whichever of the K clusters has the nearest centroid. The position of each centroid is recalculated every time a gene is added to its cluster, and this continues until all the genes are grouped into the final required K clusters. The initial choice of the number of clusters K is an issue that needs careful consideration. If it is known in advance that the patterns of gene expression to be clustered belong to several classes (e.g., normal and disease), the user should cluster using the known number of classes as K. If the analysis has an exploratory character, the user should repeat the clustering for several values of K and compare the results. Once again, the user is encouraged to compare the output of different clustering methods and apply several combinations of distance measure and clustering algorithm to see how the clustering results change. This will help with finding the best visual representation for a dataset (Quackenbush, 2001). The Clustering comparison component in EP is particularly useful for this purpose, since it provides side-by-side visualization of two clustering outputs. This may help the user decide on a particular clustering method or on the optimal number of clusters K into which to subdivide the dataset.
Literature Cited
Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., and Sherlock, G. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25-29.
Ball, C., Brazma, A., Causton, H., Chervitz, S., Edgar, R., Hingamp, P., Matese, J.C., Icahn, C., Parkinson, H., Quackenbush, J., Ringwald, M., Sansone, S.A., Sherlock, G., Spellman, P., Stoeckert, C., Tateno, Y., Taylor, R., White, J., and Winegarden, N. 2004. An open letter on microarray data from the MGED Society. Microbiology 150:3522-3524.
Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57:289-300.
Blake, J., Schwager, C., Kapushesky, M., and Brazma, A. 2006. ChroCoLoc: An application for calculating the probability of co-localization of microarray gene expression. Bioinformatics 22:765-767.
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., and Vingron, M. 2001. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet. 29:365-371.
Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., Holloway, E., Kapushesky, M., Kemmeren, P., Lara, G.G., Oezcimen, A., Rocca-Serra, P., and Sansone, S.A. 2003. ArrayExpress-a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 31:68-71.
Culhane, A.C., Perriere, G., Considine, E.C., Cotter, T.G., and Higgins, D.G. 2002. Between-group analysis of microarray data. Bioinformatics 18:1600-1608.
Culhane, A.C., Thioulouse, J., Perriere, G., and Higgins, D.G. 2005. MADE4: An R package for multivariate analysis of gene expression data. Bioinformatics 21:2789-2790.
Edgar, R., Domrachev, M., and Lash, A.E. 2002. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30:207-210.
Hochberg, Y. 1988. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800-803.
Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6:65-70.
Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A., and Vingron, M. 2002. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18:S96-S104.
Ihaka, R. and Gentleman, R. 1996. R: A language for data analysis and graphics. J. Comput. Graph. Stat. 5:299-314.
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y., and Barkai, N. 2002. Revealing modular organization in the yeast transcriptional network. Nat. Genet. 31:370-377.
Ikeo, K., Ishi-i, J., Tamura, T., Gojobori, T., and Tateno, Y. 2003. CIBEX: Center for information biology gene expression database. C. R. Biol. 326:1079-1082.
Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., and Speed, T.P. 2003. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31:e15.
Johansson, P. and Hakkinen, J. 2006. Improving missing value imputation of microarray data by using spot quality weights. BMC Bioinformatics 7:306.
Kapushesky, M., Kemmeren, P., Culhane, A.C., Durinck, S., Ihmels, J., Korner, C., Kull, M., Torrente, A., Sarkans, U., Vilo, J., and Brazma, A. 2004. Expression Profiler: Next generation-an online platform for analysis of microarray data. Nucleic Acids Res. 32:W465-W470.
Li, C. and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. U.S.A. 98:31-36.
Manly, K.F., Nettleton, D., and Hwang, J.T. 2004. Genomics, prior probability, and statistical tests of multiple hypotheses. Genome Res. 14:997-1001.
Pounds, S. 2006. Estimation and control of multiple testing error rates for microarray studies. Brief. Bioinform. 7:25-36.
Quackenbush, J. 2001. Computational analysis of microarray data. Nat. Rev. Genet. 2:418-427.
Quackenbush, J. 2002. Microarray data normalization and transformation. Nat. Genet. 32:496-501.
Rayner, T.F., Rocca-Serra, P., Spellman, P.T., Causton, H.C., Farne, A., Holloway, E., Irizarry, R.A., Liu, J., Maier, D.S., Miller, M., Petersen, K., Quackenbush, J., Sherlock, G., Stoeckert, C.J., White, J., Whetzel, P.L., Wymore, F., Parkinson, H., Sarkans, U., Ball, C.A., and Brazma, A. 2006. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 7:489.
Smyth, G.K. 2004. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3:Article3.
Torrente, A., Kapushesky, M., and Brazma, A. 2005. A new algorithm for comparing and visualizing relationships between hierarchical and flat gene expression data clusterings. Bioinformatics 21:3993-3999.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R.B. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17:520-525.
Wu, Z., Irizarry, R., Gentleman, R., Martinez-Murillo, F., and Spencer, F. 2004. A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc. 99:909-917.

UNIT 7.14
Analyzing Gene Expression Data from Microarray and Next-Generation DNA Sequencing Transcriptome Profiling Assays Using GeneSifter Analysis Edition
Sandra Porter,1,2 N. Eric Olson,2 and Todd Smith2
1 Digital World Biology, Seattle, Washington
2 Geospiza, Inc., Seattle, Washington
ABSTRACT
Transcription profiling with microarrays has become a standard procedure for comparing the levels of gene expression between pairs of samples, or multiple samples following different experimental treatments.
New technologies, collectively known as next-generation DNA sequencing methods, are also starting to be used for transcriptome analysis. These technologies, with their low background, large capacity for data collection, and wide dynamic range, provide a powerful and complementary tool to the assays that have traditionally relied on microarrays. In this chapter, we describe two protocols for working with microarray data from pairs of samples and from samples treated with multiple conditions, and discuss alternative protocols for carrying out similar analyses with next-generation DNA sequencing data from two different instrument platforms (Illumina GA and Applied Biosystems SOLiD). Curr. Protoc. Bioinform. 27:7.14.1-7.14.35. © 2009 by John Wiley & Sons, Inc.
Keywords: gene expression, microarray, RNA-Seq, transcriptome, GeneSifter Analysis Edition, next-generation DNA sequencing
INTRODUCTION
Transcriptome profiling is a widely used technique that allows researchers to view the response of an organism or cell to a new situation or treatment. Insights into the transcriptome have uncovered new genes, helped clarify mechanisms of gene regulation, and implicated new pathways in the response to different drugs or environmental conditions. Often, these kinds of analyses are carried out using microarrays. Microarray assays quantify gene expression indirectly by measuring the intensity of fluorescent signals from tagged RNA after it has been allowed to hybridize to thousands of probes on a single chip. Recently, next-generation DNA sequencing technologies (also known as NGS or Next Gen) have emerged as an alternative method for sampling the transcriptome. Unlike microarrays, which identify transcripts by hybridization and quantify transcripts by fluorescence intensity, NGS technologies identify transcripts by sequencing DNA and quantify transcription by counting the number of sequences that align to a given transcript.
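The counting step at the heart of this digital quantification can be sketched in a few lines of Python. The transcript names and read assignments below are invented purely for illustration; real pipelines work from alignment files produced by the sequencing platform's tools.

```python
from collections import Counter

# Hypothetical alignment results: each entry is the transcript to which
# one sequenced read aligned (names are made up for this example).
aligned_reads = ["Gapdh", "Gapdh", "Actb", "Gapdh", "Dicer1", "Actb"]

# Digital expression measure: the number of reads per transcript
counts = Counter(aligned_reads)

# A simple counts-per-million style scaling so that libraries of different
# sizes can be compared (real methods, e.g., RPKM, also normalize by
# transcript length)
total = sum(counts.values())
cpm = {t: n * 1_000_000 / total for t, n in counts.items()}
```

The result is a table of integer read counts per transcript rather than a continuous fluorescence intensity, which is exactly the distinction drawn in the text.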
Although the final output from an NGS experiment is a digital measure of gene expression, with the units expressed as the numbers of aligned reads instead of intensity, the data and goals are similar enough that we can apply many of the statistical methods developed for working with microarrays to the analysis of NGS data. There are many benefits to using microarray assays, the greatest being low cost and long experience. Over the years, the laboratory methods for sample preparation and the statistical methods for analyzing data have become more standardized. As NGS becomes more commonplace, these new methods are increasingly likely to serve as a complement or alternative to microarrays. Since these assays are based on DNA sequencing rather than hybridization, the background is low, the results are digital, the dynamic range is greater, and transcripts can be detected even in the absence of a pre-existing probe (Marioni et al., 2008; Wang et al., 2009). Furthermore, once the sequence data are available, they can be aligned to new reference data sets, making NGS data valuable for future experiments. Still, until NGS assays are better characterized and understood, it is likely that microarrays and NGS will serve as complementary technologies for some years to come.
Published online September 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/0471250953.bi0714s27. Copyright © 2009 John Wiley & Sons, Inc.
In this chapter, we describe using a common platform, GeneSifter Analysis Edition (GSAE; a registered trademark of Geospiza, Inc.), for analyzing data from both microarray and NGS experiments. GSAE is a versatile Web-based system that can already be used to analyze data from a wide variety of microarray platforms.
We have added features for uploading large data sets, aligning data to reference sequences, and presenting results, which make GSAE useful for NGS as well. Both kinds of data analysis share several similar features. Data must be entered into the system and normalized. Statistical methods must be applied to identify significant differences in gene expression. Once significantly different expression patterns have been identified, there must be a way to uncover the biological meaning of those results. GSAE provides methods for working with ontologies and KEGG pathways, clustering options to help identify genes that share similar patterns of expression, and links to access information in public databases. Data-management capabilities and quality control measures are also part of the GSAE system. In both basic protocols, we present general methods for analyzing microarray data, follow those procedures with alternative procedures that can be used to analyze NGS data, and discuss the differences between the microarray protocol and the NGS alternative. Basic Protocol 1 presents a pairwise analysis of microarray data from mice that were fed different kinds of food (Kozul et al., 2008). The protocol uses data from the public Gene Expression Omnibus (GEO) database at the NCBI (Barrett et al., 2009), and demonstrates normalization of the data and the subsequent analyses. Alternate Protocol 1, for a pairwise comparison, also uses data from GEO; however, these are NGS data from the Applied Biosystems SOLiD instrument. In Alternate Protocol 1, we use a pairwise analysis to compare gene expression from single wild-type mouse oocytes with gene expression in mouse oocytes containing a knockout mutation for DICER, a gene involved in processing microRNAs (Tang et al., 2009). Basic Protocol 2 presents a general method for analyzing microarray data from samples that were obtained after multiple conditions were applied.
In this study, mice were fed two kinds of food and exposed to increasing concentrations of arsenic in their water (Kozul et al., 2008). This protocol includes ANOVA and demonstrates options for Principal Component Analysis, clustering data by samples or genes, and identifying expression patterns from specific gene families. Alternate Protocol 2, a variation on Basic Protocol 2, describes an analysis of NGS data from the Illumina GA analyzer, comparing samples from three different tissues (Mortazavi et al., 2008). Cluster analysis is included in this procedure as a means of identifying genes that are expressed in a tissue-specific manner. As with Basic Protocol 1, these studies use data from public repositories, in this case GEO and the NCBI Short Read Archive (SRA; Wheeler et al., 2008). It should be noted for both protocols that GSAE contains alternatives to the statistical tools used in these procedures and that other tools may be more appropriate, depending on the individual study.
BASIC PROTOCOL 1: COMPARING GENE EXPRESSION FROM PAIRED SAMPLE DATA OBTAINED FROM MICROARRAY EXPERIMENTS
One of the most common types of transcriptome profiling experiments involves comparing gene expression from two different kinds of samples. These conditions might be an untreated control and a treated sample, or a wild-type strain and a mutant. Since there are two conditions, we call this process a pairwise analysis. Often, the two conditions involve replicates as well. For example, we might have four mice as untreated controls and four mice that were subjected to some kind of experimental treatment. Comparing these two sets of samples requires normalizing the data so that we can compare expression within and between arrays. Next, the normalized results are compared and subjected to statistical tests to determine if any differences are likely to be significant.
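The per-gene comparison behind this step can be sketched as follows. This is a plain Python illustration of the fold change and Student t statistic computed from log2-scale replicate intensities; it is not GSAE code (GSAE also converts the statistic to a p value internally), and the replicate values below are invented.

```python
import math

def log2_fold_change_and_t(group1, group2):
    """Fold change and plain (equal-variance) t statistic for one gene.

    Inputs are log2-transformed intensities from replicate arrays,
    one list per condition.
    """
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    # pooled variance: the plain t test assumes both groups share a variance
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m2 - m1) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    # the data are log2, so a difference in means is a log ratio
    fold_change = 2 ** (m2 - m1)
    return fold_change, t

# Hypothetical log2 intensities for one gene in two groups of three mice
fc, t = log2_fold_change_and_t([10.0, 10.2, 9.8], [12.0, 12.2, 11.8])
```

A difference of 2 on the log2 scale corresponds to a 4-fold change; a large t statistic relative to the replicate scatter is what makes that change statistically credible.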
Procedures can also be applied at this stage to correct for multiple testing. Last, we use z scores, ontologies, and pathway information to explore the biology, determine whether some pathways are significantly over-represented, and elucidate what this information is telling us about our samples. Figure 7.14.1 provides an overview of this process. In this analysis, we compare the expression profiles from the livers of five mice that were fed for 5 weeks with a purified diet, AIN-76A, with the expression profiles from the livers of five mice that were fed for the same period of time with LRD-5001, a standard laboratory mouse food.
Figure 7.14.1 Overview of the process for a pairwise comparison: isolate RNA from treated and untreated mice, hybridize to microarrays, upload and normalize the data, identify differential expression (fold change, quality, statistics such as the t test, and multiple testing correction by Bonferroni, Benjamini and Hochberg, or other methods), then explore the biology (ontology, KEGG, scatter plot).
Necessary Resources
Software
GeneSifter Analysis Edition (GSAE): a trial account must be established in order to upload data files to GSAE; a trial account or license for GeneSifter Analysis Edition may be obtained from Geospiza, Inc. (http://www.geospiza.com). GSAE is accessed through the Web; therefore, Internet access is required along with an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or Apple Safari.
Files
Data files from a variety of microarray platforms may be uploaded and analyzed in GSAE, including Affymetrix, Illumina, Codelink, or Agilent arrays, and custom chips. The example data used in this procedure were CEL files from an Affymetrix array and were obtained from the GEO database at the NCBI (Accession code GSE 9630). CEL files are the best file type for use in GSAE.
To obtain CEL files, go to the GEO database at the NCBI (www.ncbi.nih.gov/geo/). Enter the accession number (in this case GSE 9630) in the section labeled Query and click the Go button. In this example, all the files in the data set are downloaded as a single tar file by selecting (ftp) from the Download column at the bottom of the page. After downloading to a local computer, the files are extracted, unzipped, then uploaded to GSAE as described in the instructions.
Files used for the AIN-76 group: GSM243398, GSM243405, GSM243391, GSM243358, and GSM243376.
Files used for the LRD-5001 group: GSM243394, GSM243397, GSM243378, GSM243382, and GSM243355.
A demonstration site with the steps performed below and the same data files can be accessed from the data center at http://www.geospiza.com.
Uploading data
1. Create a zip archive from your microarray data files.
a. If using a computer with a Microsoft Windows operating system, a commonly used program is WinZip.
b. If using Mac OS X, select your data files, click the right mouse button, and choose Compress # Items to create a zip archive.
2. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net). A username and password are provided when a trial account is established.
3. Locate the Import Data heading in the Control Panel on the left-hand side of the screen and click Upload Tools.
Several types of microarray data can be uploaded and analyzed in GSAE. Since different microarray platforms produce data in a variety of formats, each type of microarray data has its own upload wizard. In this protocol, we will be working with Affymetrix CEL data from the NCBI GEO database, and so we choose the option for “Advanced upload methods.” This option also allows you to normalize data during the upload process using standard techniques for Affymetrix data such as RMA, GC-RMA, or MAS5. Instructions for using other GSAE upload wizards are straightforward and are available in the GSAE user manual.
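The across-array step that RMA and GC-RMA perform during upload is quantile normalization. The sketch below (plain Python, for illustration only; GSAE runs this internally) also makes clear why all chips must be processed together: each value is replaced by the mean, taken across all chips, of the values holding the same rank.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists (one per chip).

    After normalization, every chip has exactly the same distribution of
    values; only the ordering of probes within each chip is preserved.
    """
    n_chips = len(arrays)
    n_probes = len(arrays[0])
    # rank the probes within each chip (indices sorted by intensity)
    orders = [sorted(range(n_probes), key=lambda i: a[i]) for a in arrays]
    # mean intensity at each rank, averaged across all chips
    rank_means = [
        sum(arrays[c][orders[c][r]] for c in range(n_chips)) / n_chips
        for r in range(n_probes)
    ]
    # write the rank means back in each chip's original probe order
    result = [[0.0] * n_probes for _ in range(n_chips)]
    for c in range(n_chips):
        for r, i in enumerate(orders[c]):
            result[c][i] = rank_means[r]
    return result
```

Because the rank means are computed across every chip in the batch, adding a new chip later changes the result for all chips, which is exactly why the protocol warns that data to be compared must be uploaded at the same time.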
RMA and GC-RMA are commonly used normalization procedures (Millenaar et al., 2006). Both of these processes involve three distinct operations: global background normalization, across-array normalization, and log2 transformation of the intensity values. One point to note here is that if you plan to use RMA or GC-RMA, the across-array normalization step requires that all the data be uploaded at the same time. If you wish to compare data to another experiment at a later time, you will need to upload the data again, together with those data from the new experiment.
4. Click the Run Advanced Upload Methods button.
5. Next, select the normalization method and the array type from pull-down menus. Click the Next button (at the bottom of the screen). Choose GC-RMA normalization and the 430 2.0 Mouse array for this example.
6. In the screen that now appears, browse to locate the data file created in step 1.
7. Choose an option (radio button): Create Groups, Create New Targets, or Same as File Name. Since a pairwise analysis involves comparing two groups of samples, choose Create Groups and set 2 as the value. If the experiment were to involve comparing more than two conditions, other options would be chosen. These are described in Basic Protocol 2.
8. Click the Next button. The screen for the next step will appear after the data are uploaded.
9. On the screen displayed in “step 3 of 4,” you will be asked to enter a title for your data set, assign a condition to each group, add labels to your samples if desired, and identify which sample(s) belong to which group. In this case, decide that the AIN-76A mice should be condition 1 and the LRD-5001 mice should be condition 2. Then, use the buttons to assign all the AIN-76 samples to group 1 and the LRD-5001 samples to group 2.
Comparing paired groups of samples and finding differentially expressed genes
10.
Begin by selecting Pairwise from the Analysis section of the control panel (Fig. 7.14.2).
11. Find the array or gene set that corresponds to your experiment. In this case, our array is named “Mouse food and arsenic.”
12. Select the spyglass to set up the analysis. A new page will appear with a list of all the samples in the array as well as the analysis options.
13. Use the checkboxes in the group 1 column to select the samples for group 1, and the checkboxes in the group 2 column to select the samples for group 2. Usually, the control, wild-type, or untreated samples are assigned to group 1. Here, assign the AIN-76A samples to group 1 and the LRD-5001 samples to group 2.
14. Choose the advanced analysis settings. Since the data were normalized during the uploading process by the GC-RMA algorithm, we can use some of the default settings for the analysis. If you choose a setting that is not valid for RMA- or GC-RMA-normalized data, warnings will appear to let you know that the data are already normalized or already log transformed.
a. Normalization: Use None with RMA- or GC-RMA-normalized data. This step has already been performed, since RMA and GC-RMA both perform quantile normalization during the upload process.
b. Statistics: The statistical tests available from the pull-down menu are used to determine the probability that the differences between the mean intensity measurements for each gene (or probe), from a set of replicate samples, are significant. The significance level for each gene is reported as a p value.
Figure 7.14.2 Setting up a pairwise comparison: (1) select Pairwise, (2) select the gene set, (3) assign samples to the two groups, (4) choose analysis settings, (5) click Analyze.
When multiple replicates of a sample are used, GSAE users can choose between the t test, Welch’s t test, a Wilcoxon test, and no statistical test. The t test is commonly used for this step when samples from a controlled experiment are being compared. The t test assumes a normal distribution with equal variance. Other options are Welch’s t test, which does not assume equal variance, and the Wilcoxon test, a nonparametric rank-sum test. Since all of these tests look at the variation between replicates, you must have at least two replicates for each group to apply these tests. For the Wilcoxon test, you must have at least four replicates. Use the t test for this example.
c. Quality (Calls): The quality options in this menu are N/A, A (absent), M (marginal), or P (present). However, neither RMA nor GC-RMA produces quality values, so N/A is the appropriate choice when these normalization methods are used.
d. Exclude Control Probes: Selecting this checkbox excludes positive and negative control probes from the analysis. This step can be helpful because it cuts down on the number of tests and minimizes the penalty from the multiple testing correction. For our example, check this box.
e. Show genes that are up-regulated or down-regulated: Use the checkboxes to choose both sets of genes or one set. Check both boxes for this example.
f. Threshold: The Lower threshold menu allows you to filter the results by the change in expression levels. For example, picking 1.5 as the lower threshold means that genes will only appear in the list if there is at least a 1.5-fold difference in expression between the two groups of samples. Use a setting of 1.5 as the Lower limit and None as the Upper limit.
g.
Correction: Every time gene expression is measured, in a microarray or Next Gen experiment, there is a certain probability that the results will be identified as significantly different, even though they are not. These kinds of results can be described as false positives or type I errors. As we increase the number of genes tested, we also increase the probability of seeing false positives. For example, if we have a p value of 0.05, we have a 5% chance that the gene expression difference between the two groups resulted from chance. When a large data set such as one generated by a microarray experiment is analyzed, with a list of 10,000 genes (an average-sized microarray), about 500 of those genes could be incorrectly identified as significant. The correction methods in this menu are designed to compensate for this kind of result. Four different options are available in GSAE to adjust the p values for multiple testing and minimize the false-discovery rate. Since these methods correct the p values obtained from statistical tests, the corrections are only applied if a statistical test, such as a t test, has been used. GSAE offers the following correction methods: Bonferroni, Holm, Benjamini and Hochberg, and Westfall and Young maxT. The Bonferroni and Westfall and Young corrections control a family-wise error rate. This is a very conservative criterion: at the 5% level, it limits to 5% the chance of obtaining even a single false positive across all the genes tested. The Benjamini and Hochberg correction controls a False Discovery Rate. With this method, when the error rate is set to 0.05, about 5% of the genes considered statistically significant are expected to be false positives. Benjamini and Hochberg is the least stringent of the four choices, allowing a greater number of false positives and fewer false negatives. When it comes to choosing a correction method, we choose depending on our experimental goal.
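The arithmetic behind these statements is easy to verify. A short illustration in Python, using the 10,000-gene example from the text:

```python
# With m independent tests at significance level alpha, and assuming no
# gene truly changes, the expected number of false positives and the
# chance of seeing at least one follow directly:
alpha, m = 0.05, 10_000
expected_false_positives = alpha * m             # about 500 on a 10,000-gene array
p_at_least_one_error = 1 - (1 - alpha) ** m      # essentially certain

# Bonferroni controls the family-wise error rate by testing each gene
# at the much stricter level alpha / m:
bonferroni_level = alpha / m                     # 5e-06 per gene
p_family_error = 1 - (1 - bonferroni_level) ** m # back down to roughly 0.05
```

This is why an uncorrected 10,000-gene analysis is practically guaranteed to contain false positives, and why the family-wise corrections are described above as very conservative: every individual gene must clear a threshold 10,000 times stricter.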
If our goal is discovery, we can tolerate a greater number of false-positive results in order to minimize the number of false negatives. If we choose a method to minimize false positives, we have to realize that some of the real positives may be missed. Genes with real differences in expression may appear to show an insignificant change after multiple testing corrections are applied. One of the most important reasons for using these tests is that they allow the user to rank genes in order of the significance of change, and to make choices about which changes to investigate. For this example, choose Benjamini and Hochberg.
h. Data Transformation: Our data are already log2 transformed, since RMA and GC-RMA both carry out this step during the upload process. Choose Data Already Log Transformed for this example.
i. Click the Analyze button. A page with results appears when the processing step is complete.
Investigating the biology
Figure 7.14.3 shows the results from our pairwise analysis of the microarray data: the differentially expressed genes. Pull-down menus in the middle of the page contain options for sorting and changing the views. You may increase the number of genes in the list; sort by the ratio, p value, or adjusted p value; choose a p-value cutoff so that genes are only shown if their p values are below a certain number; and change the presentation from the raw p value to the adjusted p value.
Figure 7.14.3 Analyzing the results from a pairwise comparison. Callouts on the results page show where to change to the adjusted p value, save your results, click Search to apply changes, select a name to view the gene summary, view changes by ontology or the z score, view changes in KEGG pathways, and see the scatter plot. In this example, 539 genes show a significant change, and the top gene is expressed about 7-fold more highly in mice fed with LRD-5001.
After choosing selections from the menus, click the Search button to show the results. When this page first appears, our results show a list of 764 genes that are differentially expressed. Arrows on the left side of each gene ratio point up if a gene shows an increase in expression relative to the first group, or down if a gene shows decreased expression. The ratio shows the extent of up- or down-regulation. When this page first appears, the list is filtered by the raw p value.
15. Filter based on the corrections for multiple testing by selecting “adjusted p” from the raw p value menu and clicking the Search button. Choosing “adjusted p” from the left pull-down menu applies the false discovery rate calculated by the Benjamini and Hochberg correction; clicking the Search button then shows the adjusted p values and reduces the list to 539 genes.
16. Next, it can be helpful to sort the data. Initially, the data are shown sorted by ratio, so that genes with a larger fold change appear earlier in the list. It can also be helpful to sort the data by the p value or the adjusted p value to see which genes show the most significant change. Choose “Adj. p” from the Sort By menu to sort by the adjusted p value. Sorting by the adjusted p value shows that the genes with the most significant changes are the cytochrome p450, family 2, subfamily a, polypeptide 4 gene and the glutathione S-transferase, alpha 2 gene.
17. We can learn more about any gene in the list by clicking its name. Clicking the top gene in the list brings us to a page where we can view summarized information for this gene and obtain links to more information in public databases.
18. Click Scatter Plot to view the differences in gene expression another way. A new window will open with the data presented as a scatter plot (Fig. 7.14.4).
Figure 7.14.4 Scatter plot. Each spot in the graph represents the expression measurements for one gene; the expression level for group 2 (LRD-5001, water 0) is plotted on the y axis and the value for group 1 (AIN-76A, water 0) on the x axis, on log-scaled axes running from 1 to 100,000. Up-regulated genes appear above the diagonal line and down-regulated genes below it. Dragging the box over the genes you wish to view in detail and clicking the zoom button opens a detail view; clicking a spot shows gene summary information in the lower corner. In the example shown, Cytochrome P450, family 2, subfamily a, polypeptide 4 has a mean of 10.0036 (SEM 0.1127, 1.13% of mean, N = 5) in the AIN-76A group and 12.7881 (SEM 0.1339, 1.05% of mean, N = 5) in the LRD-5001 group; it is up-regulated 6.9-fold compared to AIN-76A.
a. The scatter plot. The scatter plot gives us a visual picture of gene expression in the different samples. The levels of gene expression in group 1 (mice fed with AIN-76A) are plotted on the x axis and group 2 (mice fed with LRD-5001) on the y axis. Genes that are equally expressed in both samples fall on the diagonal line. Genes that are expressed more in one group or the other appear either above the line (group 2) or below the line (group 1), depending on the group that shows the highest level of expression. If we used a method to correct for the false-discovery rate, then the points for genes showing nonsignificant changes would be colored gray, up-regulated genes showing a significant change would be colored red, and down-regulated genes showing a significant change would be colored blue or green.
b. The zoom window and gene summary.
To learn more about any gene in the graph, we drag the box on top of a spot and click the “zoom” button. After a short time (up to 30 sec), the highlighted spot and surrounding spots will appear in the top right window. If spots overlap, you may separate them by dragging them with the mouse. The name of each gene will appear when the mouse is moved over a spot, and clicking a spot will produce the gene summary information in the lower right corner. In our experimental example, clicking some of the spots reveals genes that were seen earlier in the list, such as genes for members of the cytochrome P450 family and glutathione S-transferase. 19. Return to the results window and click the KEGG link. a. The KEGG report. The KEGG report, as shown in Figure 7.14.5, presents a list of biochemical and regulatory pathways that contain members from the list of differentially expressed genes on the results page. Each row shows the name of the pathway; a link to a list of gene-list members that belong to that pathway, with arrowheads showing whether each member is up- or down-regulated; a link to the KEGG pathway database; the number of genes from the list that belong to that pathway; the number of genes that are up-regulated; the number down-regulated; the total number from that pathway that were present on the array (or in the reference data set for Next Gen data); and the z scores for up- and down-regulated genes. b. z scores. z scores are used to evaluate whether genes from a specific pathway are enriched in your list of differentially expressed genes. If genes from a specific pathway appear in your gene list more often than would be expected by chance, the z scores reflect that enrichment. A z score greater than 2 indicates that a pathway is significantly enriched in the list of differentially expressed genes, while a z score below −2 indicates that a pathway is significantly under-represented in the list.
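A pathway z score of this kind can be computed from four counts. GeneSifter does not publish its exact formula here, so the sketch below uses the standard hypergeometric-approximation z statistic common in over-representation tools (an assumption), with the counts from the cytochrome P450 example that appears later in this protocol; the array size N is hypothetical.

```python
from math import sqrt

def pathway_z(N, R, n, r):
    """Over-representation z score for a pathway.

    N: genes measured (on the array or in the reference data set)
    R: genes in the differentially expressed list
    n: genes from the pathway that are measured
    r: pathway genes that appear in the list
    """
    p = R / N                                    # expected "hit" rate
    expected = n * p
    variance = n * p * (1 - p) * (1 - (n - 1) / (N - 1))
    return (r - expected) / sqrt(variance)

# 19 of the 53 measured cytochrome P450 pathway genes fall in the
# 539-gene list; N = 20000 is a hypothetical array size.
z = pathway_z(N=20000, R=539, n=53, r=19)
```

With these counts the score lands far above the 2.0 cutoff, matching the protocol's conclusion that the pathway is significantly enriched.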
The direction and color of the arrowheads show whether those genes are up- or down-regulated in the second group relative to the first group of samples. Clicking the arrows above a z score column allows you to sort by z scores for up-regulated or down-regulated genes. Click the arrowhead that points up in the z score column to sort by up-regulated genes. We can see that at least 20 pathways are up-regulated when mice are fed LRD-5001. c. Genes. Pick one of the top listed pathways and click the corresponding icon in the Genes column. A new section will appear underneath the name of the pathway. Before proceeding, look at the values in the List, Totals, and Array columns. We can see in our analysis that the cytochrome P450 pathway for metabolizing xenobiotics is significantly up-regulated and contains 19 members from our 539-member gene list. We also see that those members are all up-regulated and that there are 53 genes on the array that belong to this pathway. Now, look at the list of genes in the newly opened section. Where we had 19 genes shown as the value in the List column, there are 26 genes listed below the name of the pathway. In Figure 7.14.5, clicking the KEGG icon shows a diagram of the pathway and clicking the gene icon shows a list of the genes that belong to that pathway; genes from the list are in green boxes, and up-regulated genes are in boxes with red numbers and a red border. The numbered columns in the figure are: 1, the number of genes from the list on the analysis page that belong to this pathway; 2, the number of genes in this pathway that are up-regulated in group 2 relative to group 1; 3, the number of genes in this pathway that are down-regulated in group 2 relative to group 1; 4, the number of genes on the array that belong to this pathway; 5, the z score for the number of genes that belong to this pathway and are up-regulated (clicking the red arrow sorts the list by z scores).
6 is the z score for the number of genes that belong to this pathway and are down-regulated; clicking the green arrow sorts the list by z scores. Figure 7.14.5 KEGG pathway results. Most of the genes have different names, but some appear to be identical. For example, there are three listings for glutathione-S-transferase, mu 1. Are they really the same gene? Clicking the gene names shows us that two entries have the same accession number. One possible explanation for their duplication in the list is that they are represented multiple times on the array. It could also be that the probes were originally thought to belong to different genes and now, with a better map, are placed in the same gene. We also see that one of the three genes has a different accession number. This entry might represent a different isoform that is transcribed from the same gene. Many arrays do not distinguish between alternative transcripts and count them all together. Affymetrix arrays can also have multiple probe sets for a single gene; in these cases, the gene will appear multiple times, since intensity measurements will be obtained from each probe set. It should also be noted that some genes may belong to multiple KEGG pathways (see below). d. KEGG pathways. Click the KEGG icon to access the KEGG database and view more details for a KEGG pathway. Once we have identified KEGG pathways with significant changes, we can investigate further by selecting the links to the individual genes in that pathway, or we can select the KEGG icon to view the encoded enzymes in the context of a biochemical pathway. Clicking the boxes in the KEGG database takes us to additional information about each enzyme. In our experiment, we find that 19 of the 53 genes on the array that belong to the cytochrome P450 pathway for metabolizing xenobiotics are up-regulated.
The KEGG pathway shows some of the possible substrates for these enzymes. It would be interesting to look more closely at LRD-5001 and see if it contains naphthalene or benzopyrene, or one of the other compounds shown in the KEGG pathway. Other pathways that are up-regulated when mice are fed LRD-5001 instead of AIN-76A include biosynthesis of steroids, fatty acid metabolism, and arachidonic acid metabolism. Down-regulated pathways include those for pyruvate metabolism and glycolysis. 20. Return to the results window and click Ontology (options described below). a. Ontology reports. An overview of the ontology reports and their features is shown in Figure 7.14.6. Three kinds of ontology reports are available from Ontology: a set organized by biological process, another by cellular component, and a third by molecular function. Each report shows a list of ontologies that contain up- or down-regulated genes from the list of 539 genes. i. Ontology. Selecting the name of an ontology allows you to drill down and view subontologies. ii. Genes. Clicking the icon in the Genes column shows the genes from the gene list that belong to that ontology. iii. GO. Clicking the GO icon opens the record for the ontology in the AmiGO database. iv. List. The List column shows the total number of genes from the gene list, both up- and down-regulated, that have that ontology as part of their annotation. v. Totals (up or down). One column contains the number of up-regulated genes in the list that belong to an ontology; the other shows the number of down-regulated genes that belong to that ontology. vi. Array. This value shows the number of probes on the microarray that could correspond to genes in an individual ontology. vii. z score. As with the KEGG report, the z score provides a way to determine whether a specific ontology is over- or under-represented in the list of differentially expressed genes. Significant z scores are above 2 or below −2.
We cannot sort by z scores on the ontology report pages, but we can sort by z scores from the z-score report. viii. Pie graph. The pie graph depicts the ontologies in the list and the numbers of their members. b. z-score reports. Each ontology report page contains a link to a z-score report. Where the ontology reports show ontologies through a hierarchical organization, the z-score report shows all the ontologies with significant z scores, without the need to drill down into the hierarchy. This is helpful both because significant z scores can be hidden inside a hierarchy and because this report allows you to sort by z scores. It should also be noted that some genes may belong to multiple ontologies. When we look at the ontology information for our experiment, we can see that the most significant ontologies in biological process are metabolism, cellular processes, and regulation; for cellular component, cells and cell parts are significant; and for molecular function, catalytic activity and electron carrier activity are significant. When we look at the z-score report for molecular function and sort our results by up-regulated genes, we see that many genes show oxidoreductase and glutathione S-transferase activity, which is consistent with our findings from the KEGG report. Selecting the Genes icon shows us that those genes are cytochrome P450s. Taking all of our data together, we can conclude that genes for breaking down substances like xenobiotics are expressed more highly when mice are fed LRD-5001 than when they are fed AIN-76A. Figure 7.14.6 Gene ontology reports. ALTERNATE PROTOCOL 1: COMPARE GENE EXPRESSION FROM PAIRED SAMPLES OBTAINED FROM TRANSCRIPTOME PROFILING ASSAYS BY NEXT-GENERATION DNA SEQUENCING Several experiments have been published recently in which NGS or “Next Gen” technologies were used for transcriptome profiling.
NGS experiments have three phases for data analysis. First, there is a data-collection phase, where the instrument captures information, performs base-calling, and creates the short DNA sequences that we refer to as “reads.” Next, there is an alignment phase, where reads are aligned to a reference data set and counted. Last, there is a comparison phase, where the numbers of read counts can be used to gain insights into gene expression. Many of the steps in the last phase are similar to those used in the analysis of microarray data. In this protocol, we will describe analyzing data from two NGS data sets and their replicates. These data were obtained from an experiment to assess the transcriptome of single cells (mouse oocytes) with different genotypes (Tang et al., 2009). In one case, wild-type mouse oocytes were used. In the other case, the mouse oocytes had a knock-out mutation for DICER, a gene required for processing microRNAs. We will discuss uploading and aligning the data, view the types of information obtained from the alignment, and compare the two samples to each other, noting where the NGS data analysis process differs from a pairwise comparison of samples from microarrays. Necessary Resources Software GeneSifter Analysis Edition (GSAE): a trial account must be established in order to upload data files to GSAE; a license for the GeneSifter Analysis Edition may be obtained from Geospiza, Inc. (http://www.geospiza.com) GSAE is accessed over the Web; therefore, Internet access is required along with an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or Apple Safari Files Data files may be uploaded from a variety of sequencing instruments. For the Illumina GA analyzer, the data are text files containing FASTA-formatted sequences. Data from the ABI SOLiD instrument are uploaded as csfasta files.
The example NGS data used in this procedure were generated by the ABI SOLiD instrument and obtained as csfasta files from the GEO database at the NCBI (Accession number GSE14605). The csfasta files are obtained as follows. The accession number GSE14605 is entered in the data set search box at the NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo/) and the Go button is clicked. The csfasta files are downloaded for both wild-type mouse oocytes and DICER knockout mouse oocytes by clicking the links to the file names and clicking (ftp) for the gzipped csfasta files: GSM365013.filtered.csfasta.txt.gz, GSM365014.filtered.csfasta.txt.gz, GSM365015.filtered.csfasta.txt.gz, GSM365016.filtered.csfasta.txt.gz. 1. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net). Uploading data 2. Locate the Import Data heading in the Control Panel and click Upload Tools. The uploading and processing steps described for Next Gen data sets require a license from Geospiza. However, you may access data that have already been uploaded and processed from a demonstration site. The demonstration site can be accessed from the data center at http://www.geospiza.com. 3. Click the Next Gen File Upload button to begin uploading Next Gen data. 4. Enter a name for a folder. Folders are used to organize Next Gen data sets. 5. Click the Next button. 6. Two windows will appear for managing the upload process. Use the controls in the left window to locate your data files. Once you have found your data files, select them with your mouse and click the blue arrowhead to move those files into the Transfer Queue. 7. Once the files you wish to transfer are in the Transfer Queue, highlight those files and click the blue arrow beneath the Transfer Queue window to begin transferring data.
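If the downloaded .csfasta.txt.gz files need to be decompressed before upload (an assumption; your upload pipeline may accept gzipped files directly), Python's standard gzip module can do it. The sample content below is a tiny stand-in written locally so the sketch is self-contained; for real data, src would be one of the GSM365013-GSM365016 files named above.

```python
import gzip
import shutil

def gunzip_file(src, dst):
    """Decompress a .gz file (e.g., a downloaded .csfasta.txt.gz) to dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# Self-contained demonstration with a tiny stand-in for a csfasta file.
sample = b"# SOLiD csfasta\n>1_52_148_F3\nT3.002010021120002\n"
with gzip.open("sample.csfasta.txt.gz", "wb") as f:
    f.write(sample)
gunzip_file("sample.csfasta.txt.gz", "sample.csfasta.txt")
```

Using shutil.copyfileobj streams the file in chunks, which matters for multi-gigabyte Next Gen data sets.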
Transferring data will take a variable amount of time depending on your network, the volume of network traffic, and the amount of data you are transferring. A 2-GB Next Gen data set will take at least 40 min to upload. Aligning Next Gen data to reference data Once the data have been uploaded to GSAE, the reads in each data set are aligned to a reference data source. During this process, the number of reads mapping to each transcript is counted and normalized to the number of reads per million reads (RPM) so that data may be compared between experiments. 8. Access uploaded Next Gen data sets by clicking Next Gen in the Inventories section of the control panel. 9. Use the checkboxes to select data sets for analysis, then click the Analyze button on the bottom right side of the table. A new page will appear. 10. Choose the Analysis Type, Reference Species, and Reference Type from the corresponding pull-down menus. a. Analysis Type: The Analysis Type is determined by the kind of data that were uploaded and the kind of experiment that was performed. For example, if you uploaded SOLiD data, analysis options specific to that data type would appear as choices in the menu. For SOLiD data, the alignment algorithm is specific for data in csfasta format. Choose RNA-Seq (SOLiD, 3 passes). b. Reference Species: The Reference Species is determined by the source of your data. If your data came from human tissues, for example, you would select “Homo sapiens” as the reference species. Since our data came from mouse, choose “Mus musculus.” c. Reference Type: The choices for Reference Type become available in the Reference Type menu after you have selected the analysis type and reference species. The Reference Type refers to the kind of reference data that will be used in the alignment. Since we are measuring gene expression, choose “mRNA” as the reference type. This reference data set contains the RNA sequences from the mouse RefSeq database at the NCBI. 11.
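The RPM normalization described above amounts to scaling each transcript's read count by the run's sequencing depth. The sketch below is generic (not GeneSifter's code), and the counts are hypothetical; the gene symbols are borrowed from the gene list shown later in this protocol.

```python
def to_rpm(counts, total_mapped_reads):
    """Normalize per-transcript read counts to reads per million (RPM)."""
    scale = 1_000_000 / total_mapped_reads
    return {gene: n * scale for gene, n in counts.items()}

# Hypothetical counts from two runs sequenced to different depths:
run_a = to_rpm({"Bcl2l10": 500, "Spin1": 50}, total_mapped_reads=5_000_000)
run_b = to_rpm({"Bcl2l10": 1000, "Spin1": 100}, total_mapped_reads=10_000_000)
# After normalization the two runs agree, even though the deeper run
# produced twice as many raw reads for each transcript.
```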
Click the checkbox for “Create Experiment(s) upon completion.” This selection organizes your data as an experiment, allowing you to compare expression between samples after the analysis step is complete. In order to set up experiments, GeneSifter must already contain an appropriate Gene Set. A Gene Set is derived from the annotations that accompany the reference data source. 12. Click the Analyze button to queue the Next Gen data set for analysis. 13. The analysis step may take a few hours depending on the size of your data file and the number of samples that need to be processed. Figure 7.14.7 Analysis results from NGS data obtained from an ABI SOLiD instrument. The read mapping statistics for the RNA-Seq analysis of GSM365013.filtered.csfasta.txt show 12,537,930 total reads, of which 1,993,626 (15.90%) were unmapped and 10,544,304 (84.10%) were mapped; 1,759,919 (14.04% of all reads) mapped non-uniquely and 8,784,385 (70.06% of all reads) mapped uniquely, with 6,163,136 (70.16% of uniquely mapped reads) aligning with 0 mismatches, 1,746,203 (19.88%) with 1 mismatch, 875,046 (9.96%) with 2 mismatches, and none mapping uniquely to ribosomal sequences. The page also charts uniquely mapped reads by chromosome, shows the job details (job ID, analysis type, input file, state, and start and completion times), and links to the analysis results: read alignment statistics (HTML), gene lists (text and HTML), the job log file, and the standard error and standard output.
The gene list shown in Figure 7.14.7 begins with the following transcripts: 137275 reads, RPKM 12756.87, RPM 15627.16, Gene ID 12049, NM_013479, Bcl2-like 10 (Bcl2l10), chromosome 9, mRNA; 115536 reads, RPKM 11806.49, RPM 13152.43, Gene ID 171506, NM_138311, H1 histone family, member O, oocyte-specific (H1foo), chromosome 6, mRNA; 103361 reads, RPKM 8452.91, RPM 11766.45, Gene ID 21432, NM_009337, T-cell lymphoma breakpoint 1 (Tcl1), chromosome 12, mRNA; 87655 reads, RPKM 2412.60, RPM 9978.50, Gene ID 20729, NM_011462 and NM_146043, spindlin 1 (Spin1), chromosome 13, mRNA; and 80718 reads, RPKM 4851.53, RPM 9188.80, Gene ID 72114, NM_028106, zinc finger, BED domain containing 3 (Zbed3), chromosome 13, mRNA. Viewing the Next Gen alignment results 14. When the alignment step is complete, you will be able to view different types of information about your samples. Click the file name to get to the analysis details page for your file, then click the Job ID to get the information from the analysis. 15. The exact kinds of information will depend on the data type and the algorithms that were used to align the reads to the reference data source (Fig. 7.14.7). The types of information seen from Illumina data will be described in the next protocol. For SOLiD data, you will see information that includes: a. Read alignment statistics: These include the total number of reads and the numbers that were mapped, unmapped, or mapped to multiple positions. Sets of reads can also be downloaded from the links on this page. b. Gene list (text): A gene list can be downloaded as a text file after the alignment is complete. c. Gene list (HTML): The gene list (HTML) shows a table with information for all the transcripts identified in this experiment. i. Reads: A read is a DNA sequence obtained, together with many other reads, from a single sample. Typical reads from Next Gen instruments such as the ABI SOLiD and the Illumina GA are between 25 and 50 bases long.
The number of reads in the first column equals the number of reads from a single sample that were aligned to the reference data set, in this case, RefSeq RNAs. ii. RPKM: Reads per kilobase of transcript, per million reads. This column shows the number of reads for a given transcript divided by the length of the transcript (in kilobases) and normalized to 1 million reads. Dividing the number of reads by the transcript length corrects for the greater number of reads that would be expected to align to a longer molecule of RNA. iii. RPM: Reads per million reads. iv. Entrez: This column contains links to the corresponding entries in the Entrez Gene database. v. Image maps: Image maps are used to show where reads align to each transcript. The transcripts in these images are all different lengths. vi. RefSeq ID: The RefSeq accession number for a given transcript. vii. Title: The name of the gene from RefSeq. viii. Gene ID: The symbol for that gene. ix. Chrom: The chromosomal location for a gene. x. Type: The type of RNA molecule. Comparing paired samples and finding differentially expressed genes In the next step, the numbers of reads mapping to each transcript are compared in order to quantify differential gene expression between the samples. This process is similar to the one used in Basic Protocol 1: we will set up our analysis, apply statistics to correct for multiple testing, then view the results from the scatter plot, KEGG pathways, and ontology reports to explore the biology. 16. Locate the Analysis section in the GSAE Control Panel and select Pairwise. A list of potential array/gene sets will appear. The gene sets correspond to the results from analyzing Next Gen data. Clicking the name of a gene set will allow you to view the samples that belong to that set. 17. To set up the analysis, either click the spyglass on the left of a gene set, or click the name of the gene set and choose “Analyze experiments from this array” from the middle of the window.
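The RPKM calculation described in item ii can be written out directly. This is a sketch of the standard formula, with made-up read counts chosen to show why the length correction matters.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript, per million mapped reads."""
    return (read_count
            / (transcript_length_bp / 1000)      # correct for length
            / (total_mapped_reads / 1_000_000))  # correct for depth

# A 2,000-base transcript with 400 reads in a 10-million-read sample
# is expressed at the same level as a 1,000-base transcript with 200
# reads: the longer transcript simply collects more reads.
long_tx = rpkm(400, 2000, 10_000_000)
short_tx = rpkm(200, 1000, 10_000_000)
```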
A page will appear where you can assign samples to a group and pick the analysis settings. 18. Use the checkboxes to assign one sample (or set of samples) to group 1 (these are often the control samples) and the other sample (or set of samples) to group 2. Assign the two sets from wild-type mouse oocytes to group 1 and the two sets from the DICER knock-out oocytes to group 2. 19. Use the pull-down menus to select the advanced analysis settings. a. Normalization. This step involves normalizing data for differences in signal intensity within and between arrays. This type of normalization does not apply to Next Gen data, since Next Gen measurements are derived from the number of reads that map to a transcript rather than the intensity of light. Next Gen sequence data are normalized by GSAE, but this happens during the alignment phase: the number of reads from each experiment is normalized to the number of mapped reads per million reads (RPM). This allows data from different experiments to be compared. For this example, choose “None” from the menu. b. Statistics. The statistical tests available from this menu are used to determine whether the differences between the mean numbers of read counts (or intensity measurements, in the case of microarrays) from a set of replicate samples are significant. The significance levels are reported as p values, i.e., the probability of seeing a result by chance. For this example, choose “t test” for the statistics. c. Quality. For Next Gen data, the quality values correspond to the number of reads per million reads and range from 0.5 to 100. For this example, set the quality at “1”, meaning that we will only look at transcripts where the RPM value is at least 1 in one of the samples being compared. d. Show genes that are up-regulated and/or down-regulated.
Selecting the checkboxes allows you to choose whether to limit the view to up-regulated or down-regulated genes, or to show both types. For this example, check both boxes. e. Threshold, Lower. The threshold corresponds to the fold change. For this example, choose 1.5 as the lower threshold. f. Threshold, Upper. This option is usually set to “none”; however, if you wish to filter out highly expressed genes, you can set an upper threshold. For this example, leave the upper threshold at “none.” g. Correction. For this example, choose the Benjamini and Hochberg correction. h. Data Transformation. Use these buttons to choose whether or not the data will be log transformed. Log transformations are often used with microarray data to make the data more normally distributed. For this example, apply a log transformation to the data. 20. Click the Analyze button. When the analysis is complete, the results page will appear. Viewing the results The results page shows the two groups of samples that were compared and the conditions that were used for the comparison. All the genes that varied in expression by at least 1.5-fold are listed in a table on this page. 21. Choose “adjusted p” from the last menu and click the Search button. Adjusted p values are the p values obtained after the multiple-testing correction (in our case, Benjamini and Hochberg) has been applied. In this analysis, choosing the adjusted p value decreases the number of differentially expressed genes from 1449 to 28. As noted earlier, although the multiple testing correction provides a way to sort genes by significance, genes that truly change may be missed when these corrections are applied. To view additional genes that may be candidates for study, you can raise the cut-off limit for the adjusted p values, using the pull-down menu, or skip the multiple test correction altogether.
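The threshold and direction settings chosen in steps 19d-f amount to a fold-change filter like the one sketched below. This is generic filtering logic, not GeneSifter's code; the group means passed in are illustrative.

```python
def passes_threshold(mean_g1, mean_g2, lower=1.5, upper=None,
                     up=True, down=True):
    """Apply the fold-change filter to a gene's (unlogged) group means.

    A gene passes if its fold change in either direction is at least
    `lower` (and below `upper`, when one is set), subject to the
    up/down checkboxes.
    """
    if mean_g1 <= 0 or mean_g2 <= 0:
        return False
    fold = mean_g2 / mean_g1
    direction_ok = (up and fold > 1) or (down and fold < 1)
    magnitude = max(fold, 1 / fold)       # symmetric fold change
    in_range = magnitude >= lower and (upper is None or magnitude < upper)
    return direction_ok and in_range

# A 6.9-fold increase passes the 1.5-fold cutoff; a 1.2-fold change
# in either direction does not; a 2-fold decrease passes only when
# the down-regulated checkbox is checked.
```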
Interpreting the results After adjusting the p value, only 28 genes in our set show significant changes. It is helpful at this point to save our results before proceeding to further analyses. Since the reports that we would use next (the scatter plot, KEGG pathway information, and ontology reports) are the same as in Basic Protocol 1, we refer the reader to the earlier protocol for instruction. The one point we would like to discuss here is interpreting the gene summary and the differences between the gene summaries for microarray and Next Gen data. Each gene in the list is accompanied by a summary that can be accessed by clicking the gene name. The summary page presents information about expression levels at the top and links to external databases in the bottom half. Summaries from both microarray data and Next Gen data (Fig. 7.14.8) show the number of samples (N), along with the values for each sample and the standard error of the mean. Where the two kinds of summaries differ is in the intensity and quality values. For microarray data, the columns labeled “intensity values” do show the intensity data. If the data were log transformed during the upload process or the analysis, then the log-transformed values are reported. For Next Gen data, however, the values in the “intensity values” column are not intensity values: they refer to the normalized number of reads that were mapped to a gene (RPM). If the data were log transformed during the analysis, then these values are the log-transformed values. Figure 7.14.8 Gene summaries for microarray and NGS data. A gene summary from a microarray sample is shown in the top half of the image and a summary for a sample analyzed by NGS is shown in the bottom half. Note the difference between the intensity and quality values.
The other difference between these data for the two systems is in the quality column. For Next Gen data, the quality column shows the RPM value for that gene. In the quality column for the Next Gen data, two of the samples show quality values of zero. This means that no transcripts were detected. The other two samples show values around 6, indicating that approximately 6 transcripts per million reads were detected for the Drebrin-like gene. BASIC PROTOCOL 2 COMPARING GENE EXPRESSION FROM MICROARRAY EXPERIMENTS WITH MULTIPLE CONDITIONS GSAE has two modes for analyzing data, depending on the number of factors that are tested. If two factors are compared, such as treated and untreated, or wild-type and mutant samples, then a pairwise analysis, as described in Basic Protocol 1, is used to compare the results. If an experiment involves multiple conditions, such as a time course, different drug dosing regimens, and perhaps even different genotypes, then the analysis is considered a project. GSAE projects have additional analysis capabilities, as well as different statistical procedures for identifying significant changes in expression. Among the tests that can be performed with GSAE are a one-way ANOVA, a two-way balanced ANOVA, and a non-parametric Kruskal-Wallis test. Corrections for multiple testing, such as those from Bonferroni, Holm, and Benjamini and Hochberg, can also be applied. Additional analyses include clustering and using the Pearson coefficient to look for patterns of expression. Specific searches for genes by name, characteristic, or function can also be performed. In Basic Protocol 2, we describe a general procedure (shown in Fig. 7.14.9) for analyzing microarray data from specimens that were obtained from different treatments. An alternative procedure (Alternate Protocol 2) follows, in which we demonstrate a multiple-condition analysis with Next Gen data from the Illumina GA analyzer.
The samples used in Basic Protocol 2 were obtained from the GEO database. These samples came from the same study described in Basic Protocol 1. RNA was isolated from mouse livers where two factors were examined: diet and arsenic in the drinking water. Over a 5-week period, the mice were fed two kinds of food, AIN-76A, a purified diet, or LRD-5001, a standard laboratory mouse food, and given arsenic in their water at three different concentrations (0, 10 ppb, or 100 ppb). There were four to five biological replicates (mice) for each treatment. We will demonstrate setting up the analysis, applying statistics and multiple testing corrections, and using some of the clustering tools. Some of the clustering methods, PAM and CLARA, will be discussed in Alternate Protocol 2 rather than Basic Protocol 2. Necessary Resources Software GeneSifter Analysis Edition (GSAE): a trial account must be established in order to upload data files to GSAE; a license for the GeneSifter Analysis Edition may be obtained from Geospiza, Inc. (http://www.geospiza.com) GSAE is accessed over the Web; therefore, Internet access is required along with an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or Apple Safari. 
Files Data files from a variety of microarray platforms may be uploaded and analyzed in GSAE, including Affymetrix, Illumina, Codelink, or Agilent arrays, and custom chips. Figure 7.14.9 Overview of an experiment comparing multiple conditions. RNA is isolated from groups of replicate mice under each condition (no treatment and treatment with 10, 100, or 1000 ppb arsenic) and hybridized to microarrays; the data are uploaded and normalized; differential expression is identified by fold change, quality, ANOVA, and multiple testing corrections (Bonferroni, Benjamini and Hochberg, and others); results are visualized with hierarchical clustering, PCA, partitioning (PAM), and silhouettes; and the biology is explored through the ontology and KEGG reports and the scatter plot. The example data used in this procedure were CEL files from an Affymetrix 430 2.0 Mouse array and were obtained from the GEO database at the NCBI (Accession code GSE9630). CEL files are the best file type for use in GSAE. To obtain CEL files, go to the GEO database at the NCBI (http://www.ncbi.nlm.nih.gov/geo/). Enter the accession number (in this case GSE9630) in the section labeled “Query” and click the Go button. In this example, all the files in the data set are downloaded as a single tar file by selecting (ftp) from the Download column at the bottom of the page. After downloading to a local computer, the files are extracted, unzipped, and uploaded to GSAE as described in the instructions.
Files used for the AIN-76A group with 0 ppb arsenic: GSM243398, GSM243405, GSM243391, GSM243358, and GSM243376; for the AIN-76A group with 10 ppb arsenic: GSM243359, GSM243400, GSM243403, GSM243406, and GSM243410; for the AIN-76A group with 100 ppb arsenic: GSM243353, GSM243365, GSM243369, and GSM243377; for the LRD-5001 group with 0 ppb arsenic: GSM243394, GSM243397, GSM243378, GSM243382, and GSM243355; for the LRD-5001 group with 10 ppb arsenic: GSM243374, GSM243380, GSM243381, GSM243385, and GSM243387; and for the LRD-5001 group with 100 ppb arsenic: GSM243354, GSM243356, GSM243383, GSM243390, and GSM243392.

A demonstration site with the same files and analysis procedures can be viewed from the data center at http://www.geospiza.com.

Uploading data
1. Create a zip archive from your microarray data files.
a. If using a computer with a Microsoft Windows operating system, a commonly used program is WinZip.
b. If using Mac OS X, select your data files, click the right mouse button, and choose “Compress # Items” to create a zip archive. The resulting archive file will be called Archive.zip.
2. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net).
3. Locate the Import Data heading in the Control Panel on the left-hand side of the screen and click Upload Tools.
Several types of microarray data can be uploaded and analyzed in GSAE. See Basic Protocol 1 for detailed descriptions. Our data were uploaded using the option for “Advanced upload methods” and normalized with GC-RMA.
4. On the page that appears, click the Run Advanced Upload Methods button.
5. Select the normalization method and the array type from pull-down menus.
For this example, use GC-RMA and the 430 2.0 Mouse array. Click the Next button.
6. In the screen which now appears, browse to locate the data file on your computer.
7. Choose the option (radio button) for Create Groups, Create New Targets, or Same as File Name.
Our data came from mice that were given two different kinds of food and drinking water with three concentrations of arsenic, so six groups were created. Therefore, set 6 as the value next to Create Groups.
8. Click the Next button.
The screen for the next step will appear after the data are uploaded.
9. On the screen displayed in “step 3 of 4,” you will be asked to enter a title for your data set, assign a condition to each group, add labels to your samples if desired, and identify which sample(s) belong to which group.
In our case, we have six conditions (see Table 7.14.1), with four to five biological replicates (targets) for each condition. We kept the original file names as the target or sample names.

Setting up a project for analysis
10. We begin the analysis process by creating a project. Select New Project from the Create New section, add a title for the project, click the checkbox next to the array that contains the samples, and click the Continue button.

Table 7.14.1 Conditions Used for the Example in Basic Protocol 2

Condition   Mouse food   Arsenic in water (ppb)
1           AIN-76A      0
2           AIN-76A      10
3           AIN-76A      100
4           LRD-5001     0
5           LRD-5001     10
6           LRD-5001     100

11. Enter the name for the control group as the group name and any descriptive information in the Description field.
12. Choose a Normalization option.
Leave the setting at None because our data were log transformed and normalized when we used GC-RMA during the upload process.
13. Choose a Data Transformation option.
Leave the setting at “Data already log transformed,” again because GC-RMA was applied during the upload process.
14. Select a group for a control sample and use the arrow button to move that group to the box on the right side.
Choose AIN-76A with 0 ppb arsenic as the control sample.
15.
Select the other groups that will be part of the analysis and move them to the right side by clicking the arrow button.
16. Click the Create Group button.
17. Next, select the samples for each condition.
A new page will appear with a list of all the conditions and all the samples for each condition.
18. Choose the samples that will be used in the analysis. You may choose the samples one by one, or if all the samples will be used, click Select All Experiments.
Click Select All Experiments for this example.
19. Click the Create Group button.
A small window will appear while data are processing. When the processing step is complete, a new page will appear stating that your project has been created. From this point, you can continue the analysis by selecting Analyze This Project or you can analyze the project at a later time.

Identifying differential gene expression
20. Select Projects from the Analysis section of the Control Panel.
21. Choose the project name to review the box plots for the samples and replicates in the project.
When we analyze multiple samples, GSAE creates box plots that allow us to evaluate the variation between experimental groups and the replicate samples within each group. The box plot, also known as a “box and whiskers plot,” shows the averaged data either from a group of replicate samples or from the intensity values for a single sample. The line within the box represents the median value for the data set. The ends of the whiskers show the highest and lowest values. If a box and whiskers graph is made from data with a normal distribution, the graph would look like the box plot in Figure 7.14.10. Box plots are helpful for quality control. If we find a box plot with a different median value from the other samples, it could indicate a problem with that sample or array.
a. Locate the Project Info section in the Project Details page and click Boxplot.
A box plot will appear, as shown in Figure 7.14.11A, with plots representing all six of the different conditions. Notice that all six of the plots have similar shapes and similar values.
b. Return to the Project Details page. Locate the bottom section, entitled Group Info.
Each of the conditions in this section has between four and five replicates and a box plot (Fig. 7.14.11B). The box plot link for this section opens a window with a box for each replicate. Click the box plot link for some of the replicates to see if the replicates are similar or if any of the replicates appear to be different from other members of the group.

Figure 7.14.10 A box-and-whiskers plot showing a normal distribution of data (whiskers at the highest and lowest values, box spanning the first to third quartile range, line at the median).

22. Click Analyze This Project to begin the analysis.
The Pattern Navigation page appears. From the Pattern Navigation page, you may view all the genes, or limit the genes to those that meet certain criteria for fold change, statistics, or a certain pattern of expression. Additional options from the Gene Navigation link allow genes to be located by name, chromosome, or accession number, and options from the Gene Function link allow them to be located by ontology or KEGG pathway. Statistics can also be applied to limit the results.
23. Locate the Search by Threshold section and set the threshold choices.
Choose 1.5 for the Threshold, ANOVA for the statistics, and Benjamini and Hochberg to correct for the false discovery rate. Click the Exclude Control Probes checkbox, then click the Search button. Clicking Show All Genes gives 45,101 results. Returning to this page and making the choices listed here cuts the number of genes to 921.
The results page appears after the threshold filtering is complete. At this point, you can either save these results and return or continue the analysis.
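The 1.5-fold threshold applied in step 23 can be made concrete with a short sketch. This is not GSAE's internal code; it assumes log2-transformed group means and counts a gene as passing when any treatment mean differs from the control mean by at least log2(1.5).

```python
# Sketch of a fold-change threshold filter on log2-transformed data.
# How GSAE defines the threshold internally is an assumption; here a
# gene passes if any group mean is at least log2(fold) away from the
# control group mean (a difference of log2 values is a log fold change).
import math

def passes_threshold(group_means, fold=1.5, control=0):
    """group_means: per-condition mean log2 intensities for one gene."""
    cutoff = math.log2(fold)
    ref = group_means[control]
    return any(abs(m - ref) >= cutoff
               for i, m in enumerate(group_means) if i != control)
```

Applied per gene, a filter like this is what reduces the 45,101 probes to a much smaller candidate list before statistics are applied.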
There are a variety of paths we can follow from this point, as shown in Figure 7.14.12. We can view the ontology or KEGG reports, as discussed in Basic Protocol 1. We can also use clustering to group related genes, or we can change the p cutoff value to limit the number of genes even further.

Using clustering to identify patterns of differential gene expression
24. Choose PCA from the Cluster options.
The PCA option performs a type of clustering known as Principal Component Analysis (PCA). PCA allows you to evaluate the similarities between samples by identifying the directions where variation is maximal. The idea behind PCA is that much of the variation in a data set can be explained by a small number of variables.
In Figure 7.14.12, we can see that principal component analysis breaks our conditions up into three groups. One group contains all of the LRD-5001 samples, one group contains the AIN-76A samples, and another group contains the AIN-76A sample that was treated with 100 ppb arsenic. These results tell us that the greatest difference between the groups resulted from the food.

Figure 7.14.11 Box plots from a multiple-condition experiment. (A) Box plots from the six conditions that were compared in Basic Protocol 2. Each plot represents the averaged data from the four to five replicates from each treatment. (B) Box plots from biological replicates. Replicates from the AIN-76A, 0 ppb arsenic samples are shown.

25. Return to the results page and choose Samples from the Cluster options.
These results also show us that the groups of samples are divided by the kinds of food they received.
The mice that ate the LRD-5001 show patterns of gene expression more similar to each other than to the patterns from the mice that ate AIN-76A. We also see that the AIN-76A samples that had 100 ppb of arsenic were more different from the AIN-76A samples without arsenic than the LRD-5001 samples were from each other.

Figure 7.14.12 Analyzing the results from comparing multiple samples (clustered by genes, clustered by sample, PCA, summary, ontology, and KEGG views).

26. Return to the results page and select Genes from the Cluster options.
Clustering by genes produces an image consistent with our earlier results. On the right half, where the mice were fed LRD-5001, the three conditions show similar patterns of expression. The samples in the left half are also similar to each other, although it appears that some genes have changed in the sample with 100 ppb arsenic. If we look more closely at the genes that appear to be up-regulated in the LRD-5001 mice, we can see that many of the genes belong to the cytochrome P450 family.

Examining differential gene expression in a specific gene family
The user may decide to look further at the cytochrome P450s that were induced by LRD-5001 to see if patterns of expression can be discerned.
27. Click Pattern Navigation, located on the right top corner of the page.
28. Locate the Project Analysis section in the bottom half of the page and click Gene Navigation.

Figure 7.14.13 Gene-specific navigation (enter part of a gene name; cluster by gene).

a. Enter the gene symbol in the Name textbox as shown in Figure 7.14.13.
b. Choose Match Any Word from the Option pull-down menu.
c. Choose ANOVA from the Statistics menu.
d. Click the Search button.
A page will appear when the filtering is complete. It will indicate that 20 genes matched this query.
At this point, clustering with the Gene option lets us see which of the cytochrome P450 genes are up-regulated in the presence of LRD-5001. To understand this phenomenon further, we could use the ontology reports and KEGG pathways to learn about the specific roles that these cytochrome P450s play in metabolism and why they might be up-regulated when mice are fed LRD-5001. We could also use a 2-way ANOVA.

ALTERNATE PROTOCOL 2: COMPARE GENE EXPRESSION FROM NEXT-GENERATION DNA SEQUENCING DATA OBTAINED FROM MULTIPLE CONDITIONS

This protocol discusses a general method for analyzing samples from Next Generation DNA sequencing experiments that represent different conditions. In this example, we will compare replicate samples (n = 3) from three different tissues: brain, liver, and muscle. We will also discuss using partitioning to cluster data by the pattern of gene expression.

Necessary Resources

Software
GeneSifter Analysis Edition (GSAE): a trial account must be established in order to upload data files to GSAE; a license for the GeneSifter Analysis Edition may be obtained from Geospiza, Inc. (http://www.geospiza.com). GSAE is accessed over the Web; therefore, Internet access is required along with an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or Apple Safari.

Files
Data files may be uploaded from a variety of sequencing instruments. Illumina GA analyzer data are text files containing FASTA-formatted sequences. Data from the ABI SOLiD instrument are uploaded as csfasta files.
The example data used in this procedure were generated by the Illumina GA Analyzer and obtained from the SRA database at the NCBI (accession SRA001030). The data files are obtained as follows. The accession number SRA001030 is entered in the data set search box at the NCBI Short Read Archive (http://www.ncbi.nih.gov/sra), and the Go button is clicked.
The files are downloaded for each tissue type by clicking the “Download data for this experiment” link. After downloading the data files, the text files containing the FASTA sequences are uploaded to GSAE and processed as described in the instructions.
1. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net).

Uploading data
2. Locate the Import Data heading in the Control Panel and click Upload Tools.
3. Click the Next Gen File Upload button to begin uploading Next Gen data.
4. Enter a name for a folder.
Folders are used to organize Next Gen data sets.
5. Click the Next button.
6. Two windows will appear for managing the upload process. Use the controls in the left window to locate your data files. Once you have found your data files, select them with your mouse and click the blue arrowhead to move those files into the Transfer Queue.
7. Once the files you wish to transfer are in the Transfer Queue, highlight those files and click the blue arrow beneath the Transfer Queue window to begin transferring data.
Transferring data will take a variable amount of time depending on your network, the volume of network traffic, and the amount of data you are transferring. Illumina GA data sets are approximately 250 MB and take at least 10 min to transfer.

Aligning Next Gen data to reference data
Once the data have been uploaded to GSAE, the expression levels for each gene are measured by aligning the read sequences from the data set to a reference data source and counting the number of reads that map to each transcript.
8. Access uploaded Next Gen data sets by clicking Next Gen in the Inventories section of the control panel.
9. Use the checkboxes to select the data sets, then click the Analyze button on the bottom right side of the table.
10. A new page will appear where you can choose analysis settings from pull-down menus.
These settings include the File Type, Analysis Type, Reference Species, and Reference Type. Choose the appropriate setting for each.
a. File Type: The file type is determined by the instrument that was used to collect the data. Since our read data were generated by an Illumina Genome Analyzer, choose Genome Analyzer.
b. Analysis Type: The Analysis Type is determined by the kind of data that were uploaded and the kind of experiment that was performed. This setting also allows you to choose which algorithm to use for the alignment. Choose RNA-Seq (BWA, 2 MM). This setting uses the Burrows-Wheeler alignment algorithm (Li and Durbin, 2009) to align the reads with a tolerance setting of 2 mismatches.
c. Reference Species: The Reference Species is determined by the source of your data. Since our data came from mouse, choose “Mus musculus.”
d. Reference Type: The choices for Reference Type are made available in the Reference Type menu after you have selected the analysis type and reference species. The Reference Type refers to the reference data that will be used in the alignment. Since we are measuring gene expression, pick “mRNA” as the reference type. This reference data set corresponds to the current build for mouse RefSeq RNA.
11. Click the checkbox for “Create Experiment(s) upon completion.”
This selection organizes your data as an experiment, allowing you to compare expression between samples after the analysis step is complete.
12. Click the Analyze button to queue the Next Gen data set for analysis.
The analysis step may take a few hours depending on the size of your data file and the number of samples waiting to be processed. When the analysis has finished, the information on the right side of the table, in the Analysis State column, will change to Complete. When the analysis step is complete, you will be able to view different types of information about your samples.
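The counting-and-normalization idea behind the alignment step above can be sketched as follows. This is a generic illustration, not the GSAE pipeline: the function name and the input format (one transcript ID per aligned read) are assumptions for demonstration.

```python
# Sketch: tally aligned reads per transcript and scale each count to
# reads per million mapped reads (RPM), the normalization referred to
# in the gene lists described below.
from collections import Counter

def reads_per_million(mapped_transcripts):
    """mapped_transcripts: one transcript ID per aligned read."""
    counts = Counter(mapped_transcripts)
    total = sum(counts.values())  # total mapped reads in the sample
    return {tx: n * 1_000_000 / total for tx, n in counts.items()}
```

Scaling to a per-million denominator is what makes counts comparable across samples with different sequencing depths.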
Viewing the Next Gen alignment results
13. Click the file name to get to the analysis details page for your file, then click the Job ID to get the information from the analysis.
The kinds of analysis results obtained depend on the alignment algorithms. The results from processing data from the AB SOLiD instrument are described in Alternate Protocol 1. For Illumina data processed with BWA, we obtain the following kinds of results: gene lists (text and html), a base composition plot, a list of genes formatted for GSAE, a transcript coverage plot, and an analysis log (Fig. 7.14.14).
a. Gene lists (text and html): The gene lists show the number of reads that map to each transcript, and the number mapping to each transcript normalized per million reads. The html version of the gene list includes a graph showing where the reads map, which is linked to a more detailed map with each base position. Links are provided to the NCBI RefSeq record.
b. Base composition plot: This graph shows the numbers of each base at each position and can be helpful for quality control. If sequencing DNA, we would expect the base ratios to be fairly similar; if sequencing single-stranded RNA, we would expect to see more differences.
c. Transcript coverage plot: The transcript coverage plot shows the number of reads that map to different numbers of transcripts. For example, in each case, you can see there are a large number of transcripts that have only one mapping read.

Figure 7.14.14 Illumina data.

Setting up a project
14. To compare multiple samples, begin by setting up a project. Find the Create New section in the Control Panel and click Project.
15. Give the project a title and add a description.
Use “Mouse tissues” for this project.
16. Use the checkboxes to select the arrays that contain your data.
These names correspond to the Array/Gene Set names that you assigned to the data sets during the upload process.
If you checked the correct box, you will see the sample names appear in the Common Conditions box. The conditions that appear should match your experimental treatments.
17. Click the Continue button.
18. Assign a name to this group.
19. Select a normalization method.
Choose None for this example.
20. Use the “Data transformation” menu to select a method for data transformation.
Data transformation options are “no transformation,” “log transformation,” or “already transformed.” Log transformations smooth out the data and produce a more Gaussian distribution. For this example, choose “Log transform data” from the menu.
21. In this next step, you will set the condition order. The first group selected acts as a reference or control group. Changes in gene expression in the other groups are all measured relative to the first group that is chosen.
a. Decide which group is group 1 and enter the name of that group in the Group Name box. To do this, select that group in the Conditions box on the left side, and use the arrow key to move it to the right-hand box.
b. Select the other conditions that you wish to analyze and use the arrow key to move those to the right-hand condition box as well.
c. Click the Create Group button.
A new page will appear with a list of all the groups and samples.
22. Select the samples for each condition.
We will use all the samples, so click Select All Experiments, then click the Create Group button. The processing window will appear while the data are being processed.
23. Once a project has been created, you may analyze the project or create a new project or new group. These steps can also be completed at a later time.

Comparing samples
24. Locate the Analysis section in the Control Panel, select Projects, and find your project in the list.
Once you have found your project in the list, you may wish to select the project name to view some of the project details. You may also wish to view the box plots for these data, as discussed in Basic Protocol 2.

Identifying differential expression
25. To begin the analysis, select Projects from the Analysis section, locate your project in the list, and either select the spyglass or click the name of your project and then click Analyze This Project.
The Project summary page appears. From this page, we can choose to view all the genes or apply filters to locate specific genes by name, chromosome, function, or other distinguishing features.
26. Choose Show All Genes.
It will take a few moments for the results to appear, especially with large data sets. The Project results appear. At this point, we see there are 40,009 results. We will need to apply a threshold and some statistics to select genes that are differentially expressed. The threshold filter allows us to choose the genes that show at least a minimum change in expression. Use a threshold of 1.5 for this project.
27. GSAE offers three types of statistical tests (described below) that can be applied at this point. At least three replicates per group are recommended. A balanced ANOVA can also be carried out when only one factor is varied (such as time or dose) and there are equal numbers of replicates for each sample.
a. A standard 1-way ANOVA: This method is used when there is a normal distribution, the samples show equal variance, and the samples are independent.
b. A 1-way ANOVA for samples with unequal variance: Like the standard 1-way ANOVA, this method assumes a normal distribution and independent, random samples.
c. The Kruskal-Wallis test (nonparametric): This method assumes independent random samples but does not make assumptions about the distribution or variance.
Choose the standard 1-way ANOVA for this analysis.
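To illustrate what the standard 1-way ANOVA in option (a) computes, here is a minimal sketch of the F statistic for a single gene across replicate groups. In practice the p value would come from the F distribution (for example, via scipy.stats.f_oneway); this sketch stops at the statistic itself.

```python
# Sketch of the 1-way ANOVA F statistic for one gene: the ratio of
# between-group variance to within-group variance. A large F suggests
# the group means differ by more than the replicate noise would explain.
def one_way_anova_f(groups):
    """groups: list of lists of (log) expression values, one list per condition."""
    k = len(groups)                         # number of conditions
    n = sum(len(g) for g in groups)         # total number of samples
    grand = sum(x for g in groups for x in g) / n
    # between-group sum of squares (weighted by group size)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # within-group sum of squares (replicate scatter around group means)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Genes whose F statistic is large (equivalently, whose p value is small) are the candidates for differential expression across the three tissues.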
28. After choosing a statistical method, click the Search button.
At this point, there are still over 17,000 results. The advanced analysis methods in GSAE work best with gene numbers under 5000; consequently, we will use some additional filters to reduce the number of genes in the list.
a. Apply a correction to limit false discoveries.
The options are Bonferroni, Holm, and Benjamini and Hochberg. Bonferroni is the most stringent, followed by Holm, with Benjamini and Hochberg allowing more false positives in order to minimize false negatives. Multiple testing corrections are discussed in detail in Basic Protocol 1. Use Benjamini and Hochberg in this example.
b. Apply a p Cutoff.
This sets the maximum p value allowed for a gene to be included in the results. Set the p-value cutoff at 0.01.
c. Set the quality.
For NGS data, the quality corresponds to the number of reads per million reads. Set the quality level at 100 to view highly expressed genes that differ between these three tissues. A quality value of 100 for NGS data corresponds to 100 reads per million sampled.
29. Click the Search button.
30. Now, we have limited the number of genes to 3293. At this point, it is helpful to save the results so that we can easily return to this point. To do this, click Save and enter a name and description for this subset of our project.
When saving your project, it is helpful to enter information about the data transformations or statistical tests that were used during the analysis. For example, if your data were log transformed, or statistical tests or corrections were applied, it helps to enter this information in the description field.
31. A page will appear asking if you wish to continue the analysis or analyze the newly created project. Select “Analyze newly created project” and select Show All Genes from the Project Summary page.
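The Benjamini and Hochberg correction chosen in step 28a can be sketched as a generic step-up procedure. GSAE's exact implementation is not shown in the protocol, so treat this as an illustration of the method, not the tool's code.

```python
# Sketch of the Benjamini-Hochberg step-up procedure: sort the raw
# p values, find the largest rank k with p_(k) <= (k/m) * q, and declare
# the k smallest p values significant at false discovery rate q.
def benjamini_hochberg(p_values, q=0.01):
    """Return a boolean list: True where the gene passes at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    passed = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            passed[idx] = True
    return passed
```

Note the step-up behavior: a p value above its own rank threshold can still pass if a larger p value later satisfies its threshold, which is why this correction admits more genes than Bonferroni or Holm.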
Visualizing the results
Now, we can begin using some of the other analysis features in GSAE. The ontology and KEGG reports were discussed earlier in Basic Protocol 1, and some of the clustering options, such as PCA and clustering by samples or genes, were described earlier in this protocol. We will use clustering by genes here as well, in order to gain insight into the possible numbers of genes with related expression patterns. In this case, clustering by genes suggests that there may be three to four different expression patterns.

Partition clustering
Two of the advanced clustering methods provided in GeneSifter are PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications). Both of these options are variations of k-means clustering. K-means clustering is used to break a set of objects (in this case, genes) into a set of k groups. The clusters are formed by locating samples at the medoids (median values) to act as the seeds and clustering the other genes around the medoids.
In order to use the advanced clustering methods such as PAM or CLARA, filters must be applied to limit the number of genes to below 5000. Two ways to limit the gene number are to set a lower p value as a cutoff and to raise the threshold. These filters can be used separately or in combination.

To use the advanced clustering methods
32. Choose Pattern Navigation from the analysis path.
33. Choose Cluster.
34. Choose a method for clustering and set the options (as described below).
The two options for advanced cluster analysis are PAM and CLARA. The difference between the two methods is that PAM will try to group the samples into the number of clusters that you assign, while CLARA will try to find the optimum number of groups.
PAM is recommended for data sets smaller than 3500 genes, while CLARA is better suited to larger data sets. PAM is also more robust: it tries all possible combinations of genes for the k medoids and picks the best clusters, while CLARA draws a sample (100) and picks the best medoids from that sample.
a. Clusters: The number chosen here determines the number of gene groups. Often people try different values to see which gives the best results.
b. Row Center: The values in this set (Row Mean, None, or Control) are used to determine the centers of each row.
c. Distance: The Distance choices are Euclidean, which corresponds to a straight-line distance; Manhattan, which is a sum of linear distances; and Correlation.
As a starting point for this example, choose PAM with 4 clusters based on our Gene cluster pattern, a Euclidean distance, and the Row Center at the Row Mean.
35. Click the Search button to begin.

Silhouettes
When the clustering process is complete, a page appears with multiple graphs, one for each cluster group. At the top of the page and under each graph are values called “silhouettes.” Silhouette widths are scores that indicate how well the expression of the genes within a cluster matches that graph. Values between 0.26 and 0.50 indicate a weak structure, between 0.50 and 0.70 a reasonable structure, and above 0.70 a strong structure (Kaufman and Rousseeuw, 1990). The mean silhouette value for all the silhouettes appears at the top of the page, with the individual values appearing below each graph along with the number of genes that show that pattern. The graphs showing the average expression pattern within each cluster and the silhouette values for our clusters are shown in Figure 7.14.15. When a graph in GSAE is clicked, the heat map containing the genes represented by the graph will appear.
The first graph shows a pattern that seems a bit different from the results we might expect.
Instead of showing the brain samples with a higher level of expression and liver and muscle lower, the liver and muscle samples appear instead to be up-regulated. This result is puzzling until we look more closely at the results and see that the first silhouette contains 1920 genes, and that the variations in expression levels are small. It is likely that looking at more genes would show us that they do follow the pattern of expression seen in the graph. The other three graphs, with 397, 717, and 277 genes, respectively, match the results that we see in their respective heat maps. These groups also make biological sense. If we look at the genes and read about their function in the ontology and KEGG reports, we can see that, as expected, brain genes are expressed in brain, liver genes in liver, muscle genes in muscle, and some genes in two or more of the tissues examined.
It should be noted that clustering is not a definitive analytical tool. Clustering is used to try to group genes by the expression patterns that we see, and we will often try multiple values for k and different ways of making the clusters. Although the silhouette scores are helpful for evaluating the strength of a group, ultimately, we want to see if the cluster makes biological sense, with genes in a common pathway showing a pattern of coordinate control.

Figure 7.14.15 Partitioning and silhouette data from a Next Gen experiment.

LITERATURE CITED
Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Muertter, R.N., and Edgar, R. 2009. NCBI GEO: Archive for high-throughput functional genomic data. Nucleic Acids Res. 37:D885-D890.
Kaufman, L. and Rousseeuw, P. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics.
John Wiley & Sons, Inc., New York.
Kozul, C.D., Nomikos, A.P., Hampton, T.H., Warnke, L.A., Gosse, J.A., Davey, J.C., Thorpe, J.E., Jackson, B.P., Ihnat, M.A., and Hamilton, J.W. 2008. Laboratory diet profoundly alters gene expression and confounds genomic analysis in mouse liver and lung. Chem. Biol. Interact. 173:129-140.
Li, H. and Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics E-pub May 18.
Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. 2008. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18:1509-1517.
Millenaar, F.F., Okyere, J., May, S.T., van Zanten, M., Voesenek, L.A., and Peeters, A.J. 2006. How to decide? Different methods of calculating gene expression from short oligonucleotide array data will give different results. BMC Bioinformatics 7:137.
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5:621-628.
Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B.B., Siddiqui, A., Lao, K., and Surani, M.A. 2009. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 5:377-382.
Wang, Z., Gerstein, M., and Snyder, M. 2009. RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 10:57-63.
Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., Dicuccio, M., Edgar, R., Federhen, S., Feolo, M., Geer, L.Y., Helmberg, W., Kapustin, Y., Khovayko, O., Landsman, D., Lipman, D.J., Madden, T.L., Maglott, D.R., Miller, V., Ostell, J., Pruitt, K.D., Schuler, G.D., Shumway, M., Sequeira, E., Sherry, S.T., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusov, R.L., Tatusova, T.A., Wagner, L., and Yaschenko, E. 2008.
Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 36:D13-D21.

INTERNET RESOURCES
http://www.geospiza.com/Support/datacenter.shtml
The microarray data center at Geospiza, Inc. A diverse set of microarray data sets and tutorials on using GSAE are available from this page.
http://www.ncbi.nlm.nih.gov/geo/
The NCBI GEO (Gene Expression Omnibus) database. GEO is a convenient place to find both microarray and Next Gen transcriptome data sets.
http://www.ebi.ac.uk/microarray/
The ArrayExpress database from the European Bioinformatics Institute. Both microarray and Next Gen transcriptome data can be obtained here.
http://www.ncbi.nlm.nih.gov/sra/
The NCBI SRA (Short Read Archive) database. Some Next Gen transcriptome data can be obtained here.