Supplemental Information: Materials and Methods and Tables
PLoS Pathogens
Westenberger, et al. “Plasmodium vivax transcriptome analysis reveals divergent in vivo blood stage profiles and species-specific sporozoite gene expression”
Supporting Information

Supplemental Figure Legends

Supplemental Figure S1. Expression of glyceraldehyde 3-phosphate dehydrogenase. Chromosome synteny alignment between P. falciparum and P. vivax taken from PlasmoDB Genome Browser. The probe intensities across all P. vivax expression datasets were averaged to create a synthetic baseline array in the absence of a genomic DNA hybridization. The log ratios of probe intensities for sample CM013-2 versus the synthetic array clearly demonstrate the high expression of the glyceraldehyde 3-phosphate dehydrogenase gene in the expected syntenic region with identical exon structure to the P. falciparum ortholog. Only sense probes are displayed. The lines were drawn by visual inspection.

Supplemental Figure S2. Fold change in probe intensity between sporozoite and asexual stage. Visualization of probe intensity ratios for representative examples of the genes listed in Supplemental Table S4. A) Sporozoite Conserved Orthologous Transcript (SCOT) genes. The gene PF11_0545 was previously annotated on the opposite strand, and both gene models are displayed in the figure. PVX_100850, PVX_111215, and PVX_123120 are highly expressed in sporozoites, similar to their P. falciparum orthologs, but were not included in Table S4, since they have no annotated P. yoelii orthologs. B) Upregulated in P. vivax Sporozoites (UVS) genes. PVX_PF11_0140 is an un-annotated P. vivax ortholog of the P. falciparum gene PF11_0140. This gene was discovered by searching for highly transcribed regions, but was not included in our gene expression analysis. C) Downregulated in P. vivax Sporozoites (DVS) genes.
For each gene represented in Figure 3 and Figure S2, the intensity value for each probe in the sporozoite array hybridization was divided by the intensity value in asexual sample CM013 of asexual Group 2. The log (base 2) of this ratio was then plotted on the y-axis against the chromosome region surrounding the gene on the x-axis. These values therefore represent the log of the fold change in sporozoites relative to asexual samples. Although only 20 probes of similar GC content are used for expression analysis, all unique probes are displayed in this visualization, and some gaps occur in low complexity regions. Probes were colored based on a sliding window analysis of 100 bp, such that windows with average probe log ratio values near 0 are white and values above 1 are bright red or blue. Variability in probe intensity is a function of GC content and of 3’ bias due to RNA amplification with oligo-dT primers. PKH_141170 refers to a small (200 bp) P. knowlesi gene with BLAST identity to highly expressed regions of P. falciparum and P. vivax, suggesting the presence of an unannotated sporozoite transcript.

Supplemental Materials and Methods

Expression data analysis

All probes on the P. vivax tiling array were BLASTed against the coding sequences of the P. vivax genes, and we filtered out all probes that match perfectly (with no mismatches) to more than one place in the P. vivax genome. From the set of uniquely mapped probes we selected probes for expression analysis. Because the array contains probes from both strands, we selected only probes that match the CDS (sense) strand. The mRNA is sense, and the cRNA created by the Affymetrix RNA amplification is antisense; therefore the sense probes detect the antisense cRNA. Many overlapping probes on the P. vivax tiling array cover each gene.
A probe selection algorithm was designed based on the following rules: A) probes with GC content closer to the optimal GC content of 9 are preferred, based on comparison of signals between P. vivax probes and Affymetrix background probes; B) given probes of the same GC content, the probe closer to the 3' end is preferred; C) once a probe has been selected, its overlapping neighboring probes are deprioritized to minimize redundancy.

Let us use gene Pv090050 as an example for the purpose of illustration. Pv090050 is covered by 31 probes with GC counts ranging from 5 to 15 and 5'-distances ranging from 6 to 378. The selection algorithm scores all probes, and the 30 best-scoring probes are shown in Supplemental Info. The best 20 probes are used in the later MOID calculation. The algorithm starts with the 7 probes with a GC count of 9 (Rule A) in order of decreasing 5'-distance (Rule B). It then chooses 3 probes with a GC count of 8 and 2 probes with a GC count of 10 (Rules A, B). The algorithm continues with probes whose GC counts deviate further from 9. At the end, the best 20 non-overlapping probes have GC counts ranging from 7 to 11 and 5'-distances ranging from 6 to 378, covering most of the gene.

To validate the robustness of our probe selection algorithm against sources of bias or variability in GC content, we re-ran our algorithm using several different optimal GC contents and compared the expression results from the different runs using Pearson correlation. We found that using probes with GC=9 gave the most reproducible results, agreeing well with probe sets of similar GC content, from 8 to 10 GC per 25mer oligo probe (r>0.92). In contrast, an optimal GC content of 12 or greater produced high variability (low Pearson correlations, <0.6) that increased with increasing average GC content of the highly expressed genes in each sample. This is due to the nonlinear signal from higher GC content probes, resulting in much higher signal from more GC rich genes.
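The three selection rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Probe` fields, the overlap test, and the choice to push overlapping probes to the back of the ranking (one possible reading of "deprioritized") are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    start: int        # 5'-distance of the probe within the gene (assumed coordinate)
    gc: int           # number of G/C bases in the 25-mer
    length: int = 25  # probe length in bases


def overlaps(a, b):
    """True if the two probes share at least one base."""
    return a.start < b.start + b.length and b.start < a.start + a.length


def select_probes(probes, optimal_gc=9, n_select=20):
    # Rule A: smallest deviation from the optimal GC count first.
    # Rule B: among equal deviations, larger 5'-distance (closer to the 3' end) first.
    ranked = sorted(probes, key=lambda p: (abs(p.gc - optimal_gc), -p.start))
    chosen, deferred = [], []
    for probe in ranked:
        # Rule C (assumed reading): a probe overlapping an already chosen probe
        # is moved to the back of the queue rather than discarded.
        if any(overlaps(probe, c) for c in chosen):
            deferred.append(probe)
        else:
            chosen.append(probe)
    return (chosen + deferred)[:n_select]


# Toy example with hypothetical probes, not data from the array.
toy = [Probe(0, 9), Probe(10, 9), Probe(100, 8), Probe(200, 9), Probe(300, 12)]
picked = select_probes(toy, n_select=3)
```

In this toy example the probe at position 0 overlaps the probe at position 10 and is deferred, so the three selected probes start at positions 200, 10 and 100.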
Therefore, we decided to use the GC9 probe selection algorithm for all future expression analysis to minimize this potential bias. This is similar to our previous probe selection for P. falciparum expression arrays, with an average GC count of 7 in each 25mer. Additional information and analysis can be found on our website: http://carrier.gnf.org/publications/Pv

Background Subtraction

For a probe of given GC content, the probability density function of its background noise, B_GC, is measured from all the background probes sharing the same GC content. That is, the probability that the background noise falls between x and x+dx is B_GC(x)dx. Any observed probe intensity E is the sum of the true signal E_0 and the noise E-E_0. Assuming all possible E_0 values are equally likely, the expected value of the true signal is:

\[ E_0 = \frac{\int_{x=0}^{E} (E - x)\, B_{GC}(x)\, dx}{\int_{x=0}^{E} B_{GC}(x)\, dx} \]

For probe intensities much higher than the typical background signal, the above formula is equivalent to subtracting the mean background signal from E, as in many other background subtraction methods. However, for probe intensities closer to background intensity, the formula provides a probabilistic interpretation of the observation and guarantees that E_0 is never negative. The same background subtraction model has been previously validated(1, 4).

Expression level, standard deviation, and gene-specific noise

For a given gene, signals from probes selected by the probe selection algorithm form an integral intensity distribution. According to the MOID algorithm, the 70th percentile of the intensity distribution is defined as the expression level of the gene(5). For this study, the signal standard deviation (STDV) and noise level are defined using a bootstrapping approach. In an example bootstrapping run for Pv090050, which has 20 selected probes, signals from the 20 probes are sampled with replacement to form a new probe set (Figure 2, red shaded region) and the MOID intensity is calculated.
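The background correction and a single bootstrap run, as described above, can be sketched in Python as follows. This is a hedged illustration only: it replaces the density B_GC with an empirical sample of background intensities, uses a linear-interpolation percentile (the exact MOID percentile rule is not stated here), and returns 0 when no background value lies at or below E. All function names are hypothetical.

```python
import random


def expected_signal(E, background):
    """Discrete analogue of the E_0 formula: the mean of (E - x) over
    background intensities x with 0 <= x <= E (same GC content assumed)."""
    below = [x for x in background if 0 <= x <= E]
    if not below:
        return 0.0  # assumption: no background at or below E means no signal
    return sum(E - x for x in below) / len(below)


def moid_level(intensities, pct=70):
    """Expression level as the pct-th percentile of probe intensities
    (linear interpolation is an assumption)."""
    s = sorted(intensities)
    pos = (len(s) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def bootstrap_run(intensities, rng):
    """One bootstrap run: resample the probes with replacement,
    then recompute the MOID level on the resampled set."""
    resampled = [rng.choice(intensities) for _ in intensities]
    return moid_level(resampled)


# Toy values, not data from the study.
corrected = expected_signal(100, [10, 20, 30, 500])   # mean of 90, 80, 70
level = moid_level(list(range(0, 101, 10)))           # 70th percentile of 0..100
one_run = bootstrap_run([90, 100, 110, 120, 130], random.Random(0))
```

For an intensity far above background, `expected_signal` reduces to E minus the mean background, matching the limiting behaviour described in the text.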
In this study such a bootstrapping calculation is repeated 25 times, yielding a group of MOID intensities S, from which the STDV is calculated. In a bootstrapping run of the noise calculation, each original signal probe is replaced by a randomly selected background probe of the same GC content; that is, a virtual gene with the same GC distribution but containing only non-specific cross hybridization readings is constructed, and its MOID intensity is calculated. In this study we construct 25 virtual genes for any given gene and obtain a group of noise estimates N. S and N are then log-transformed and a t-test is performed to obtain a p-value indicating the statistical significance of the deviation of the signal distribution from the gene-specific background noise.

Normalization is performed as previously described(2). First, genes with at least 6 probes and an intensity value at least 1.5 above their noise estimate are retained. The normalization factor is then determined so that the average of the gene intensities between the 30th and 90th percentiles is scaled to 200. A quantile-based normalization algorithm was also tested, but was not chosen because of its slightly inferior correlation within replicate samples.

Based on the RNA hybridization intensity, we identified genes that were expressed above background in each sample. To determine the criteria for calling a gene “present”, or expressed above background, we created sets of virtual control genes with a GC content similar to that of actual genes. We found that a signal to noise ratio >3.2 and a log10 P value <-2 were the optimal cutoff criteria, which excluded most virtual genes. These criteria resulted in between 4708 and 5347 of the 5417 total genes being detected above background in each sample (Table S2).

As an additional test of the robustness of our MOID interpretation of the expression values, we performed correlations between expression values interpreted at the 65th, 70th, and 75th percentile of probes.
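Two steps from this section, the scaling normalization and the percentile robustness check, can be sketched in Python. This is an illustration under stated assumptions rather than the published code: the percentile rule uses linear interpolation, the prior filter on probe number and noise is omitted, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Linear-interpolation percentile (the interpolation rule is an assumption)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def pearson(x, y):
    """Pearson correlation of two equal-length lists (undefined for constant input)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def scale_factor(gene_levels, low=30, high=90, target=200.0):
    """Factor that scales the mean of the levels lying between the
    low-th and high-th percentiles to the target value."""
    lo_v, hi_v = percentile(gene_levels, low), percentile(gene_levels, high)
    middle = [v for v in gene_levels if lo_v <= v <= hi_v]
    return target / (sum(middle) / len(middle))


def percentile_agreement(probes_by_gene, pct_a=65, pct_b=70):
    """Pearson correlation between gene levels read at two probe percentiles."""
    a = [percentile(p, pct_a) for p in probes_by_gene.values()]
    b = [percentile(p, pct_b) for p in probes_by_gene.values()]
    return pearson(a, b)


# Toy inputs, not data from the study.
factor = scale_factor(list(range(1, 11)))
toy_genes = {g: [k * i for i in range(1, 21)]
             for g, k in {"g1": 1.0, "g2": 5.0, "g3": 50.0}.items()}
r_65_75 = percentile_agreement(toy_genes, 65, 75)
```

A rank-based comparison can be obtained the same way by converting both lists of gene levels to ranks before calling `pearson`.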
We found that the GC9 dataset, which was chosen because of its high correlation with probe selection at GC8 and GC10, was also very highly correlated when different probe percentiles were used. Spearman rank and Pearson correlations greater than 0.9 were found between the 65th and 70th percentiles for all samples. Slightly lower correlations were found for comparisons with the 75th percentile, which reflects the higher variability of GC rich probes and the non-linear hybridization intensity of probes in the higher percentiles. Therefore, we believe that the use of the GC9 probe picking algorithm combined with MOID interpretation at the 70th percentile is robust and reproducible across similar GC contents and probe percentiles, and we used this interpretation of gene expression for all P. vivax RNA hybridizations.

Influence of gametocytes on asexual gene expression profile clustering

We examined the contribution of gametocytes to the clustering of samples and to the definition of asexual groups. All asexual blood samples were filtered to remove gametocytes, but a minority of gametocyte cells remained in the samples after this process. The samples that contained the highest percentage of gametocyte cells were CM115 (14%), CM012 (12%), and CM101 (10%). These samples all show completely different expression profiles as determined by genome-wide Pearson correlations (see above) and OPI clustering (Figure 1), and were found to be unique (CM115), in Group 1 (CM101), and in Group 2 (CM012). Therefore, the contribution of 10-14% gametocytes is not sufficient to result in similar expression phenotypes for these samples. We are confident that the differences we see reflect differences in the gene expression profile of the majority asexual stage parasites.

We performed an analysis of gametocyte gene expression profiles, using a set of 257 genes that were previously assigned to a sexual development cluster (GO:GNF0004) by gene expression analysis of in vitro P. falciparum gametocyte development(6).
These genes did not show consistent upregulation in either of the two asexual clusters. Some genes were higher in Asexual Group 1, and some in Asexual Group 2. The differences in expression between groups were generally small (<150 units) for the majority of genes, compared to the massive upregulation of some genes observed in the P. falciparum in vitro gametocytogenesis expression profiles. Genome-wide Pearson and Spearman rank correlations showed that the Asexual Group 2 samples have a slightly positive correlation (0.2-0.3) with the P. falciparum in vitro gametocytogenesis samples, whereas the Asexual Group 1 samples show little to no correlation, similar to the sporozoite samples. Thus, in the absence of a purified gametocyte gene expression profile to which we could compare our asexual samples containing a minority of gametocytes, we have no way to determine the exact contribution of gametocyte mRNA transcripts from the less than 15% of gametocyte cells to the pool of total RNA hybridized to our microarrays. Further work to characterize gene expression from a variety of P. vivax cell types may address this question.

Comparison of P. vivax in vivo to P. falciparum in vitro asexual expression patterns

P. vivax gene expression data were compared to P. falciparum gene expression data from in vitro synchronized asexual stage parasites as well as gametocytes and sporozoites reported previously(2). Pearson correlations were calculated for 4459 orthologous genes that have expression data for both species. This comparison shows that the P. vivax asexual Group 2 samples CM012 and CM013 are the most similar to P. falciparum in vitro asexual stages, particularly rings (Pearson r=0.35-0.41) and late schizont stages (Pearson r=0.38-0.42). The similarity to late schizont stages may be due to contamination of the late schizont samples with some rapidly developing newly invaded rings late in the timecourse. Our P.
vivax sporozoite sample is also among the most highly correlated with the P. falciparum sporozoite sample, with positive correlation (Pearson r=0.44). Asexual Group 1 samples CM101, CM109 and CM108 showed the lowest correlation values with P. falciparum in vitro (Pearson r<0.12). Sample CM114 was weakly correlated (Pearson r<0.19). CM008 was moderately correlated with early rings (Pearson r=0.33) and late schizonts (Pearson r=0.42). CM115 was moderately correlated with early rings (Pearson r=0.27) and late schizonts (Pearson r=0.28). Therefore, despite the species-specific differences resulting in low overall correlation values, and despite some variation among asexual Group 1 samples, with some showing no correlation and others showing moderate correlation, the highest correlations were between the P. vivax asexual Group 2 samples and the P. falciparum in vitro ring stages, which confirms the microscopic descriptions of the blood stages as predominantly rings. All data are available for download from our website: http://carrier.gnf.org/publications/Pv

Comparison of P. vivax in vivo to P. falciparum in vivo patient sample data

We calculated Pearson correlation coefficients of gene expression values of 4470 orthologous genes from our P. vivax dataset and the P. falciparum in vivo patient gene expression samples analyzed by Daily, et al.(3). We found that asexual Group 2 samples CM012 and CM013 had the highest correlation values overall, with moderate correlation with Daily et al. Cluster 2 (Average Pearson r=0.3), and very low correlation with Cluster 1 (Average Pearson r=0.11). Asexual Group 1 samples CM101, CM108 and CM109 had low correlation with Daily et al. Cluster 1 (Average Pearson r=0.14), but practically no correlation with Daily et al. Cluster 2 (Average Pearson r=0.04). Sample CM114 was equally poorly correlated with both Daily et al. Cluster 1 and Cluster 2 (Average Pearson r=0.13).
Samples CM008 and CM115 were weakly correlated with Daily et al. Cluster 1 (Average Pearson r=0.12), similar to our other P. vivax Group 1 samples, but showed higher correlation with Daily et al. Cluster 2 (Average Pearson r=0.23), which was still lower than the correlation of our P. vivax asexual Group 2 samples with the Daily et al. Cluster 2 samples. Therefore, the greatest similarities were between the glycolytic profile observed in P. falciparum Cluster 2 and P. vivax Group 2. The very low correlations between P. falciparum Cluster 1 and P. vivax Group 2 demonstrate the species-specific differences in this non-glycolytic profile, which are evident in Figure 2, where aerobic respiration genes are not as highly or as consistently upregulated in P. vivax in vivo as they are in P. falciparum in vivo. The variation in correlation between the various P. vivax asexual Group 1 samples and the P. falciparum Cluster 1 samples reflects the variety of cellular responses and differences in gene expression for many uncharacterized genes, many of which have no GO annotation and thus were not adequately represented in Figure 1, which displays the OPI clustered gene expression profiles. All data are available for download from our website: http://carrier.gnf.org/publications/Pv

Transcription of metabolic genes of P. vivax asexual blood stages

To generate the gene expression visualization presented in Figure 2, gene expression values were normalized by subtracting the average expression value across all samples and dividing by the standard deviation of expression values across all samples. The resulting normalized expression values were colored on a scale ranging from -1 to +2, from black for the lowest, through red for the middle, to white for the highest values.
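The per-gene normalization and color scale described above amount to a z-score followed by a clipped color ramp. The Python sketch below is illustrative only: the use of the sample standard deviation, the placement of pure red at the midpoint of the -1 to +2 range, and the linear RGB ramps are assumptions, since the text does not specify them.

```python
import statistics


def zscores(values):
    """Subtract the mean across samples and divide by the standard deviation
    (sample standard deviation assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def to_rgb(z, lo=-1.0, hi=2.0):
    """Map a normalized value to black -> red -> white over [lo, hi].
    Values outside the range are clipped; red sits at the midpoint (assumed)."""
    t = (max(lo, min(hi, z)) - lo) / (hi - lo)  # position on the scale, 0..1
    if t < 0.5:
        return (round(510 * t), 0, 0)           # black to red
    g = round(510 * (t - 0.5))
    return (255, g, g)                          # red to white


# Toy expression values for one gene across three samples.
normalized = zscores([1.0, 2.0, 3.0])
colors = [to_rgb(z) for z in normalized]
```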
The gene ID numbers for glycolysis genes displayed in Figure 2 are: lactate dehydrogenase (Pv116630, PF13_0141); enolase (Pv095015, PF10_0155); glyceraldehyde 3-phosphate dehydrogenase (Pv117321, PF14_0598); fructose 1,6-bisphosphate aldolase, putative (Pv118255, PF14_0425); 2,3-bisphosphoglycerate-dependent phosphoglycerate mutase (Pv091640, PF11_0208); triosephosphate isomerase (Pv118495, PF14_0378).

Gene ID numbers for TCA cycle and aerobic respiration genes are: flavoprotein subunit of succinate dehydrogenase (Pv111005, PF10_0334); malate:quinone oxidoreductase (Pv113980, PFF0815w); IRP-like protein (iron regulatory protein-like) (Pv083005, PF13_0229); fumarate hydratase (Pv099805, PFI1340w); ATP-specific succinyl-CoA synthetase beta subunit (Pv084960, PF14_0295); 2-oxoglutarate dehydrogenase E1 component (Pv089325, PF08_0045); dihydrolipoamide acyltransferase (Pv119310, PFC0170c); succinyl-CoA synthetase alpha subunit (Pv091100, PF11_0097); iron-sulfur subunit of succinate dehydrogenase (Pv123345, PFL0630w); cytochrome C oxidase (Pv099845, PFI1375w); cytochrome c oxidase copper chaperone (Pv111430, PF10_0252); cytochrome c oxidase subunit II precursor (Pv084995, PF14_0288); cytochrome c oxidase assembly protein (heme A: farnesyltransferase) (Pv080280, PFE0970w); cytochrome c oxidase assembly protein (Pv084785, PF14_0331); NADH dehydrogenase reaction protein (Pv119700, PFC0505c).

Comparison of P. vivax in vivo to P. vivax synchronized in vitro culture timecourse gene expression data

We compared our dataset of patient samples and sporozoite sample RNA expression data to previous gene expression data for synchronized P. vivax intraerythrocytic stages cultured in vitro reported by Bozdech et al.(9). Since our dataset is in the form of absolute expression values determined by MOID normalization of hybridization intensity of 20 probes of an Affymetrix microarray, and the Bozdech et al.
dataset is in the form of log ratios of the intensity of labeled RNA from cultured P. vivax cells from each timepoint divided by the intensity of labeled RNA derived from a pool of equal amounts of RNA from all timepoints, simultaneously hybridized to a spotted oligonucleotide array, the absolute values of the two datasets cannot be compared directly using Pearson correlations. Since we do not have representative expression values for all timepoints throughout the intra-erythrocytic life cycle of P. vivax, we cannot determine the average expression of each gene so as to convert our absolute expression values to similar log ratio values. Therefore, we used Spearman rank correlations to compare the rank order of the expression value of each gene relative to all the genes in each dataset. Since the three datasets labeled SMRU1, SMRU2, and SMRU3 report different log ratio values and different total numbers of genes, we determined the Spearman rank correlations separately for each dataset. There were 4113, 2753, and 3879 genes with expression values in datasets SMRU1, SMRU2, and SMRU3, respectively, for which we also determined gene expression values using our microarray. Spearman rank correlations were determined using the full set of all these genes.

Additionally, to focus on genes that are significantly expressed above background on our array and differentially expressed in the Bozdech et al. timecourse datasets, we filtered these gene sets to examine only those genes that showed a maximum expression above 500 units in any of our asexual patient samples, and also showed a change of log ratio greater than 1 log across all timepoint samples of all Bozdech et al. timecourse datasets. This resulted in a set of 425 genes for which Spearman rank correlations were determined between our expression samples and all timepoints of all Bozdech et al. datasets.
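The gene filter and the rank-based comparison described above can be sketched as follows. This Python example is a sketch under assumptions: "a change of log ratio greater than 1 log" is read as the range (maximum minus minimum) across timepoints, tied values receive average ranks, and the data structures and gene names are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def filter_genes(abs_expr, log_ratios, min_expr=500.0, min_range=1.0):
    """Keep genes whose maximum absolute expression exceeds min_expr and whose
    log ratio range across timepoints exceeds min_range (assumed reading)."""
    return [g for g in abs_expr
            if g in log_ratios
            and max(abs_expr[g]) > min_expr
            and max(log_ratios[g]) - min(log_ratios[g]) > min_range]


# Toy data with hypothetical gene names.
abs_expr = {"geneA": [100, 900], "geneB": [50, 80], "geneC": [700, 650]}
log_ratios = {"geneA": [-1.0, 0.8], "geneB": [-2.0, 2.0], "geneC": [0.1, 0.3]}
kept = filter_genes(abs_expr, log_ratios)
rho = spearman([120, 900, 40, 300], [-0.5, 1.2, -1.0, 0.1])
```

Because only the rank order enters the calculation, absolute intensities and log ratios can be compared without converting one scale into the other.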
The Spearman rank correlation values were higher for the set of 425 significantly differentially expressed genes than for the full datasets; however, the samples showing the highest correlation were the same in both analyses. The sporozoite and asexual Group 1 samples showed little to no correlation with the in vitro timepoints, with Spearman rank correlation values less than 0.13 for the full dataset and less than 0.17 for the differentially expressed dataset. The sporozoites do show positive correlation values of 0.21 to 0.29 with SMRU1 TP7, SMRU2 TP6 and SMRU3 TP7 for the 425-gene dataset, which may reflect similar expression of genes needed for cell growth and replication in late trophozoites and early schizonts. The asexual Group 2 samples showed much higher positive correlations with the in vitro timepoints: CM013 showed the highest correlations with TP3 (Spearman rank =0.3 for the full dataset, and 0.5 for the differentially regulated dataset), representing late rings and early trophozoites, while CM012 showed the highest correlations with TP2 ring stages or TP6 trophozoite stages for the three datasets (Spearman rank =0.3). These data are consistent with microscopic evidence showing that our patient blood samples contained mostly rings and early trophozoite stage parasites. All data are available for download from our website: http://carrier.gnf.org/publications/Pv

Comparison to P. vivax cDNA and EST sequencing data

We sought to evaluate patterns of differential gene expression. Previous efforts to assess gene expression utilized sequencing of ESTs or full-length cDNAs from P. vivax infected patient blood samples. However, these methods are not comprehensive and suffer from incomplete depth of coverage of the transcriptome, whereas our analysis provides a quantitative estimate of the global gene expression patterns in P. vivax. Sequencing full-length cDNAs by oligo-capping methods, Watanabe et al. found 11,262 sequences corresponding to 1566 P.
vivax genes(10) expressed in asexual and gametocyte stages found in infected patient blood. Our current microarray analysis confirms expression of a majority of these cDNAs in blood samples. Almost all (98%) of these 1566 genes were found to be present above background levels in at least one of the blood samples, whereas 81% were expressed in either of the two sporozoite samples. Samples CM012 and CM013 in Group 2 had a higher percentage expressed (98%) than the remaining blood samples representing Group 1 (94%).

An advantage of microarrays with multiple probes per gene is that analysis of the distribution of probe intensities for each gene yields an estimate of transcript abundance, which is comparable to read number for EST sequencing, given sufficient depth of sequencing to provide accurate quantitation of all transcripts. For example, Cui et al. generated 22,236 EST sequences from Thai blood samples that contained a mixture of asexual and gametocyte forms(11). We found good concordance between our data and this EST dataset. Since the P. vivax genome had not been sequenced at the time of that publication, ESTs were previously assigned to GenBank sequences. We have re-analyzed these sequences by performing a BLAST search of all ESTs against all P. vivax annotated transcripts. We identified the gene represented by each EST as the top-scoring hit with a match greater than 50 bp long (Table S2). We identified 3543 genes with at least one EST, 2508 with at least two ESTs, 1110 with at least 5 ESTs and 463 with at least 10 ESTs. The numbers of ESTs for genes with fewer ESTs may not accurately reflect gene expression, due to the stochastic sampling of ESTs by shotgun sequencing. Therefore, we performed Pearson and Spearman rank correlations between the Cui et al. data and our expression data for asexual blood stages. The Pearson and Spearman rank correlations were almost the same for all the above subsets of genes.
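The EST re-assignment and the gene counts by EST number described above can be sketched in Python. This is a hypothetical illustration: the tuple layout of a BLAST hit, the function names, and the order of operations (discarding matches of 50 bp or shorter first, then taking the top score per EST) are assumptions, and the identifiers are invented for the example.

```python
from collections import Counter


def assign_ests(hits, min_len=50):
    """hits: iterable of (est_id, gene_id, match_length, score) tuples.
    Each EST is assigned to its top-scoring hit among matches longer than min_len."""
    best = {}
    for est, gene, length, score in hits:
        if length <= min_len:
            continue  # match too short to assign
        if est not in best or score > best[est][1]:
            best[est] = (gene, score)
    return Counter(gene for gene, _ in best.values())


def genes_with_at_least(est_counts, thresholds=(1, 2, 5, 10)):
    """Number of genes supported by at least t ESTs, for each threshold t."""
    return {t: sum(1 for c in est_counts.values() if c >= t) for t in thresholds}


# Toy hits with hypothetical identifiers.
hits = [
    ("est1", "geneA", 120, 200.0),
    ("est1", "geneB", 300, 150.0),   # lower score than the geneA hit for est1
    ("est2", "geneA", 80, 90.0),
    ("est3", "geneB", 40, 500.0),    # discarded: match of 50 bp or shorter
    ("est4", "geneC", 60, 70.0),
]
est_counts = assign_ests(hits)
summary = genes_with_at_least(est_counts)
```

The resulting per-gene EST counts can then be correlated with array expression values for each subset of genes.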
The Group 2 samples consistently produced higher Pearson (average r=0.52) and Spearman rank (r=0.35) correlation coefficients than Group 1 for all subsets of the EST data analyzed. CM101, CM108 and CM109 had the lowest correlations (Average Pearson r=0.11, 0.04, and 0.04; Average Spearman Rank r=0.2, 0.07, 0.08). CM114 showed slightly higher correlations (Spearman Rank r=0.24), and CM115 and CM008 were the best correlated of the asexual Group 1 samples (Pearson r=0.3 and 0.4; Spearman Rank r=0.25 and 0.21). All data are available for download from our website: http://carrier.gnf.org/publications/Pv

Of the top 25 ESTs, four could not be linked to any current gene annotation, but 17 of the remaining 20 with 6 or more probes on the array (note that data for genes with fewer than 6 probes should not be included in this calculation) were found in the top 5% of genes expressed in asexual blood samples. One exception was the gene annotated as encoding a senescence protein (Pv088865), which had the highest number of ESTs assigned to it from the Thai EST dataset, but which showed only moderate expression in our Peruvian samples. This gene is adjacent to an un-annotated ribosomal RNA in P. falciparum on its 3’ end, and BLAST analysis of a highly transcribed region from the 5’ end shows identity to the 35S RNA transcript. A ribosomal RNA gene has recently been annotated in this syntenic region adjacent to Pv088865. Because we find that ribosomal RNA transcripts are often very abundant, even in poly-A primed cDNA preparations, it is likely that the adjacent rRNA promoter contributed to the large number of ESTs assigned to this gene in the Thai samples.

Interestingly, the EST that ranks 4th in the EST project matches P. falciparum glyceraldehyde 3-phosphate dehydrogenase. The P. vivax ortholog was not annotated in PlasmoDB version 5.4, and thus we looked for evidence that it would be found in the region syntenic to the P. falciparum ortholog.
Indeed, we found a very highly transcribed region showing the predicted exon structure in the syntenic location on contig 7179 (Figure S1).

Sporozoite-specific gene expression analysis

Statistical enrichment of motifs in upstream regions of sporozoite-specific genes was determined using GeneSpring 7.3 software, using the “search for regulatory sequences” function with parameters set to search for motifs of 5 to 8 bp within 1200 bp upstream of the coding region. Probability scores were calculated relative to the upstream regions of all genes in the genome, and normalized relative to the local nucleotide frequency. We ran this search for all the genes in the sporozoite-specific OPI cluster (GNF0006) (Table S3). This resulted in 68 of 210 genes with the 7-bp TGCATGC motif (p=2.7e-9) and 106 of 210 genes with the 6-bp TGCATG motif (p=1.1e-4). To identify all potential functional occurrences of this motif that may regulate sporozoite-specific genes, we also searched the upstream regions of the top 200 genes expressed in sporozoites, and of all 245 genes with a fold change greater than 2.5 in sporozoites relative to average asexual expression. In the top 200 genes expressed, we found 33 with the 8-bp TGCATGCA motif (p=1.1e-7) and 100 with the 6-bp TGCATG motif (p=1.4e-4). In the genes with a greater than 2.5 fold increase in sporozoites, we found 36 of 245 with the 6-bp TGCATG motif (p=4.0e-7), 71 of 245 with the 7-bp TGCATGC motif and 36 of 245 with the 8-bp TGCATGCA motif (p=7.1e-7). The total number of occurrences of these motifs in all 430 genes examined was therefore 109 with the 7-bp TGCATGC motif (p=2.0e-7) and 207 with the 6-bp TGCATG motif (p=2.7e-6). Searching the set of 65 SCOT genes (Table S4) found the 6-bp motif in 54 of 65 genes (p=7.2e-14), the 7-bp motif in 30 genes (p=1.5e-8) and the 8-bp motif in 19 genes (p=7.5e-9).
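The counting step behind these numbers, finding which genes carry a motif within 1200 bp upstream and how far each occurrence lies from the start of the open reading frame, can be sketched in Python. This does not reproduce the GeneSpring enrichment statistic or its p-values. It assumes each upstream sequence is given 5' to 3' and ends at the base before the start codon, and it scans the given strand only; the function names and toy sequences are hypothetical.

```python
def motif_distances(upstream, motif):
    """Distances (bp) from the first base of each motif occurrence to the
    start of the open reading frame. Overlapping occurrences are reported."""
    distances = []
    pos = upstream.find(motif)
    while pos != -1:
        distances.append(len(upstream) - pos)
        pos = upstream.find(motif, pos + 1)
    return distances


def genes_with_motif(upstream_by_gene, motif, window=1200):
    """Map each gene carrying the motif in its last `window` upstream bases
    to the list of motif distances."""
    found = {}
    for gene, seq in upstream_by_gene.items():
        distances = motif_distances(seq[-window:].upper(), motif)
        if distances:
            found[gene] = distances
    return found


# Toy upstream sequences with hypothetical gene names.
toy_upstream = {"geneA": "AAATGCATGCAAAA", "geneB": "AAAAAAAAAAAAAA"}
six_bp = genes_with_motif(toy_upstream, "TGCATG")
eight_bp = genes_with_motif(toy_upstream, "TGCATGCA")
```

Collecting the distances across genes gives the distribution that can be inspected for position-specific enrichment relative to the translational start site.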
The eight base version of the motif is identical to one identified by Carlton, et al.(12) in their analysis of 15 genes, and similar to a sporozoite-specific motif identified by Young et al.(7) in P. falciparum. We have listed all occurrences of all versions of this motif in all these genes in Supplemental Table S6. The distance of the motif relative to the start of the gene open reading frame is listed. Examination of the distribution of these distance values showed no position-specific enrichment of this motif relative to the translational start site.

The list of P. vivax sporozoite-specific genes used to seed the OPI cluster includes: S13, MAC/Perforin (Pv000810, PFD0430c); SIAP-1 (Pv000815, PFD0425w); pf52 protein (Pv001015, Pv001020, PFD0215c); ECP1, cysteine protease (Pv003790, PFB0325c); asparagine-rich antigen Pfa35-2 (Pv081485, PFA0280w); S24, hypothetical protein (Pv081555, PFA0205w); TRSP (Pv081560, PFA0200w); TRAP (Pv082735, PF13_0201); S14, hypothetical protein (Pv084410, PFL0370w); S25, kinesin-related protein (Pv084580, PFL0545w); MAEBL (Pv092975, PF11_0486); S1, hypothetical protein (Pv094625, PF10_0083); kinesin-related protein (Pv094710, PFL0545w); conserved hypothetical protein (Pv097795, PFE0230w); hypothetical protein (Pv118360, PF14_0404); circumsporozoite (CS) protein (Pv119355, PFC0210c); early transcribed membrane protein 13, ETRAMP13 (Pv121950, PF13_0012); S23, conserved hypothetical protein (Pv123155, PF08_0088); S4, conserved hypothetical protein (Pv123510, PFL0800c); conserved hypothetical protein (Pv123750, PFL1075w).

Comparison of P. vivax Sporozoite gene expression with P. falciparum and P. yoelii

To compare our P. vivax sporozoite gene expression data to previously published data on P. falciparum(2) and P. yoelii sporozoites(1), we performed Pearson and Spearman rank correlations on orthologous genes using the Matlab statistical software package. Orthologous gene mapping was determined using OrthoMCL version 2.
There are a total of 3498 genes with orthologs in all three species for which we have expression data. These gene sets show equal Pearson correlations of 0.5 for all comparisons of P. vivax versus P. falciparum, P. vivax versus P. yoelii, and P. falciparum versus P. yoelii sporozoite expression datasets. Filtering this set for genes that are significantly differentially expressed among our P. vivax samples (pANOVA<0.05) and show maximum expression in sporozoites leaves 490 genes with gene expression values for orthologs in both of the other two species. This filtered list of 490 orthologous genes is the most appropriate for comparison of sporozoite gene expression, since these genes are differentially expressed and highest in the sporozoite sample. The 490 gene set shows nearly equal Pearson correlations of 0.60 for P. vivax versus P. falciparum, 0.65 for P. vivax versus P. yoelii and 0.64 for P. falciparum versus P. yoelii sporozoite expression datasets. Therefore, despite species-specific gene expression differences, the overall pattern of gene expression is similar for sporozoites of all three species.

P. falciparum salivary gland sporozoites were obtained from Sanaria, Inc. For Fig. 3 and Fig. S1, P. falciparum sporozoite RNA was isolated and amplified using Affymetrix kits as described for P. vivax sporozoite samples. P. falciparum 3D7 strain RNA from in vitro synchronized trophozoite stage parasites was isolated and amplified as described for P. vivax samples. Amplified cRNA was hybridized to the Pftiling array described previously(13). Raw hybridization data for all unique probes were visualized using custom scripts written in Matlab to prepare Figures 3 and S2.

Quantitative RT-PCR of Sporozoite cDNA

To validate the expression comparison of genes that are differentially expressed in sporozoites of P. vivax and P. falciparum, we performed quantitative reverse transcriptase polymerase chain reaction (QRT-PCR) on 22 genes with orthologs in both species, and two P.
vivax specific genes. Primers used are listed below. All primer sets were optimized using genomic DNA from 3D7 strain P. falciparum and Salvador I strain P. vivax at three dilutions of 10 ng/µl, 1 ng/µl and 0.1 ng/µl to ensure that the amplification threshold values accurately reflected the difference in DNA template concentration. All primer sets produced similar threshold Ct values (+/- 1.5 Ct) for both species. An additional aliquot of 150,000 sporozoites each of P. falciparum and P. vivax from Sanaria, Inc. was used to isolate total RNA using Trizol as described previously. This total RNA sample was split equally into three reactions to produce single-stranded cDNA using reverse transcriptase and a T7-Oligo dT primer from the cDNA synthesis kit (Affymetrix) according to the manufacturer's instructions. The single-stranded cDNA was used as template for QRT-PCR reactions using the primer sets presented. To account for variability in the input cDNA between different cDNA reactions from the same species, we normalized the threshold Ct values by the average difference between reactions across all genes. When comparing P. vivax and P. falciparum threshold Ct values, we found that the highest expressed gene (CSP) and lowest expressed gene (Pv117045 zinc finger) in both species showed very similar threshold Ct values, within 1.5 Ct cycles, which was within the error observed by DNA optimization. QRT-PCR reactions were prepared using SYBR GREEN PCR Master Mix (Applied Biosystems) according to the manufacturer's instructions, and were run on an Applied Biosystems TaqMan machine using SDS 2.2.1 software. Threshold Ct values were determined using default settings and automatic threshold determination. All amplification results were manually inspected to ensure that threshold levels were determined within the logarithmic amplification phase of the reaction for accurate determination of Ct values. Fold difference between P.
falciparum and P. vivax QRT-PCR-determined expression values is equal to 2 raised to the power of the difference in Ct values between the two species. QRT-PCR results and primers used are listed in Supplemental Table S5.

Comparison to P. falciparum Sporozoite proteome dataset

Lasonder et al. (2008) identified 478 proteins with at least one peptide, and 349 proteins with at least two peptides, in P. falciparum salivary gland sporozoites (14). We compared these mass spectral counts with our P. falciparum sporozoite gene expression values reported in Zhou et al. (2008), and with the current dataset of orthologous P. vivax sporozoite gene expression. Spearman rank correlation of the full set of 478 proteins with P. falciparum and P. vivax expression was 0.2 and 0.099, respectively. Spearman rank correlation of the set of 349 proteins with at least two peptides with P. falciparum and P. vivax expression was 0.121 and 0.125, respectively. While numerous genes that are in the top 1% of P. falciparum genes expressed in sporozoites are represented by more than 15 spectra, there are many examples of genes expressed at background levels represented by more than 15 spectra. Comparisons with our P. vivax sporozoite transcriptome show less correlation, due to species-specific differences in gene expression and regulation. This low correlation is not unusual given the limited depth of mass spectrometry and the large number of proteins represented by only one or two spectra. Mass spectrometry is not as directly quantitative for all proteins expressed in a single timepoint compared to microarray transcriptional analysis, due to the stochastic nature of ionization and the difficulty in obtaining sufficient mass spectra to assay proteins in moderate or low abundance.
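The rank correlations quoted above compare spectral counts, which contain many tied low values, with continuous expression values. A minimal sketch of a Spearman rank correlation with tie handling is given below. It is written in Python rather than the Matlab used for the analysis, and the function names (`average_ranks`, `spearman`) and the example numbers are hypothetical, chosen only to show how ties among proteins with the same spectral count are ranked.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Made-up example: spectral counts (with a tie) versus expression values.
spectral_counts = [1, 1, 2, 15, 40]
expression = [120.0, 80.0, 300.0, 250.0, 900.0]
rho = spearman(spectral_counts, expression)
```

Because only ranks enter the calculation, the result is insensitive to the very different scales of spectral counts and microarray intensities, which is why a rank correlation suits this kind of comparison.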
There is also not a direct correlation between transcript abundance and protein abundance due to post-transcriptional regulation of gene expression, and delays in translation often result in genes expressed in one life stage appearing as abundant proteins in the following life stage. Previous analysis from our group and others has shown better correlation between transcript abundance in one stage and protein abundance in the following life stage (15, 16). Therefore, a more appropriate comparison would be between midgut sporozoite gene expression and salivary gland sporozoite protein spectra, or between salivary gland sporozoite gene expression and liver stage protein spectra. There is no gene expression data available for P. vivax or P. falciparum midgut sporozoites, or protein expression data for liver stages of P. vivax or P. falciparum. However, we have gene expression data available for midgut sporozoites of P. yoelii (1), covering 419 proteins detected by at least one peptide in the Lasonder et al. salivary gland sporozoite protein dataset, 314 proteins detected by at least two peptides, and 197 proteins detected by at least 5 peptides. We calculate Spearman rank correlations of 0.24 for the full set of 419 proteins with at least one spectrum, 0.2 for the set of 314 proteins with at least two spectra, and 0.174 for the set of 197 proteins with at least 5 spectra. The low correlation may be due to comparison of different types of datasets from different stages of different species. Some of the highest expressed genes do show a good correlation. For example, Thrombospondin repeat associated protein (TRAP/SSP2) (PF13_0201, PY03052) is the 4th highest expressed gene overall in P. yoelii midgut sporozoites and is the 7th highest expressed in the P. falciparum salivary gland sporozoite peptide dataset. Also, SIAP-1, Sporozoite Invasion-Associated Protein 1 (17) (PFD0425w, PY00455) is the 20th highest overall expressed gene in P.
yoelii midgut sporozoites and is the 6th highest in the Lasonder P. falciparum salivary gland sporozoite peptide dataset. Additional data from these stages of P. falciparum or P. vivax may provide better correlation.

Discovery of un-annotated gene expression in P. vivax

To infer gene expression of un-annotated genes in P. vivax, we performed a BLAST search of all P. falciparum and P. knowlesi annotated genes against the P. vivax genome to identify all putative orthologous genes that may not be annotated in P. vivax. The BLAST similarity coordinates were used to define the coding region in P. vivax. We have not validated the coding sequence for proper gene translation, nor have we defined intron-exon boundaries for these genes. These gene boundary definitions were used to pick probes to evaluate the level of gene expression from these regions in the same way as for all other annotated P. vivax genes described earlier. These genes were originally named using the GeneID numbers of their P. falciparum and P. knowlesi orthologs. We have included the gene expression values for these putative genes in Table S2.

We also performed an analysis of all P. vivax RNA microarray hybridization data to identify highly transcribed regions of 50 bp that do not overlap with existing gene annotations. We found a few of these regions, but they appeared to correspond to additional exons, intronic regions, or 5’ or 3’ untranslated regions of existing genes. One additional gene identified by this method is the Pv_PF11_0140 gene displayed in Figure S2. We provide putative P. vivax Gene ID numbers for these genes based on their position relative to existing flanking genes. We provide a list of these new putative gene coordinates in Supplemental Table S7.

We provide the putative gene coding region and amino acid sequence alignment of the hypothetical orthologs of PKH_141170, a SCOT gene highly expressed in sporozoites of P. vivax and P. falciparum. The gene is currently annotated in P.
knowlesi and P. chabaudi, but all other genes are new predictions based on BLAST identity to the PKH_141170 gene.

>MAL13 | | 527065 to 527262 (reverse-complement)
ATGGAAACGATAATATCTCCAGTAATTACTTTACAACAAGCCCCTGTCGTTTATACAACG
ACATATAGAGTTGTACCACAAACAGTTGTATACACCTTTCCAAATAATATCCCTGTCGTT
AAAAATATACATGTGGTTCCTGCACAACAATTATGTCTTAGTTACGCCTATACTTCACCG
GTTACTGTAATAATATAA

>MALPY00640 | | 10172 to 10366 (reverse-complement)
ATGATTACAACAGTTGTATCACGATGGTTTACATTACAGTCAGCCCCAGTTGTTTACACA
ACTACATATAATGTAGTACCGCAAACGGTTGTTTATACTTTTCCTCAAAGTATACCGGTT
ATCAAAAACATTCAAGTATTCCGGCGCCACAGGTTTGTCTTAGCTATGCCTATACTTCTC
CTGTAACAGTAATAA

>PB_RP2745 | | 7929 to 8134 (reverse-complement)
ATGATTACAACAGTCGTATCACCGATGGTTACATTACAGTCAGCCCCAGTTGTTTACACA
ACTACATATAATGTAGTACCGCAAACGGTTGTCTATACTTTTCCTCAAAGCATACCAATT
ATCAAAAACATTCAAGTTATCCCGGCTCCCCAGGTTTGCCTTAGCTATGCCTATACTTCT
CCCGTAACAGTAATAATATAACATAA

>CM000455 | | 578725 to 578929 (reverse-complement)
ATGTTCGCAACAGTGATATCCCCCGTGGTGACGGTGCAGCCCGCGCCAGTTGTTTACACA
ACTACCTACAGTGTCGTGCCACAAACAGTTGTGTGCACGATTCCACAGACCATACCGATT
ATTAAAAATATTCAAGTTATCCCTTCCCAACAAGTATGTCTTAGCTACGCGTACGCCGCG
CCCGTAACGACTTTTATCCTTTAA

CLUSTAL 2.0.10 multiple sequence alignment of PKH_141170 putative orthologs

PKH_141170      MFATVISPVVTVRPAPVVYTTTYSVVPQTVVYTIPQTIPIIKNIQVIPS 50
P. vivax        MFATVISPVVTVQPAPVVYTTTYSVVPQTVVCTIPQTIPIIKNIQVIPS 50
P. falciparum   METIISPVITLQQAPVVYTTTYRVVPQTVVYTFPNNIPVVKNIHVVPA 49
PC102342.00.0   MITTVVSPMVTLQSTPVVYTTTYNVVPQTVVYTFPQTIPIIKNIQVIPA 50
P. berghei      MITTVVSPMVTLQSAPVVYTTTYNVVPQTVVYTFPQSIPIIKNIQVIPA 50
P. yoelii       MITTVVSRWFTLQSAPVVYTTTYNVVPQTVVYTFPQSIPVIKNIQVFRR 48
                :  *::*  .*:: :******** ******* *:*:.**::***:*.

PKH_141170      PQVCLSYSYAAPVTTVIL 67
P. vivax        QQVCLSYAYAAPVTTFIL 67
P. falciparum   QQLCLSYAYTSPVTVII- 65
PC102342.00.0   PQVCLSYAYTSPVTVII- 66
P. berghei      PQVCLSYAYTSPVTVII- 66
P. yoelii       HRFVL--AMETPILLL-- 63
                 :. *  :  :*:  .
We provide the putative DNA and protein sequence of the newly annotated gene Pv096306, and the alignment of its amino acid sequence with putative orthologs in other Plasmodium species.

Pv096306 (Pv_PKH_031410_MAL7P1.105): 2 exons of 86 and 258 bp, similar to Pf with exons of 86 and 229 bp.

>CM000444 | | 608130 to 608645 (reverse-complement)
ATGCCGAACCATAAGACGTCCAGGGGCGAATGCTCCGACTACAACCGATCCAGGTGCTAC
AACCCGAAGGTGCATGTCTCCGGCTGG (splice) TAAGAAGCACCCGCACGTGCGTGATTAGCTGCA
TCGCACTTTACATACACACATGTGCGTTGTGTATATATCCCTCCCAACTGTGTGTCCCCC
CCTTCCTTTTGCAACCAACCTCCCCCTAACACTCTCCTTTAAGAGAAAGGAACAAATAAG
CATTGCTCCCTCAAACCACCTTGCAGG (splice) CACAACATTCAACACGATGAGGCCTACATACAA
AGCTACAACCGAATGCGTGAGTTCTACATGGAGGCGTACCCAACGGAGAGCATCAGCCAG
AAGTACCAGAGTGCCAGGGGGGGTGGGGCTCGAAAGAACCTGTCAGACAAGCGGGTGATT
TTCTACGAAGAGGGCGGGGAGGGCCACTGGGTCACCGAAAGCAGGCGCGCCTTTCGGGAG
GGGGCGCGACGGAGCGGTGAAGTCCACCAAGCAGTGGCAGAGCGGTGA

>Pv096306 Pv_PKH_031410_MAL7P1.105 coding sequence (introns spliced out)
ATGCCGAACCATAAGACGTCCAGGGGCGAATGCTCCGACTACAACCGATCCAGGTGCTAC
AACCCGAAGGTGCATGTCTCCGGCTGGCACAACATTCAACACGATGAGGCCTACATACAA
AGCTACAACCGAATGCGTGAGTTCTACATGGAGGCGTACCCAACGGAGAGCATCAGCCAG
AAGTACCAGAGTGCCAGGGGGGGTGGGGCTCGAAAGAACCTGTCAGACAAGCGGGTGATT
TTCTACGAAGAGGGCGGGGAGGGCCACTGGGTCACCGAAAGCAGGCGCGCCTTTCGGGAG
GGGGCGCGACGGAGCGGTGAAGTCCACCAAGCAGTGGCAGAGCGGTGA

>Pv096306 Pv_PKH_031410_MAL7P1.105 putative protein sequence
MPNHKTSRGECSDYNRSRCYNPKVHVSGW
HNIQHDEAYIQSYNRMREFYMEAYPTESISQKYQSARGGGARKNLSDKRVIFYEEGGEGHWVTESRRAFRE
GARRSGEVHQAVAER

CLUSTAL 2.0.10 multiple sequence alignment of MAL7P1.105 putative orthologs

PB000685.00.0   MSKENKFRDECLDYRRSQNYNPKVHVSGWYDIQNDEDYLEKYNKTKEFYIREYPSNPLEE 60
MAL7P1.105      MSNINTVKIDYSEYRTKRVYNPKVHVSGWYDIQNDLDYIEHYNKEKEFYIAEYPKNKFEK 60
Pv096306new     MPNHKTSRGECSDYNRSRCYNPKVHVSGWHNIQHDEAYIQSYNRMREFYMEAYPTESISQ 60
PKH_031410      MSNHKMSRSEYSDYNRSRCYNPKVHVSGWYNIQHDDAYIQSYNRMRDFYMEAYPTERINQ 60
                *.: :  : :  :*. .: **********::**:*  *:: **: ::**:  **.: :.:

PB000685.00.0   KYQNMS--KYRKNISDKEIIFYQD-HDTTYWETENKSSYKKNN------------- 100
MAL7P1.105      RYKNTNR-KSTKNISDKKVIFYQEGYESDSWLTENKESYKVDETK----------- 104
Pv096306new     KYQSARGGGARKNLSDKRVIFYEEGGEGH-WVTESRRAFREGARRSGEVHQAVAER 115
PKH_031410      KYQSAKGDGARKNLSDKRVIFYEEGGACN-WITENRRAFKEEKRRSEDVCKAVTTE 115
                :*:.       **:***.:***::      * **.: :::

References

1. Zhou Y, et al. (2008) Evidence-Based Annotation of the Malaria Parasite's Genome Using Comparative Expression Profiling. PLoS ONE 3(2):e1570.
2. Le Roch KG, et al. (2003) Discovery of gene function by expression profiling of the malaria parasite life cycle. Science (New York, N.Y.) 301(5639):1503-1508.
3. Daily JP, et al. (2007) Distinct physiological states of Plasmodium falciparum in malaria-infected patients. Nature 450(7172):1091-1095.
4. Kidgell C, et al. (2006) A systematic map of genetic variation in Plasmodium falciparum. PLoS pathogens 2(6):e57.
5. Zhou Y & Abagyan R (2002) Match-only integral distribution (MOID) algorithm for high-density oligonucleotide array analysis. BMC Bioinformatics 3:3.
6. Young JA, et al. (2005) The Plasmodium falciparum sexual development transcriptome: a microarray analysis using ontology-based pattern identification. Molecular and biochemical parasitology 143(1):67-79.
7. Young JA, et al. (2008) In silico discovery of transcription regulatory elements in Plasmodium falciparum. BMC genomics 9:70.
8. Zhou Y, et al. (2005) In silico gene function prediction using ontology-based pattern identification. Bioinformatics (Oxford, England) 21(7):1237-1245.
9. Bozdech Z, et al. (2008) The transcriptome of Plasmodium vivax reveals divergence and diversity of transcriptional regulation in malaria parasites. Proceedings of the National Academy of Sciences of the United States of America 105(42):16290-16295.
10. Watanabe J, Wakaguri H, Sasaki M, Suzuki Y, & Sugano S (2007) Comparasite: a database for comparative study of transcriptomes of parasites defined by full-length cDNAs. Nucleic acids research 35(Database issue):D431-438.
11. Cui L, et al. (2005) Gene discovery in Plasmodium vivax through sequencing of ESTs from mixed blood stages. Molecular and biochemical parasitology 144(1):1-9.
12. Carlton JM, et al. (2008) Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature 455(7214):757-763.
13. Dharia NV, et al. (2009) Use of high-density tiling microarrays to globally identify mutations and elucidate mechanisms of drug resistance in Plasmodium falciparum. Genome biology 10(2):R21.
14. Lasonder E, et al. (2008) Proteomic profiling of Plasmodium sporozoite maturation identifies new proteins essential for parasite development and infectivity. PLoS pathogens 4(10):e1000195.
15. Le Roch KG, et al. (2004) Global analysis of transcript and protein levels across the Plasmodium falciparum life cycle. Genome research 14(11):2308-2318.
16. Mair GR, et al. (2006) Regulation of sexual development of Plasmodium by translational repression. Science (New York, N.Y.) 313(5787):667-669.
17. Siau A, et al. (2008) Temperature shift and host cell contact up-regulate sporozoite expression of Plasmodium falciparum genes involved in hepatocyte infection. PLoS pathogens 4(8):e1000121.
18. Singh AP, et al. (2007) Plasmodium circumsporozoite protein promotes the development of the liver stages of the parasite. Cell 131(3):492-504.
19. Simmons D, Woollett G, Bergin-Cartwright M, Kay D, & Scaife J (1987) A malaria protein exported into a new compartment within the host erythrocyte. The EMBO journal 6(2):485-491.
20. Doolan DL, et al. (1996) Circumventing genetic restriction of protection against malaria with multigene DNA immunization: CD8+ cell-, interferon gamma-, and nitric oxide-dependent immunity. The Journal of experimental medicine 183(4):1739-1746.
21. Aly AS & Matuschewski K (2005) A malarial cysteine protease is necessary for Plasmodium sporozoite egress from oocysts. The Journal of experimental medicine 202(2):225-230.