Supplemental Information: Materials and Methods and Tables
Westenberger, et al. “Plasmodium vivax transcriptome analysis reveals divergent in vivo
blood stage profiles and species-specific sporozoite gene expression”
Supplemental Figure Legends
Supplemental Figure S1. Expression of glyceraldehyde 3-phosphate dehydrogenase.
Chromosome synteny alignment between P. falciparum and P. vivax taken from
PlasmoDB Genome Browser. The probe intensities across all P. vivax expression datasets
were averaged to create a synthetic baseline array in the absence of a genomic DNA
hybridization. The log ratios of probe intensities for sample CM013-2 versus the
synthetic array clearly demonstrate the high expression of the glyceraldehyde 3-phosphate dehydrogenase gene in the expected syntenic region with identical exon
structure to the P. falciparum ortholog. Only sense probes are displayed. The lines were
drawn by visual inspection.
Supplemental Figure S2. Fold change in probe intensity between sporozoite and asexual
Visualization of probe intensity ratios for representative examples of the genes listed in
Supplemental Table S4. A) Sporozoite Conserved Orthologous Transcript (SCOT) genes.
The gene PF11_0545 was previously annotated on the opposite strand, and both gene
models are displayed in the figure. PVX_100850, PVX_111215, and PVX_123120 are
highly expressed in sporozoites, similar to their P. falciparum orthologs, but were not
included in Table S4, since they have no annotated P. yoelii orthologs. B) Upregulated in
P. vivax Sporozoites (UVS) genes. PVX_PF11_0140 is an un-annotated P. vivax ortholog
of the P. falciparum gene PF11_0140. This gene was discovered by searching for highly
transcribed regions, but was not included in our gene expression analysis. C)
Downregulated in P. vivax Sporozoites (DVS) genes. For each gene represented in Figure
3 and Figure S2, the intensity value for each probe in the sporozoite array hybridization
was divided by the intensity value in asexual sample CM013 of asexual Group 2. Then
the log (base 2) of this ratio was plotted on the y-axis along the chromosome region on
the x-axis surrounding the gene. Therefore these values represent the log of the fold
change in sporozoites relative to asexual samples. While 20 probes of similar GC content
are used for expression analysis, all unique probes are displayed in this visualization, and
some gaps occur in low complexity regions. Probes were colored based on a sliding
window analysis of 100bp, such that windows with average probe log ratio values near 0
are white and values above 1 are bright red or blue. Variability in probe intensity is a
function of GC content, and 3’ bias due to RNA amplification with oligo-dT primers.
PKH_141170 refers to a small (200bp) P. knowlesi gene with BLAST identity to highly
expressed regions of P. falciparum and P. vivax, suggesting the presence of an unannotated sporozoite transcript.
Supplemental Materials and Methods
Expression data analysis
All probes on the P. vivax tiling array were BLASTed against the coding
sequences of the P. vivax genes, and we then filtered out all probes that match perfectly
(with no mismatches) to more than one place in the P. vivax genome. From the set of
uniquely mapped probes we selected probes for expression analysis.
Because the array contains probes from both strands, we selected only probes that match
the sense strand of the CDS. The mRNA is sense, whereas the cRNA created by the Affymetrix RNA
amplification is antisense; therefore the sense probes detect the antisense cRNA.
Many overlapping probes on the P. vivax tiling array cover each gene. A probe
selection algorithm was designed based on the following rules: A) probes with GC
content closer to the optimal GC content of 9 are preferred, based on comparison of signals
between P. vivax probes and Affymetrix background probes; B) given probes of the same
GC content, the probe closer to the 3' end is preferred; C) once a probe has been selected,
its overlapping neighboring probes are deprioritized to minimize redundancy.
Let us use gene Pv090050 as an example for the purpose of illustration. Pv090050
consists of 31 probes with their GC counts ranging from 5 to 15, and 5'-distance ranging
from 6 to 378. The selection algorithm scores all probes and the 30 best-scoring probes
are shown in Supplemental Info. The best 20 probes are used in later MOID calculation.
The algorithm starts with 7 probes of GC count of 9 (Rule A) in the order of decreasing 5'
distance (Rule B). Then it chooses 3 probes of GC count of 8 and 2 probes of GC count 10
(Rule A, B). The algorithm continues with probes of GC counts deviating further from 9.
At the end, the best 20 non-overlapping probes have GC counts ranging from 7 to 11, with
5'-distance ranging from 6 to 378, covering most of the gene regions.
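The three selection rules can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the script used in the study: the `Probe` class, the `select_probes` function, the tie-break between GC counts equally distant from 9, and the fallback to overlapping probes when too few non-overlapping probes exist are all hypothetical.

```python
from dataclasses import dataclass

PROBE_LEN = 25   # 25-mer probes, as stated in the text
OPTIMAL_GC = 9   # optimal GC count reported in the text


@dataclass(frozen=True)
class Probe:
    start: int   # distance of the probe start from the 5' end of the gene
    gc: int      # number of G/C bases in the probe


def overlaps(a, b):
    """True if two probes share at least one base (assumed definition)."""
    return abs(a.start - b.start) < PROBE_LEN


def select_probes(probes, n_best=20):
    # Rule A: smaller |GC - 9| first.
    # Rule B: among equal deviations, larger 5' distance (closer to 3' end) first.
    ranked = sorted(probes, key=lambda p: (abs(p.gc - OPTIMAL_GC), -p.start))
    chosen, deferred = [], []
    for p in ranked:
        # Rule C: deprioritize probes overlapping an already chosen probe.
        if any(overlaps(p, c) for c in chosen):
            deferred.append(p)
        else:
            chosen.append(p)
    # Assumption: deferred probes are used only to fill remaining slots.
    return (chosen + deferred)[:n_best]
```

Sorting on the pair (GC deviation, negative 5' distance) applies Rules A and B in a single pass, and the deferred list keeps overlapping probes available without letting them displace non-overlapping ones.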
To validate the robustness of our probe selection algorithm against sources of bias
or variability in GC content, we re-ran our algorithm using multiple different optimal GC
contents and correlated the expression results from the different settings using
Pearson correlation. We found that using probes with GC=9 gave the most reproducible
results when compared with probe sets of similar GC content (8-10 GC per 25mer oligo probe; r>0.92),
whereas GC=12 or greater produced high variability (low Pearson correlations, <0.6) that
increased with increasing average GC content of the highly expressed genes in each
sample. This is due to the nonlinear signal from higher GC content probes, resulting in
much higher signal from more GC rich genes. Therefore, we decided to use the GC9
probe selection algorithm for all future expression analysis to minimize this potential
bias. This is similar to our previous probe selection for P. falciparum expression arrays
with average GC count of 7 in each 25mer. Additional information and analysis can be
found on our website http://carrier.gnf.org/publications/Pv
For a probe of given GC content, the probability density function of its
background noise BGC is measured by all the background probes sharing the same GC
content. That is, the probability that the background noise occurs at a level between x and
x+dx is denoted by BGC(x)dx. Any observed probe intensity E is the sum of both true
signal E0 and noise E-E0. Assuming all possible E0 values are equally likely, the expected
value of true signal is:
\[ \langle E_0 \rangle = \frac{\int_0^{E} (E - x)\, B_{GC}(x)\, dx}{\int_0^{E} B_{GC}(x)\, dx} . \]
For probe intensities much higher than the typical background signal, the above formula is
equivalent to subtracting the mean of the background signal from E, as in many other
background subtraction methods. However, for probe intensities closer to the background
intensity, the formula provides a probabilistic interpretation of the observation and
guarantees a positive value of E0. The same background subtraction model has been
previously validated(1, 4).
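As a rough numerical illustration of this background model, the sketch below replaces the density BGC with the empirical distribution of background probe readings at one GC count. It assumes that the noise cannot exceed the observed intensity, so only background readings at or below E contribute; the function name and the zero returned when no reading qualifies are hypothetical choices, not taken from the study.

```python
def expected_true_signal(observed, background):
    """Mean of (observed - x) over background readings x <= observed.

    `background` holds intensities of background probes that share the
    GC count of the probe being corrected.
    """
    admissible = [x for x in background if x <= observed]
    if not admissible:
        # Assumption: a reading below every background value carries no signal.
        return 0.0
    return sum(observed - x for x in admissible) / len(admissible)


background = [10.0, 20.0, 30.0, 40.0]
high = expected_true_signal(1000.0, background)  # far above background
low = expected_true_signal(25.0, background)     # within the background range
```

For the bright probe the result equals the observed value minus the mean background (1000 - 25), while for the dim probe the estimate stays positive because only the two background readings below 25 are averaged.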
Expression level, standard deviation, and gene-specific noise
For a given gene, signals from probes selected by the probe selection algorithm
form an integral intensity distribution. According to the MOID algorithm, the 70th
percentile of the intensity distribution is defined as the expression level of the gene(5).
For this study, signal standard deviation (STDV) and noise level are defined based on a
bootstrapping approach. In an example bootstrapping run of Pv090050 consisting of 20
selected probes, signals from the 20 probes are sampled with replacement to form a new
probe set (Figure 2, red shaded region) and the MOID intensity is calculated. In this study
such a bootstrapping calculation is repeated 25 times and results in a group of MOID
intensities S, from which STDV is calculated. In a bootstrapping run of the noise
calculation, each original signal probe is replaced by a randomly selected background
probe of the same GC content, i.e., a virtual gene of the same GC distribution but only
containing non-specific cross hybridization readings is constructed and the MOID
intensity is calculated. In this study we construct 25 virtual genes for any given gene and
obtain a group of noise estimates N. S and N are then log-transformed and a t-test is
performed to obtain a p-value indicating the statistical significance of the deviation of the signal
distribution from the gene-specific background noise.
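The bootstrapping procedure can be sketched in Python as follows. This is an illustrative outline only: the linear-interpolation percentile, the function names, and the use of a Welch (unequal variance) t statistic are assumptions, since the text does not specify the percentile rule or the t-test variant, and the conversion of the statistic to a p-value is omitted.

```python
import math
import random
import statistics


def percentile(values, pct):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bootstrap_signal(intensities, runs=25, pct=70, rng=random):
    """Resample probes with replacement; one expression level per run."""
    n = len(intensities)
    return [percentile([rng.choice(intensities) for _ in range(n)], pct)
            for _ in range(runs)]


def bootstrap_noise(probe_gc, background_by_gc, runs=25, pct=70, rng=random):
    """Build virtual genes from background probes matched on GC count."""
    return [percentile([rng.choice(background_by_gc[gc]) for gc in probe_gc], pct)
            for _ in range(runs)]


def welch_t(signal, noise):
    """Welch t statistic on log-transformed values (needs nonzero variance)."""
    s = [math.log(v) for v in signal]
    n = [math.log(v) for v in noise]
    se = math.sqrt(statistics.variance(s) / len(s)
                   + statistics.variance(n) / len(n))
    return (statistics.mean(s) - statistics.mean(n)) / se
```

The standard deviation of the list returned by `bootstrap_signal` corresponds to STDV, and the two lists play the roles of S and N in the comparison described above.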
Normalization is performed as previously described(2). First, genes with at least 6
probes and an intensity value at least 1.5 above their noise estimate are retained. The
normalization factor is determined so that the average of gene intensities between the 30th and
90th percentiles is scaled to 200. A quantile-based normalization algorithm was also tested,
but not chosen because of its slightly inferior correlation within replicate samples.
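The scaling step can be sketched as below. The sketch assumes the gene filter has already been applied and that the percentile window is taken by simple index slicing of the sorted intensities; the function names are hypothetical and the exact percentile convention used in the study is not stated.

```python
def normalization_factor(gene_intensities, target=200.0, lo_pct=30, hi_pct=90):
    """Factor that scales the mean of the 30th-90th percentile window to target."""
    ordered = sorted(gene_intensities)
    n = len(ordered)
    window = ordered[int(n * lo_pct / 100):int(n * hi_pct / 100)]
    if not window:
        raise ValueError("too few genes to define the percentile window")
    return target / (sum(window) / len(window))


def normalize(gene_intensities):
    """Apply the same factor to every gene in one sample."""
    factor = normalization_factor(gene_intensities)
    return [v * factor for v in gene_intensities]
```

Using a trimmed window keeps the factor from being driven by the few most highly expressed genes or by genes near background.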
Based on the RNA hybridization intensity, we identified genes that were
expressed above background in each sample. To determine the criteria for calling a gene
as “present” or expressed above background, we created sets of virtual control genes with
a similar GC content to actual genes. We found that a signal to noise ratio >3.2 and a
log10P value <-2 were the optimal cutoff criteria which excluded most virtual genes. These
criteria resulted in between 4708 and 5347 of the 5417 total genes detected above
background in each sample (Table S2).
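The two cutoffs can be written as a simple predicate. The sketch below is illustrative only; the function names and the tuple layout are hypothetical, and the thresholds are the ones quoted above.

```python
def is_present(signal, noise, log10_p, min_ratio=3.2, max_log10_p=-2.0):
    """Apply both cutoffs from the text to one gene in one sample."""
    return noise > 0 and signal / noise > min_ratio and log10_p < max_log10_p


def count_present(genes):
    """genes: iterable of (signal, noise, log10_p) tuples for one sample."""
    return sum(is_present(s, n, p) for s, n, p in genes)
```

Applying `count_present` to the virtual control genes and to the real genes of a sample gives the two numbers needed to judge how well a pair of cutoffs separates them.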
As an additional test of the robustness of our MOID interpretation of the
expression values, we performed correlations between expression values interpreted at
the 65th, 70th, and 75th percentile of probes. We found that the GC9 dataset that was
chosen due to its high correlation with probe selection at GC8 and GC10, was also very
highly correlated when different probe percentiles were used. Spearman Rank and
Pearson correlations greater than 0.9 were found between 65th and 70th percentile for all
samples. Slightly lower correlations were found for comparisons with 75th percentile,
which reflects the higher variability of GC rich probes, and the non-linear hybridization
intensity of probes in the higher percentiles. Therefore, we believe that the use of the
GC9 probe picking algorithm combined with MOID interpretation at the 70th percentile is
robust and reproducible across similar GC contents and probe percentiles, and we used this
interpretation of gene expression for all P. vivax RNA hybridizations.
Influence of gametocytes on asexual gene expression profile clustering
We examined the contribution of gametocytes to the clustering of samples and to
the definition of asexual groups. All asexual blood samples were filtered to remove
gametocytes, but a minority of gametocyte cells remained in the
samples after this process. The samples that contained the highest percentage of gametocyte cells were
CM115 (14%), CM012 (12%), and CM101 (10%). These samples all show completely
different expression profiles as determined by genome-wide Pearson correlations (see
above) and OPI clustering (Figure 1), and were found to be unique (CM115), in Group
1 (CM101), and in Group 2 (CM012). Therefore, the contribution of 10-14% gametocytes is
not sufficient to result in similar expression phenotypes for these samples. We are
confident that the differences we see reflect differences in the majority asexual stage
parasites’ gene expression profile.
We performed an analysis of gametocyte gene expression profiles, using a set of
257 genes that were previously assigned to a sexual development cluster (GO:GNF0004)
by gene expression analysis of P. falciparum gametocytes in vitro development(6). These
genes did not show consistent upregulation in either of the two asexual clusters. Some
genes were higher in Asexual Group 1, and some in Asexual Group 2. The differences in
expression between groups were generally small (<150 units) for the majority of genes,
compared to the massive upregulation of some genes observed in the P. falciparum in
vitro gametocytogenesis expression profiles. Genome-wide Pearson and Spearman rank
correlations showed that the Asexual Group 2 samples have a slightly higher positive correlation
(0.2-0.3) with the P. falciparum in vitro gametocytogenesis samples than the Asexual Group 1
samples, which show little to no correlation, similar to the sporozoite samples. So in the
absence of a purified gametocyte gene expression profile to which we could compare our
asexual samples with a minority of gametocytes, we have no way to determine the exact
contribution of gametocyte mRNA transcripts from the less than 15% of gametocyte cells
to the pool of total RNA hybridized to our microarrays. Further work to characterize gene
expression from a variety of P. vivax cell types may address this question.
Comparison of P. vivax in vivo to P. falciparum in vitro asexual expression patterns
P. vivax gene expression data were compared to P. falciparum gene expression
data from in vitro synchronized asexual stage parasites as well as gametocytes and
sporozoites reported previously(2). Pearson correlations were calculated for 4459
orthologous genes that have expression data for both species. This comparison shows that
the P. vivax asexual group 2 samples CM012 and CM013 are the most similar to P.
falciparum in vitro asexual stages, particularly rings (Pearson r=0.35-0.41) and late
schizont stages (Pearson r=0.38-0.42). The similarity to late schizont stages may be due
to contamination of the late schizont samples with some rapidly developing newly
invaded rings late in the timecourse. Our P. vivax sporozoite sample is also among the
most highly correlated with the P. falciparum sporozoite sample, with positive correlation
(Pearson r=0.44). Asexual Group 1 samples CM101, CM109 and CM108 showed the
lowest correlation values with P. falciparum in vitro (Pearson r<0.12). Sample CM114
was weakly correlated (Pearson r<0.19). CM008 was moderately correlated with early
rings (Pearson r=0.33) and late schizonts (Pearson r=0.42). CM115 was moderately
correlated with early rings (Pearson r=0.27) and late schizonts (Pearson r=0.28).
Therefore, despite the species specific differences resulting in low overall correlation
values, and despite some variation among asexual Group 1 samples, with some showing
no correlation and others showing moderate correlation, the highest correlations were
between the P. vivax asexual Group 2 samples and the P. falciparum in vitro ring stages,
which confirms the microscopic descriptions of the blood stages as predominantly rings.
All data are available for download from our website.
Comparison of P. vivax in vivo to P. falciparum in vivo patient sample data
We calculated Pearson correlation coefficients of gene expression values of 4470
orthologous genes from our P. vivax dataset and the P. falciparum in vivo patient gene
expression samples analyzed by Daily, et al.(3). We found that asexual group 2 samples
CM012 and CM013 had the highest correlation values overall, with moderate correlation
with Daily et al. cluster 2 (Average Pearson r=0.3), and very low correlation with Cluster
1 (Average Pearson r=0.11). Asexual group 1 samples CM101, CM108 and CM109 had
low correlation with Daily et al. Cluster 1 (Average Pearson r=0.14), but practically no
correlation with Daily Cluster 2 (Average Pearson r=0.04). Sample CM114 was equally
poorly correlated with both Daily et al. cluster 1 and 2 (Average Pearson r=0.13).
Samples CM008 and CM115 were weakly correlated with Daily Cluster 1 (Average
Pearson r=0.12), similar to our other P. vivax Group 1 samples, but showed higher
correlation to Daily Cluster 2 (Average Pearson r=0.23), which was lower than the
correlation of our P. vivax asexual Group 2 samples with the Daily Cluster 2 samples.
Therefore, the greatest similarities were between the glycolytic profile observed in the P.
falciparum cluster 2 and the P. vivax Group 2. The very low correlations of the P.
falciparum cluster 1 and P. vivax Group 2 demonstrate the species-specific differences in
this non-glycolytic profile, which are evident in Figure 2, wherein we saw that aerobic
respiration genes are not as highly or as consistently upregulated in P. vivax in vivo, as
they are in P. falciparum in vivo. The variation in correlation between the various P.
vivax asexual Group 1 samples with the P. falciparum Cluster 1 samples displays the
variety of cellular responses, and differences in gene expression, for many
uncharacterized genes, many of which do not contain GO annotation, and thus were not
adequately represented in our Figure 1 displaying the OPI clustered gene expression
profiles. All data are available for download from our website.
Transcription of metabolic genes of P. vivax asexual blood stages
To generate the gene expression visualization presented in Figure 2, gene
expression values were normalized by subtracting the average expression value across all
samples and dividing by the standard deviation of expression values across all samples.
The resulting normalized expression values were colored on a scale ranging from -1 to
+2, with black for the lowest, red for the middle, and white for the highest values. The
gene ID numbers for glycolysis genes displayed in Figure 2 are: lactate dehydrogenase
(Pv116630, PF13_0141); enolase (Pv095015, PF10_0155); glyceraldehyde 3-phosphate
dehydrogenase (Pv117321, PF14_0598); fructose 1,6-bisphosphate aldolase, putative
(Pv118255, PF14_0425); 2,3-bisphosphoglycerate-dependent phosphoglycerate mutase
(Pv091640, PF11_0208); triosephosphate isomerase (Pv118495, PF14_0378). Gene ID
numbers for TCA cycle and aerobic respiration genes are: flavoprotein subunit of
succinate dehydrogenase (Pv111005, PF10_0334); malate:quinone oxidoreductase
(Pv113980, PFF0815w); IRP-like protein (iron regulatory protein-like) (Pv083005,
PF13_0229); fumarate hydratase (Pv099805, PFI1340w); ATP-specific succinyl-CoA
synthetase beta subunit (Pv084960, PF14_0295); 2-oxoglutarate dehydrogenase E1
component (Pv089325, PF08_0045); dihydrolipoamide acyltransferase (Pv119310,
PFC0170c); succinyl-CoA synthetase alpha subunit (Pv091100, PF11_0097); iron-sulfur
subunit of succinate dehydrogenase (Pv123345, PFL0630w); cytochrome C oxidase
(Pv099845, PFI1375w); cytochrome c oxidase copper chaperone (Pv111430,
PF10_0252); cytochrome c oxidase subunit II precursor (Pv084995, PF14_0288);
cytochrome c oxidase assembly protein (heme A: farnesyltransferase) (Pv080280,
PFE0970w); cytochrome c oxidase assembly protein (Pv084785, PF14_0331); NADH
dehydrogenase reaction protein (Pv119700, PFC0505c).
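The per-gene normalization used for the Figure 2 visualization (subtract the mean across samples, divide by the standard deviation across samples, then map onto the -1 to +2 color range) can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation rather than the population standard deviation is an assumption.

```python
import statistics


def standardize(values):
    """Center one gene's expression values and scale by their spread."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]


def clip_to_color_scale(z, lo=-1.0, hi=2.0):
    """Limit a normalized value to the displayed color range."""
    return max(lo, min(hi, z))


# One gene measured in three samples (made-up values for illustration).
z_scores = standardize([100, 200, 300])
colors = [clip_to_color_scale(z) for z in z_scores]
```

Values outside the displayed range are simply drawn with the color of the nearest end of the scale.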
Comparison of P. vivax in vivo to P. vivax synchronized in vitro culture timecourse
gene expression data
We compared our dataset of patient samples and sporozoite sample RNA
expression data to previous gene expression data for synchronized P. vivax intraerythrocytic stages cultured in vitro reported by Bozdech et al(9). Since our dataset is in
the form of absolute expression values determined by MOID normalization of
hybridization intensity of 20 probes of an Affymetrix microarray, and the Bozdech et al.
dataset is in the form of log ratios of the intensity of labeled RNA from cultured P. vivax
cells from each timepoint divided by the intensity of labeled RNA derived from a pool of
equal amounts of RNA from all timepoints that are simultaneously hybridized to a
spotted oligonucleotide array, the absolute values of the two datasets are not directly
comparable by using Pearson correlations. Since we do not have representative
expression values for all timepoints throughout the intra-erythrocytic life cycle of P.
vivax, we cannot determine the average expression of each gene so as to convert our
absolute expression values to similar log ratio values. Therefore, we used Spearman rank
correlations to compare the relative rank order of the expression value of each gene
relative to all the genes in each dataset. Since each of the three datasets labeled SMRU1,
SMRU2, and SMRU3 report different log ratio values and different total numbers of
genes, we determined the Spearman Rank correlations separately for each dataset. We
found that there were 4113, 2753, and 3879 genes with expression values in datasets
SMRU1, SMRU2, and SMRU3, respectively, for which we also determined gene expression
values using our microarray. Spearman rank correlations were determined using the
full set of all these genes. Additionally, to focus on genes that are significantly expressed
above background on our array and differentially expressed in the Bozdech et al
timecourse datasets, we filtered these gene sets to examine only those genes that showed
a maximum expression in any of our asexual patient samples above 500 units, and also
showed a change of log ratio greater than 1 log across all timepoint samples of all
Bozdech et al. timecourse datasets. This resulted in a set of 425 genes for which
Spearman rank correlations were determined between our expression samples and all
timepoints of all Bozdech et al. datasets. The Spearman rank correlation values were
higher for the set of 425 genes that were significantly differentially expressed, compared
to the full datasets; however, the samples showing the highest correlation were the same
by both analyses. The sporozoite and asexual Group 1 samples showed little to no
correlation to the in vitro timepoints with Spearman rank correlation values less than 0.13
for the full dataset, and less than 0.17 for the differentially expressed dataset. The
sporozoites do show positive correlation values of 0.21 to 0.29 with SMRU1 TP7,
SMRU2 TP6 and SMRU3 TP7 for the 425-gene dataset, which may reflect similar
expression of genes needed for cell growth and replication in late trophozoites and early
schizonts. The asexual group 2 samples showed much higher positive correlations with
the in vitro timepoints, with CM013 showing highest correlations with TP3 (Spearman
rank =0.3 for full dataset, and 0.5 for differentially regulated dataset) representing late
rings and early trophozoites, while CM012 showed highest correlations with TP2 ring
stages or TP6 trophozoite stages, for the three datasets (Spearman rank =0.3). These data
are consistent with microscopic evidence showing that our patient blood samples contained
mostly rings and early trophozoite stage parasites. All data are available for download
from our website: http://carrier.gnf.org/publications/Pv
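Because Spearman rank correlation depends only on the rank order of genes within each dataset, it can compare absolute expression values with log ratios. The sketch below shows one standard way to compute it (Pearson correlation of average ranks, with ties sharing their mean rank); the function names are hypothetical and this is not the software used in the study.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; inputs must have nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation of two equally long value lists."""
    return pearson(average_ranks(x), average_ranks(y))
```

In this setting `x` would hold the absolute expression values of the shared genes in one patient sample and `y` the log ratios of the same genes at one in vitro timepoint.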
Comparison to P. vivax cDNA and EST sequencing data
We sought to evaluate patterns of differential gene expression. Previous efforts to
assess gene expression utilized sequencing of ESTs or full-length cDNAs from P. vivax
infected patient blood samples. However, these methods are not comprehensive and
suffer from incomplete depth of coverage of the transcriptome, whereas our analysis
provides a quantitative estimate of the global gene expression patterns in P. vivax.
Sequencing full-length cDNAs by oligo-capping methods, Watanabe et al. found
11,262 sequences corresponding to 1566 P. vivax genes(10) expressed in asexual and
gametocyte stages found in infected patient blood. Our current microarray analysis
confirms expression of a majority of these cDNAs in blood samples. Almost all (98%) of
these 1566 genes were found to be present above background levels in at least one of the
blood samples, whereas 81% were expressed in either of the two sporozoite samples. The
samples CM012 and CM013 in Group 2 had a higher percentage (98%) expressed
compared to the remaining blood samples representing Group 1 (94%).
An advantage of microarrays with multiple probes per gene is that through
analysis of the distribution of probe intensities for each gene an estimate of transcript
abundance can be obtained, which is comparable to read number for EST sequencing,
given sufficient depth of sequencing to provide accurate quantitation of all transcripts.
For example, Cui et al generated 22,236 EST sequences from Thai blood samples that
contained a mixture of asexual and gametocyte forms(11). We found good concordance
between our data and this EST dataset. Since the P. vivax genome was not sequenced at
the time of this publication, ESTs were previously assigned to GenBank sequences. We
have re-analyzed these sequences by performing a BLAST search of all ESTs against all
P. vivax annotated transcripts. We identified the gene represented by the EST as the top-scoring hit with a match greater than 50bp long (Table S2). We identified 3543 genes
with at least one EST, 2508 with at least two ESTs, 1110 with at least 5 ESTs and 463
with at least 10 ESTs. The numbers of ESTs for genes with fewer ESTs may not
accurately reflect the gene expression due to the stochastic sampling of ESTs by shotgun
sequencing. Therefore, we performed Pearson and Spearman rank correlations between
the Cui, et al. data and our expression data for asexual blood stages. The Pearson and
Spearman Rank correlations were almost the same for all the above subsets of genes. The
Group 2 samples consistently produced higher Pearson (average r=0.52) and Spearman
rank (r=0.35) correlation coefficients than Group 1 for all subsets of the EST data
analyzed. CM101, CM108 and CM109 had the lowest correlations (Average Pearson
r=0.11, 0.04, and 0.04; Average Spearman Rank r=0.2, 0.07, 0.08). CM114 showed
slightly higher correlations (Spearman Rank r=0.24), and CM115 and CM008 were the
best correlated of the asexual Group 1 samples (Pearson r=0.3 and 0.4; Spearman Rank
r=0.25 and 0.21). All data are available for download from our website.
Of the top 25 ESTs, four could not be linked to any current gene annotation, but
17 of the remaining 20 with 6 or more probes on the array (note that data for genes with
fewer than 6 probes should not be included in this calculation) were found in the top 5%
of genes expressed in asexual blood samples. One exception was the gene annotated as
encoding a senescence protein (Pv088865) that had the highest number of ESTs assigned
to it from the Thai EST dataset, but which showed only moderate expression in our
Peruvian samples. This gene is adjacent to an un-annotated ribosomal RNA in P.
falciparum on its 3’ end and BLAST analysis of a highly transcribed region from the 5’
end shows identity to the 35S RNA transcript. A ribosomal RNA gene has recently been
annotated in this syntenic region adjacent to Pv088865. Because we find that ribosomal
RNA transcripts are often very abundant, even in poly-A primed cDNA preparations, it is
likely that the adjacent rRNA promoter contributed to the large number of ESTs assigned
to this gene in the Thai samples. Interestingly, the EST that ranks 4th in the EST project
matches P. falciparum glyceraldehyde 3-phosphate dehydrogenase. The P. vivax
ortholog was not annotated in the PlasmoDB version 5.4 and thus we looked for evidence
that it would be found in the region syntenic to the P. falciparum ortholog. Indeed, we
found a very highly transcribed region showing the predicted exon structure in the
syntenic location on contig 7179 (Figure S1).
Sporozoite-specific gene expression analysis.
Statistical enrichment of motifs in upstream regions of sporozoite specific genes
was determined using GeneSpring 7.3 software, using the “search for regulatory
sequences” with parameters of searching for motifs of 5 to 8 bp within 1200 bp upstream
of the coding region, and probability scores were calculated relative to the upstream
regions of all genes in the genome, and normalized relative to the local nucleotide
frequency. We ran this search for all the genes in the sporozoite-specific OPI cluster
(GNF0006) (Table S3). This resulted in 68 of 210 genes with the 7-bp TGCATGC motif
(p=2.7e-9) and 106 of 210 genes with the 6-bp TGCATG motif (p=1.1e-4). So as to
identify all potential functional occurrences of this motif that may regulate sporozoite
specific genes, we also searched the upstream regions of the top 200 genes expressed in
sporozoites, and all 245 genes with fold change greater than 2.5 fold in sporozoites
relative to average asexual expression. In the top 200 genes expressed, we found 33 with
the 8-bp TGCATGCA motif (p=1.1e-7) and 100 with 6-bp TGCATG motif (p=1.4e-4). In
the genes with greater than 2.5 fold increase in sporozoites, we found 36 of 245 with the
6-bp TGCATG motif (p=4.0e-7), 71 of 245 with the 7-bp TGCATGC motif and 36 of
245 with the 8-bp TGCATGCA motif (p=7.1e-7). So the total number of occurrences of
these motifs in all 430 genes examined was 109 with the 7-bp TGCATGC motif (p=2.0e-7) and 207 with the 6-bp TGCATG motif (p=2.7e-6). Searching the set of 65 SCOT
genes (Table S4) found the 6-bp motif in 54 of 65 genes (p=7.2e-14), the 7-bp motif in 30
genes (p=1.5e-8) and the 8-bp motif in 19 genes (p=7.5e-9). The eight base version of the
motif is identical to one identified by Carlton, et al.(12) in their analysis of 15 genes, and
similar to a sporozoite-specific motif identified by Young et al(7) in P. falciparum. We
have listed all occurrences of all versions of this motif in all these genes in Supplemental
Table S6. The distance of the motif relative to the start of the gene open reading frame is
listed. Examination of the distribution of these distance values showed no position-specific enrichment of this motif relative to the translational start site.
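The kind of motif count and enrichment test described in this section can be illustrated with a short sketch. It is not the GeneSpring procedure: a plain hypergeometric tail probability is substituted for GeneSpring's scoring, no correction for local nucleotide frequency is made, only the given strand is searched, and all function names are hypothetical.

```python
from math import comb


def genes_with_motif(upstream, motif, window=1200):
    """IDs of genes whose upstream window contains the motif.

    `upstream` maps gene ID -> upstream sequence written 5'->3' and
    ending immediately before the start of the coding region.
    """
    motif = motif.upper()
    return {g for g, seq in upstream.items() if motif in seq.upper()[-window:]}


def motif_distances(seq, motif):
    """Distance (bp) from each motif start to the end of the upstream sequence."""
    seq, motif = seq.upper(), motif.upper()
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(len(seq) - i)
        i = seq.find(motif, i + 1)
    return hits


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the motif."""
    top = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return top / comb(N, n)
```

Here `k` would be the number of genes in the tested set that carry the motif, `n` the size of that set, and `K` and `N` the corresponding counts for all genes in the genome.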
The list of P. vivax sporozoite-specific genes used to seed the OPI cluster
includes: S13, MAC/Perforin (Pv000810, PFD0430c); SIAP-1 (Pv000815, PFD0425w);
pf52 protein (Pv001015, Pv001020, PFD0215c); ECP1, cysteine protease (Pv003790,
PFB0325c); asparagine-rich antigen Pfa35-2 (Pv081485, PFA0280w); S24, hypothetical
protein (Pv081555, PFA0205w); TRSP (Pv081560, PFA0200w); TRAP (Pv082735,
PF13_0201); S14, hypothetical protein (Pv084410, PFL0370w); S25, kinesin-related
protein (Pv084580, PFL0545w); MAEBL (Pv092975, PF11_0486); S1, hypothetical
protein (Pv094625, PF10_0083); kinesin-related protein (Pv094710, PFL0545w);
conserved hypothetical protein (Pv097795, PFE0230w); hypothetical protein (Pv118360,
PF14_0404); circumsporozoite (CS) protein (Pv119355, PFC0210c); early transcribed
membrane protein 13, ETRAMP13 (Pv121950, PF13_0012); S23, conserved
hypothetical protein (Pv123155, PF08_0088); S4, conserved hypothetical protein
(Pv123510, PFL0800c); conserved hypothetical protein (Pv123750, PFL1075w).
Comparison of P. vivax Sporozoite gene expression with P. falciparum and P. yoelii
To compare our P. vivax sporozoite gene expression data to previously published
data on P. falciparum (2) and P. yoelii sporozoites(1), we performed Pearson and
Spearman rank correlations on orthologous genes using the Matlab statistical software
package. Orthologous gene mapping was determined using OrthoMCL version 2. There
are a total of 3498 genes with orthologs in all three species for which we have expression
data. These gene sets show equal Pearson correlations of 0.5 for all comparisons of P.
vivax versus P. falciparum and P. yoelii and P. falciparum versus P. yoelii sporozoite
expression datasets. Filtering this set for genes that are significantly differentially
expressed among our P. vivax samples (pANOVA<0.05) and show maximum expression
in sporozoites, leaves 490 genes with gene expression values for orthologs in both of the
other two species. Comparison of this filtered list of 490 orthologous genes is most
appropriate for comparison of sporozoite gene expression since they are differentially
expressed and highest in the sporozoite sample. The 490 gene set shows nearly equal
Pearson correlations of 0.60 for P. vivax versus P. falciparum, 0.65 for P. vivax versus P.
yoelii and 0.64 for P. falciparum versus P. yoelii sporozoite expression datasets.
Therefore, despite species specific gene expression differences, the overall pattern of
gene expression is similar for sporozoites of all three species.
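The cross-species comparison amounts to restricting each dataset to the shared orthologs and correlating the paired values. The study used Matlab; the Python sketch below only illustrates the idea, and the dictionary layout keyed by ortholog group ID and the function names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation; inputs must have nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlate_orthologs(expr_a, expr_b):
    """Correlate two species over ortholog groups present in both.

    Each argument maps an ortholog group ID to one expression value.
    Returns (Pearson r, number of shared ortholog groups).
    """
    shared = sorted(set(expr_a) & set(expr_b))
    r = pearson([expr_a[g] for g in shared], [expr_b[g] for g in shared])
    return r, len(shared)
```

Filtering to differentially expressed genes, as done for the 490-gene set, corresponds to removing keys from the dictionaries before calling `correlate_orthologs`.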
P. falciparum salivary gland sporozoites were obtained from Sanaria, Inc. For
Fig. 3 and Fig. S2, P. falciparum sporozoite RNA was isolated and amplified using
Affymetrix kits as described for P. vivax sporozoite samples. P. falciparum 3D7 strain
RNA from in vitro synchronized trophozoite stage parasites was isolated and amplified as
described for P. vivax samples. Amplified cRNA was hybridized to the Pftiling array
described previously(13). Raw hybridization data for all unique probes was visualized
using custom scripts written in Matlab to prepare Figures 3 and S2.
Quantitative RT-PCR of Sporozoite cDNA
To validate the expression comparison of genes that are differentially expressed in
sporozoites of P. vivax and P. falciparum, we performed quantitative reverse
transcriptase polymerase chain reaction (QRT-PCR) on 22 genes with orthologs in both
species, and two P. vivax-specific genes. Primers used are listed below. All primer sets
were optimized using genomic DNA from 3D7 strain P. falciparum and Salvador I strain
P. vivax at three dilutions of 10 ng/µl, 1 ng/µl and 0.1 ng/µl to ensure that the
amplification threshold values accurately reflected the difference in DNA template
concentration. For all genes, the primer sets for the two species produced similar
threshold Ct values (+/- 1.5 Ct). An additional aliquot of 150,000 sporozoites each of
P. falciparum and P. vivax from Sanaria, Inc. was used to isolate total RNA using Trizol
as described previously. The total RNA sample was split equally into three reactions
to produce single-stranded cDNA using reverse transcriptase and a T7-Oligo dT primer
from the cDNA synthesis kit (Affymetrix) according to the manufacturer's instructions.
The single-stranded cDNA was used as template for QRT-PCR reactions using the primer
sets presented. To account for variability in the input cDNA between different cDNA
reactions from the same species, we normalized the threshold Ct values by the average
difference between reactions across all genes. When comparing P. vivax and P.
falciparum threshold Ct values, we found that the highest expressed gene (CSP) and
lowest expressed gene (Pv117045 zinc finger) in both species showed very similar
threshold Ct values, within 1.5 Ct cycles, which was within the error observed during
DNA optimization. QRT-PCR reactions were prepared using SYBR GREEN PCR
Master Mix (Applied Biosystems) according to the manufacturer's instructions, and were
run on an Applied Biosystems TaqMan machine using SDS 2.2.1 software. Threshold Ct
values were determined using default settings and automatic threshold determination. All
amplification results were manually inspected to ensure that threshold levels were
determined within the logarithmic amplification phase of the reaction for accurate
determination of Ct values. Fold differences between P. falciparum and P. vivax
expression values determined by QRT-PCR are equal to 2 raised to the power of the
difference in Ct values between the two species. QRT-PCR results and primers used are
listed in Supplemental Table S5.
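The Ct arithmetic described above can be illustrated with a short Python sketch. It
is one possible reading of the text, not the procedure actually used. In particular,
the normalization shown here (shifting each cDNA reaction by its mean Ct offset across
all genes) and the sign convention (P. vivax relative to P. falciparum, with a lower
Ct taken to mean higher expression) are assumptions, as are the function names.

```python
import numpy as np

def normalize_replicates(ct):
    """Remove reaction-level Ct offsets within one species.

    ct: 2-D array-like, rows = genes, columns = replicate cDNA reactions.
    Each reaction is shifted by its average difference from the per-gene mean,
    taken across all genes (assumed interpretation of the normalization step).
    """
    ct = np.asarray(ct, dtype=float)
    gene_mean = ct.mean(axis=1, keepdims=True)   # mean Ct of each gene
    offset = (ct - gene_mean).mean(axis=0)       # average offset of each reaction
    return ct - offset

def fold_difference(ct_pf, ct_pv):
    """Fold difference of P. vivax relative to P. falciparum: 2 ** (Ct_Pf - Ct_Pv).

    A value above 1 means the transcript crossed the threshold earlier
    (lower Ct) in P. vivax. The direction is an assumed convention.
    """
    return 2.0 ** (np.asarray(ct_pf, dtype=float) - np.asarray(ct_pv, dtype=float))

# Toy Ct values (made up): two genes measured in three cDNA reactions.
toy_ct = [[20.0, 21.0, 22.0],
          [25.0, 26.0, 27.0]]
toy_normalized = normalize_replicates(toy_ct)
toy_fold = fold_difference(25.0, 22.0)  # a 3-cycle difference
```

Because each PCR cycle ideally doubles the product, a difference of 3 Ct cycles
corresponds to a 2^3 = 8-fold difference in template abundance, assuming comparable
amplification efficiency for the two primer sets.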
Comparison to P. falciparum Sporozoite proteome dataset
Lasonder et al., 2008 identified 478 proteins with at least one peptide, and 349
proteins with at least two peptides in P. falciparum salivary gland sporozoites (14). We
compared these mass spectral counts with our P. falciparum sporozoite gene expression
values reported in Zhou et al., 2008, and the current dataset of orthologous P. vivax
sporozoite gene expression. Spearman rank correlation of the full set of 478 proteins with
P. falciparum and P. vivax expression was 0.2 and 0.099, respectively. Spearman rank
correlation of the set of 349 proteins with at least two peptides with P. falciparum and P.
vivax expression was 0.121 and 0.125, respectively. While numerous genes that are in the
top 1% of P. falciparum genes expressed in sporozoites are represented by more than 15
spectra, there are also many examples of genes expressed at background levels that are
represented by more than 15 spectra. Comparisons with our P. vivax sporozoite
transcriptome show less correlation, due to species-specific differences in gene
expression and regulation.
This low correlation is not unusual given the limited depth of mass
spectrometry and the large number of proteins represented by only one or two spectra.
Mass spectrometry is not as directly quantitative for all proteins expressed at a single
timepoint as microarray transcriptional analysis, due to the stochastic nature of
ionization and the difficulty in obtaining sufficient mass spectra to assay proteins in
moderate or low abundance. There is also not a direct correlation between transcript
abundance and protein abundance, due to post-transcriptional regulation of gene
expression, and delays in translation often result in genes expressed in one life stage
appearing as abundant proteins in the following life stage. Previous analysis from our
group and others has shown better correlation between transcript abundance in one stage
and protein abundance in the following life stage (15, 16).
Therefore, a more appropriate comparison would be between midgut sporozoite
gene expression and salivary gland sporozoite protein spectra, or between salivary gland
sporozoite gene expression and liver stage protein spectra. There is no gene expression
data available for P. vivax or P. falciparum midgut sporozoites or protein expression for
liver stages of P. vivax or P. falciparum. However, we have gene expression data
available for midgut sporozoites of P. yoelii (1) for 419 proteins detected by at least
one peptide in the Lasonder et al. salivary gland sporozoite protein dataset, for 314
proteins detected by at least two peptides, and for 197 proteins detected by at least 5
peptides. We calculate Spearman rank correlations of 0.24 for the full set of 419
proteins with at least one spectrum, 0.2 for the set of 314 proteins with at least two
spectra, and 0.174 for the set of 197 proteins with at least 5 spectra. The low
correlation may be due to comparison of different types of datasets from different stages
of different species. Some of the highest expressed genes do show a good correlation.
For example, thrombospondin-related anonymous protein (TRAP/SSP2) (PF13_0201, PY03052)
is the 4th highest expressed gene overall in P. yoelii midgut sporozoites and is the 7th
highest expressed in the P. falciparum salivary gland sporozoite peptide dataset. Also,
SIAP-1, Sporozoite Invasion-Associated Protein 1 (17) (PFD0425w, PY00455), is the 20th
highest overall expressed gene in P. yoelii midgut sporozoites and is the 6th highest in
the Lasonder P. falciparum salivary gland sporozoite peptide dataset. Additional data
from these stages of P. falciparum or P. vivax may provide better correlation.
Discovery of un-annotated gene expression in P. vivax
To infer gene expression of un-annotated genes in P. vivax, we performed a
BLAST search of all P. falciparum and P. knowlesi annotated genes against the P. vivax
genome to identify all putative orthologous genes that may not be annotated in P. vivax.
The BLAST similarity coordinates were used to define the coding region in P. vivax. We
have not validated the coding sequence for proper gene translation nor have we defined
intron-exon boundaries for these genes. These gene boundary definitions were used to
pick probes to evaluate the level of gene expression from these regions in the same way
as all other annotated P. vivax genes described earlier. These genes were originally
named using the GeneID numbers of their P. falciparum and P. knowlesi orthologs. We
have included the gene expression values for these putative genes in Table S2. We also
performed an analysis of all P. vivax RNA microarray hybridization data to identify
highly transcribed regions of 50 bp that do not overlap with existing gene annotations. We
found a few of these regions, but they appeared to correspond to additional exons,
intronic regions, or 5’ or 3’ untranslated regions of existing genes. One additional gene
identified by this method is the Pv_PF11_0140 gene displayed in Figure S2. We provide
putative P. vivax Gene ID numbers for these genes based on their position relative to
existing flanking genes. We provide a list of these new putative gene coordinates in
Supplemental Table S7.
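The two coordinate-based steps described above (picking probes that fall within a
BLAST-defined region, and keeping only transcribed windows that do not overlap existing
annotations) can be sketched in Python as follows. This is an illustrative
simplification under stated assumptions: intervals are treated as inclusive
(start, end) coordinates on a single chromosome, each probe is represented by a single
position, and the function names and toy coordinates are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if inclusive intervals [a_start, a_end] and [b_start, b_end] share a base."""
    return a_start <= b_end and b_start <= a_end

def probes_in_region(probe_positions, start, end):
    """Return probe positions lying within the inclusive region [start, end]."""
    return [p for p in probe_positions if start <= p <= end]

def unannotated_windows(windows, annotations):
    """Keep candidate windows that overlap none of the annotated gene intervals.

    windows, annotations: lists of (start, end) tuples on the same chromosome.
    """
    return [
        w for w in windows
        if not any(overlaps(w[0], w[1], g[0], g[1]) for g in annotations)
    ]

# Toy coordinates (made up): two annotated genes and four 50 bp candidate windows.
toy_genes = [(100, 500), (900, 1200)]
toy_windows = [(50, 99), (480, 529), (600, 649), (1200, 1249)]
toy_novel = unannotated_windows(toy_windows, toy_genes)
toy_probes = probes_in_region([90, 100, 250, 500, 501], 100, 500)
```

Windows that survive this filter would still need manual inspection, since they may
correspond to additional exons or untranslated regions of neighboring genes.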
We provide the putative gene coding region and amino acid sequence alignment
of the hypothetical orthologs of PKH_141170, a SCOT gene highly expressed in
sporozoites of P. vivax and P. falciparum. The gene is currently annotated in P. knowlesi
and P. chabaudi, but all other genes are new predictions based on BLAST identity to the
>MAL13 | | 527065 to 527262 (reverse-complement)
>MALPY00640 | | 10172 to 10366 (reverse-complement)
>PB_RP2745 | | 7929 to 8134 (reverse-complement)
>CM000455 | | 578725 to 578929 (reverse-complement)
Clustal 2.0.10 multiple sequence alignment of PKH_141170 putative
: *::* .*:: :******** ******* *:*:.**::***:*.
P. falciparum QQLCLSYAYTSPVTVII- 65
PC102342.00.0 PQVCLSYAYTSPVTVII- 66
We provide the putative DNA and protein sequence of the newly annotated gene
Pv096306, and the alignment of its amino acid sequence with putative orthologs in other
Pv096306 (Pv_PKH_031410_MAL7P1.105): 2 exons of 86 and 258 bp, similar to the
P. falciparum ortholog with exons of 86 and 229 bp.
>CM000444 | | 608130 to 608645 (reverse-complement)
>Pv096306 Pv_PKH_031410_MAL7P1.105 coding sequence (introns spliced
>Pv096306 Pv_PKH_031410_MAL7P1.105 putative protein sequence
CLUSTAL 2.0.10 multiple sequence alignment of MAL7P1.105 putative
*.: : : : :*. .: **********::**:* *:: **: ::**: **.: :.:
* **.: :::
References
1. Zhou Y, et al. (2008) Evidence-Based Annotation of the Malaria Parasite's
Genome Using Comparative Expression Profiling. PLoS ONE 3(2):e1570.
2. Le Roch KG, et al. (2003) Discovery of gene function by expression profiling of
the malaria parasite life cycle. Science (New York, N.Y.) 301(5639):1503-1508.
3. Daily JP, et al. (2007) Distinct physiological states of Plasmodium falciparum in
malaria-infected patients. Nature 450(7172):1091-1095.
4. Kidgell C, et al. (2006) A systematic map of genetic variation in Plasmodium
falciparum. PLoS pathogens 2(6):e57.
5. Zhou Y & Abagyan R (2002) Match-only integral distribution (MOID) algorithm
for high-density oligonucleotide array analysis. BMC Bioinformatics 3:3.
6. Young JA, et al. (2005) The Plasmodium falciparum sexual development
transcriptome: a microarray analysis using ontology-based pattern identification.
Molecular and biochemical parasitology 143(1):67-79.
7. Young JA, et al. (2008) In silico discovery of transcription regulatory elements in
Plasmodium falciparum. BMC genomics 9:70.
8. Zhou Y, et al. (2005) In silico gene function prediction using ontology-based
pattern identification. Bioinformatics (Oxford, England) 21(7):1237-1245.
9. Bozdech Z, et al. (2008) The transcriptome of Plasmodium vivax reveals
divergence and diversity of transcriptional regulation in malaria parasites.
Proceedings of the National Academy of Sciences of the United States of America
10. Watanabe J, Wakaguri H, Sasaki M, Suzuki Y, & Sugano S (2007) Comparasite:
a database for comparative study of transcriptomes of parasites defined by
full-length cDNAs. Nucleic acids research 35(Database issue):D431-438.
11. Cui L, et al. (2005) Gene discovery in Plasmodium vivax through sequencing of
ESTs from mixed blood stages. Molecular and biochemical parasitology
12. Carlton JM, et al. (2008) Comparative genomics of the neglected human malaria
parasite Plasmodium vivax. Nature 455(7214):757-763.
13. Dharia NV, et al. (2009) Use of high-density tiling microarrays to globally
identify mutations and elucidate mechanisms of drug resistance in Plasmodium
falciparum. Genome biology 10(2):R21.
14. Lasonder E, et al. (2008) Proteomic profiling of Plasmodium sporozoite
maturation identifies new proteins essential for parasite development and
infectivity. PLoS pathogens 4(10):e1000195.
15. Le Roch KG, et al. (2004) Global analysis of transcript and protein levels across
the Plasmodium falciparum life cycle. Genome research 14(11):2308-2318.
16. Mair GR, et al. (2006) Regulation of sexual development of Plasmodium by
translational repression. Science (New York, N.Y.) 313(5787):667-669.
17. Siau A, et al. (2008) Temperature shift and host cell contact up-regulate
sporozoite expression of Plasmodium falciparum genes involved in hepatocyte
infection. PLoS pathogens 4(8):e1000121.
18. Singh AP, et al. (2007) Plasmodium circumsporozoite protein promotes the
development of the liver stages of the parasite. Cell 131(3):492-504.
19. Simmons D, Woollett G, Bergin-Cartwright M, Kay D, & Scaife J (1987) A
malaria protein exported into a new compartment within the host erythrocyte. The
EMBO journal 6(2):485-491.
20. Doolan DL, et al. (1996) Circumventing genetic restriction of protection against
malaria with multigene DNA immunization: CD8+ cell-, interferon gamma-, and
nitric oxide-dependent immunity. The Journal of experimental medicine
21. Aly AS & Matuschewski K (2005) A malarial cysteine protease is necessary for
Plasmodium sporozoite egress from oocysts. The Journal of experimental