Orthogonality and Orthography: Introducing Measured Distance into Semantic Space Trevor Cohen

Orthogonality and Orthography:
Introducing Measured Distance into Semantic Space
Trevor Cohen¹, Dominic Widdows², Manuel Wahle¹, and Roger Schvaneveldt³
¹ University of Texas School of Biomedical Informatics at Houston
² Microsoft Bing
³ Arizona State University
Abstract. This paper explores a new technique for encoding structured information into a semantic model, for the construction of vector representations of words
and sentences. As an illustrative application, we use this technique to compose
robust representations of words based on sequences of letters, that are tolerant to
changes such as transposition, insertion and deletion of characters. Since these
vectors are generated from the written form or orthography of a word, we call
them ‘orthographic vectors’. The representation of discrete letters in a continuous vector space is an interesting example of a Generalized Quantum model, and
the process of generating semantic vectors for letters in a word is mathematically
similar to the derivation of orbital angular momentum in quantum mechanics.
The importance (and sometimes, the violation) of orthogonality is discussed in
both mathematical settings. This work is grounded in psychological literature on
word representation and recognition, and is also motivated by potential technological applications such as genre-appropriate spelling correction. The mathematical method, examples and experiments, and the implementation and availability
of the technique in the Semantic Vectors package are also discussed.
Keywords: Distributional Semantics, Orthographic Similarity, Vector Symbolic Architectures
1 Introduction
The relationships between words, their representation in text, concepts in the mind, and
objects in the real world have been a source of inquiry for many centuries. Empirical,
distributional paradigms have been shown to successfully derive human-like estimates
of semantic distance from large text corpora, and recent developments in this area have
mediated the enrichment of distributional models with structural information, such as
the relative position of terms [1, 2], and orthographic information describing the configuration of characters from which words are composed [3, 4]. This paper extends these
beginnings in three principal ways. First, we propose a new and very simple method
for encoding structure into semantic vectors, using a quantization of the space between
two extreme ‘demarcator vectors’. This vector generation method performs well in experiments and has some key computational advantages. Second, the method is general
enough to be applied within a wide range of Vector Symbolic Architectures (VSAs),
including Circular Holographic Reduced Representations (CHRRs) that use complex
vectors [5]. Third, we investigate some higher-level compositions as well, demonstrating some early results with compositional representations for sentences.
Each of these developments is related to quantum interaction as follows. Demarcator vector generation is similar in a sense to the derivation of orbital angular momentum
values in quantum mechanics, in that they use the same mathematics. The application
within many VSAs, including those over complex vector space, makes the new methods available within algebras that are particularly related to generalized quantum models. Finally, compositional methods are important in research on generalized quantum
structures, and the way many levels of representation are combined seamlessly in the
compositional models presented here should be of interest to researchers in the field.
2 Orthogonality in Distributional and Orthographic Models
Geometric methods of distributional semantics derive vector representations of terms
from electronic text, such that terms that occur in similar contexts will have similar
vectors [6, 7]. While such models have been shown to approximate human performance
on a number of cognitive tasks [8, 9], they generally do not take into account structural
elements of language, and consequently have been referred to, at times critically, as
“bags of words” models. Emerging approaches to semantic space models have leveraged reversible vector transformations to encode additional layers of meaning into vector representations of terms and concepts. Examples include the encoding of the relative
position of terms [1, 10], syntactic information [11], and orthographic information [4].
The general approach used depends upon the generation of vector representations
for terms, and associating reversible vector transformations with properties of interest.
In accordance with terminology developed in [12], we will refer to the vector representations of atomic components such as terms as elemental vectors. Elemental vectors are
constructed using a randomization procedure such that they have a high probability of
being mutually orthogonal, or close-to-orthogonal. This adds robustness to the model,
by making it highly improbable that elemental vectors would be confused with one another, despite the distortion that occurs during training. However, it also introduces the
implicit assumption that elemental vectors are unrelated to one another, which means
that models generated in this way must be composed of discrete elements.
This limitation notwithstanding, this approach has allowed for the integration of
structured information into distributional models of meaning. From the perspective of
cognitive psychology, this is desirable as it presents the possibility of a unified term
representation that can account for a broad range of experimental phenomena. Recent
work in this area has leveraged circular convolution to generate vectors representing the
orthographic form of words [3], and integrate these with a geometric model of distributional semantics [4]. Vector representations of orthographic word form are generated by
using circular convolution to generate bound products representing the component bigrams of the term concerned, including non-contiguous bigrams. Kachergis et al. give
the following example (~ indicates binding using circular convolution; _ marks a gap) [4]:
word = w + o + r + d + w ~ o + o ~ r + r ~ d
     + w ~ o + (w ~ _) ~ r + (w ~ _) ~ d + (w ~ o) ~ r + ((w ~ _) ~ r) ~ d
     + (w ~ _) ~ d + (o ~ _) ~ d + r ~ d + ((w ~ o) ~ _) ~ d + (o ~ r) ~ d
The vector representation for the term “word” is generated by combining a set of vectors representing unigrams, bigrams, and trigrams of characters. It is a characteristic of
the model employed that each of these vectors has a high probability of being mutually orthogonal, or close-to-orthogonal. So, for example, the vectors representing the
trigrams ((w ~ _) ~ r) and ((w ~ o) ~ r) will be dissimilar from one another. Consequently it is necessary to explicitly encode all of the n-grams of interest, including
gapped trigrams (such as “w_r”) to provide flexibility. From the perspective of computational complexity, this is less than ideal, as the number of representational units that
must be generated and encoded is at least quadratic in the length of the sequence.
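To make this growth concrete, the following sketch enumerates every ordered letter pair in a word, whether contiguous or gapped. It illustrates only the quadratic lower bound, not the exact inventory of units used in [3, 4]; the helper name `open_bigrams` is hypothetical.

```python
from itertools import combinations

def open_bigrams(word):
    """Return every ordered pair of letters, contiguous or gapped.

    Hypothetical helper for illustration; it is not the unit inventory of [3, 4].
    """
    return [first + second for first, second in combinations(word, 2)]

for term in ("word", "orthography"):
    n = len(term)
    pairs = open_bigrams(term)
    # n letters yield n * (n - 1) / 2 ordered pairs, so the count grows quadratically.
    print(term, n, len(pairs), n * (n - 1) // 2)
```

Adding gapped trigrams on top of these pairs only increases the count further, which is the cost that the approach developed in this paper avoids.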
Rather than explicitly encoding character position precisely (with respect to some
other character, or the term itself), alternative models of orthographic representation
allow for a degree of uncertainty with respect to letter position. These approaches measure the relatedness between terms on the basis of the similarity between probability
distributions assigned to the positions of each matching character [13, 14], providing
a more flexible measure of similarity. However, on account of the constraints we have
discussed, only orthographic representations based on discrete bigrams or the exact
position of characters have been combined with other sorts of distributional information in an attempt to generate a holistic representation to date [3, 15]. Imposing near-orthogonality adds robustness, but also necessitates ignoring potentially useful information to do with structure, namely the proximity between character positions within
a word. Consequently, we have selected the generation of orthographic representations
as an example application through which to illustrate the utility of our approach.
The paper proceeds as follows. First, we present the mathematical language we will
use to describe the operators provided by VSAs, a family of representational approaches
based on reversible vector transformations [16]. Next, we will describe an approach
we have developed through which the distance between elemental vectors, and hence
bound products, can be predetermined. In the context of an illustrative application for
orthographic modeling, we show that this approach permits the encoding of structural
information to do with proximity, rather than absolute position, into a distributional
model. We then discuss a relationship between this approach and quantum mechanics,
and conclude with some experimental results and example applications.
3 Mathematical Structure and Methods
3.1 Vector Symbolic Architectures (VSAs)
The reversible vector transformations we have discussed are a distinguishing feature
of a family of representational approaches collectively known as VSAs [16]. In our
experiments the VSAs we will use are Kanerva’s Binary Spatter Code (BSC), which
uses binary vectors [17], and Plate’s CHRR [5], which uses complex vectors where each
dimension represents an angle between −π and π, using the implementation developed
in [10]. In addition, we will use an approach based on permutation of real vectors [2].
Binding is the primary operation facilitated by VSAs (in addition to standard operators for vector superposition and vector comparison). Binding is a multiplication-like
operator through which two vectors are combined to form a third vector C that is dissimilar from either of its component vectors A and B. We will use the symbol “⊗” for
binding, and the symbol “⊘” for the inverse of binding for the remainder of this paper.
It is important that this operator be invertible: if C = A ⊗ B, then A ⊘ C = A ⊘ (A ⊗
B) = B. In some models, this recovery may be approximate, but the robust nature of the
representation guarantees that A ⊘ C is similar enough to B that B can easily be recognized as the best candidate for A ⊘ C in the original set of concepts. Thus the invertible
nature of the bind operator facilitates the retrieval of the information it encodes.
In the case of the BSC, elemental vectors are initialized by randomly assigning 0 or
1 to each dimension with equal probability. Pairwise exclusive or (XOR) is used as a
binding operator: X ⊗ Y = X XOR Y. As it is its own inverse, the binding and decoding
processes are identical (⊗ = ⊘). For superposition, the BSC employs a majority vote: if
the component vectors of the bundle have more ones than zeros in a dimension, this
dimension will have a value of one, with ties broken at random.
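The BSC operations just described can be sketched in a few lines. This is a minimal illustration rather than the implementation used in our experiments: the dimensionality, the seed and the function names are assumptions, and the similarity measure is the 2 × (0.5 − normalized Hamming distance) form used for binary vectors in Section 3.2.

```python
import random

DIM = 10000  # assumed dimensionality for this sketch

def random_binary(rng, dim=DIM):
    """Elemental vector: 0 or 1 in each dimension with equal probability."""
    return [rng.randint(0, 1) for _ in range(dim)]

def bind(x, y):
    """Pairwise XOR binding; applying it twice with the same vector undoes it."""
    return [a ^ b for a, b in zip(x, y)]

def superpose(vectors, rng):
    """Majority vote in each dimension, with ties broken at random."""
    result = []
    for bits in zip(*vectors):
        ones = sum(bits)
        zeros = len(bits) - ones
        if ones > zeros:
            result.append(1)
        elif zeros > ones:
            result.append(0)
        else:
            result.append(rng.randint(0, 1))
    return result

def similarity(x, y):
    """2 * (0.5 - normalized Hamming distance)."""
    hamming = sum(a != b for a, b in zip(x, y)) / len(x)
    return 2 * (0.5 - hamming)

rng = random.Random(0)
A, B, C = (random_binary(rng) for _ in range(3))
bound = bind(A, B)
recovered = bind(A, bound)          # XOR with A again releases B exactly
bundle = superpose([A, B, C], rng)  # remains similar to each component
```

Because XOR is exact, `recovered` equals `B` bit for bit, whereas `bound` is unrelated to both `A` and `B`, and `bundle` retains a measurable similarity to each of its three components.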
In CHRR, binding through circular convolution is accomplished by pairwise multiplication: X ⊗ Y = {X₁Y₁, X₂Y₂, ..., Xₙ₋₁Yₙ₋₁, XₙYₙ}, which is equivalent to
addition of the phase angles of the circular vectors concerned. Binding is inverted by
binding to the inverse of the vector concerned: X ⊘ Y = X ⊗ Y⁻¹, where the inverse
of a vector is its complex conjugate. Elemental vectors are initialized by randomly assigning a phase angle to each dimension. Superposition is accomplished by pairwise
addition of the unit circle vectors, and normalization of the result for each circular
component. In the implementation used in our experiments, normalization occurs after
training concludes, so the sequence in which superposition occurs is not relevant.
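A minimal sketch of these CHRR operations, storing each vector as a list of phase angles, is given below. It is an illustration under our own assumptions (function names, seed, and the mean cosine of phase differences as the similarity measure), not the implementation of [10].

```python
import math
import random

DIM = 500  # complex dimensionality used in our experiments

def random_phases(rng, dim=DIM):
    """Elemental vector: one random phase angle per dimension."""
    return [rng.uniform(-math.pi, math.pi) for _ in range(dim)]

def wrap(angle):
    """Map an angle into the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def bind(x, y):
    """Pairwise complex multiplication, i.e. addition of phase angles."""
    return [wrap(a + b) for a, b in zip(x, y)]

def release(x, bound):
    """Bind with the complex conjugate of x, i.e. subtract its phase angles."""
    return [wrap(c - a) for a, c in zip(x, bound)]

def superpose(vectors):
    """Add unit-circle vectors per dimension and keep only the resulting phase."""
    result = []
    for phases in zip(*vectors):
        real = sum(math.cos(p) for p in phases)
        imag = sum(math.sin(p) for p in phases)
        result.append(math.atan2(imag, real))
    return result

def similarity(x, y):
    """Mean cosine of the phase differences (assumed similarity measure)."""
    return sum(math.cos(a - b) for a, b in zip(x, y)) / len(x)

rng = random.Random(0)
A, B = random_phases(rng), random_phases(rng)
recovered = release(A, bind(A, B))  # recovers B up to floating-point error
bundle = superpose([A, B])
```

Note that this sketch normalizes inside `superpose`, whereas the implementation used in our experiments defers normalization until training concludes.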
Our real vector implementation follows the approach developed by Sahlgren and
his colleagues [2], and differs from the binary and complex implementations, in that
elemental vectors are “bound” to permutations, rather than to other vectors. Elemental
vectors are constructed by creating a high-dimensional zero vector (on the order of 1000
dimensions), and setting a small number of the dimensions of this vector (on the order
of 10) to either +1 or -1 at random. The permutations utilized consist of shifting all of
the elements of a given vector n positions to the right, where each value n is assigned to,
or derived from, the information it is intended to encode. In the case of our orthographic
model, this information consists of the character occurring in a particular position, so
we have used the ASCII value of the character concerned as n. Binding is reversed
by permuting all of the elements of the vector n positions to the left. Superposition is
accomplished by adding the vectors concerned, and normalizing the result.
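The permutation-based scheme can be sketched as follows. The sparse construction and the use of the ASCII value as the shift follow the description above; the cyclic wrap-around of the shift, the seed and the function names are assumptions made for illustration.

```python
import math
import random

DIM = 1000    # "on the order of 1000 dimensions"
NONZERO = 10  # "on the order of 10" nonzero entries

def elemental(rng, dim=DIM, nonzero=NONZERO):
    """Sparse elemental vector with a few randomly placed +1/-1 entries."""
    vec = [0.0] * dim
    for index in rng.sample(range(dim), nonzero):
        vec[index] = rng.choice((1.0, -1.0))
    return vec

def permute(vec, n):
    """Shift every element n positions to the right, wrapping around
    (the wrap-around is an assumption of this sketch)."""
    n %= len(vec)
    return vec[-n:] + vec[:-n] if n else list(vec)

def unpermute(vec, n):
    """Reverse the binding by shifting n positions to the left."""
    return permute(vec, -n)

def superpose(vectors):
    """Add the vectors and normalize the result to unit length."""
    total = [sum(column) for column in zip(*vectors)]
    length = math.sqrt(sum(x * x for x in total))
    return [x / length for x in total]

def cosine(x, y):
    numerator = sum(a * b for a, b in zip(x, y))
    return numerator / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

rng = random.Random(1)
vector = elemental(rng)
bound_w = permute(vector, ord("w"))  # "w" has ASCII value 119
bound_q = permute(vector, ord("q"))  # "q" has ASCII value 113
```

Shifting the same vector by the codes of two different characters yields two vectors that are very likely to be near-orthogonal, which is what lets the character act as the binding key.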
In all models, the “random” initiation of elemental vectors is rendered deterministic
by seeding the random number generator with a hash value derived from a string or
character of interest following the approach developed in [18]. This retains the property of near-orthogonality where desired, while ensuring that random overlap between
elemental vectors is consistent across experiments.
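A sketch of this deterministic initialization is shown below. The choice of SHA-256 and the function names are assumptions; the approach in [18] may use a different hash function.

```python
import hashlib
import random

def seeded_rng(key):
    """Random generator seeded from a hash of `key`.

    SHA-256 is an assumption of this sketch, not necessarily the hash of [18].
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def elemental_bits(key, dim=64):
    """Binary elemental vector that is always the same for the same key."""
    rng = seeded_rng(key)
    return [rng.randint(0, 1) for _ in range(dim)]

first = elemental_bits("w")
second = elemental_bits("w")  # identical to `first` in every run
other = elemental_bits("q")   # a different, but equally reproducible, vector
```

A cryptographic digest is used here instead of Python's built-in `hash`, because string hashes from the latter are randomized per process by default and so would not be stable across experiments.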
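[Figure 1 panels: REAL/COMPLEX, showing demarcators α, 1, 2, 3, ω separated by equal angles θ; BINARY, showing demarcators α, 1, 2, 3, ω, where for bits on which α and ω disagree P(1) is 0.75/0.25, 0.50/0.50 and 0.25/0.75 for demarcators 1, 2 and 3.]
Fig. 1. Interpolation to Generate Five Demarcator Vectors. P(1) = probability of 1.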
3.2 Measured Similarity
The first step in our approach involves generating a set of vectors that are a fixed distance apart, which we will refer to as demarcator vectors (D(position)), as illustrated
in Figure 1. The first pair of demarcator vectors are conventional elemental vectors
D(α) and D(ω), constructed randomly such that they have a high probability of being
mutually orthogonal or close-to-orthogonal. To ensure this with certainty, we render
D(ω) orthogonal to D(α) using the quantum negation procedure, or its binary approximation, described in [19] and [20] respectively. The remaining demarcator vectors are
generated by interpolation. In the continuous vector spaces, this is accomplished by
subdividing the 90° angle between D(α) and D(ω) and generating the corresponding
unit vectors. In binary vector space, this is accomplished by weighting the probability
of assigning a 1 when D(α) and D(ω) disagree in accordance with the desired distance
between the new demarcator vector and each of these extremes⁴. As the vectors representing adjacent numbers are approximately equidistant, the distances between vector
pairs representing numbers the same distance apart should also be approximately equal
(e.g. sim(D(1), D(3)) ≈ sim(D(2), D(4))).
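The interpolation procedure can be sketched for both the continuous and the binary case. This is a minimal illustration under stated assumptions: D(ω) is made orthogonal to D(α) by simple projection in the real case, and by flipping exactly half of the bits in the binary case (a stand-in for the binary approximation to negation in [20]); all function names and the seed are hypothetical.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def normalize(x):
    length = math.sqrt(dot(x, x))
    return [a / length for a in x]

def continuous_demarcators(alpha, omega, count):
    """Interpolate `count` unit vectors from alpha to omega (assumed orthonormal),
    subdividing the 90-degree angle between them evenly."""
    step = (math.pi / 2) / (count - 1)
    return [[math.cos(k * step) * a + math.sin(k * step) * w
             for a, w in zip(alpha, omega)]
            for k in range(count)]

def binary_demarcators(alpha, omega, count, rng):
    """Interpolate `count` bit vectors from alpha to omega. One random draw per
    bit position is shared by the whole set, as footnote 4 requires."""
    draws = [rng.random() for _ in alpha]
    demarcators = []
    for k in range(count):
        p_omega = k / (count - 1)  # chance of copying omega where the extremes disagree
        demarcators.append([w if (a != w and r < p_omega) else a
                            for a, w, r in zip(alpha, omega, draws)])
    return demarcators

def binary_similarity(x, y):
    hamming = sum(a != b for a, b in zip(x, y)) / len(x)
    return 2 * (0.5 - hamming)

rng = random.Random(7)

# Real-valued extremes: omega is made orthogonal to alpha by projection.
alpha = normalize([rng.gauss(0, 1) for _ in range(1000)])
raw = [rng.gauss(0, 1) for _ in range(1000)]
overlap = dot(raw, alpha)
omega = normalize([r - overlap * a for r, a in zip(raw, alpha)])
real_set = continuous_demarcators(alpha, omega, 5)

# Binary extremes: flipping exactly half of alpha's bits gives similarity 0.
bits_alpha = [rng.randint(0, 1) for _ in range(32000)]
flipped = set(rng.sample(range(32000), 16000))
bits_omega = [1 - b if i in flipped else b for i, b in enumerate(bits_alpha)]
binary_set = binary_demarcators(bits_alpha, bits_omega, 5, rng)
```

Sharing one random draw per bit position means that a bit which has switched to the value in D(ω) at one demarcator stays switched at all later ones, so the Hamming distance grows steadily from D(α) to D(ω).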
Table 1 illustrates the pairwise similarities between a set of five demarcator vectors
constructed in this manner. In the binary case vectors of dimensionality 32,000 are used,
in the complex case vectors of dimensionality 500 are used, and in the real case, vectors
of dimensionality 1,000 are used. In all three cases, the relatedness between demarcator
vectors a fixed distance apart is approximately equal. For example, the similarity between all pairs of demarcator vectors two positions apart (e.g. 1 and 3) is approximately
0.5 in the binary implementation, and 0.71 in the complex and real vector implementations. In the binary implementation, the difference in relatedness is proportional to the
difference in demarcator position. This is not the case in the complex or real implementations, where the drop in similarity between demarcator vectors becomes progressively
steeper. This is an artifact of the metric used to measure similarity in each case. With
binary vectors, 2 × (0.5 − normalized Hamming distance) is used, but with continuous vectors the cosine metric is used. While a proportional decrement could be
obtained by measuring the angle between complex vectors directly (or taking the arccos
of this cosine value), we will retain the use of the cosine metric for our experiments.

⁴ The same random number sequence must be used for all vectors in a demarcator set, so that a
consistent random value for each bit position is compared to the relevant thresholds.

Table 1. Pairwise Similarity between Demarcator Vectors

              BINARY                        COMPLEX                        REAL
     α    1    2    3    ω    |    α    1    2    3    ω    |    α    1    2    3    ω
α  1.00 0.75 0.49 0.25 0.00   |  1.00 0.92 0.71 0.38 0.00   |  1.00 0.92 0.71 0.38 0.00
1  0.75 1.00 0.74 0.50 0.25   |  0.92 1.00 0.92 0.71 0.38   |  0.92 1.00 0.92 0.71 0.38
2  0.49 0.74 1.00 0.75 0.51   |  0.71 0.92 1.00 0.92 0.71   |  0.71 0.92 1.00 0.92 0.71
3  0.25 0.50 0.75 1.00 0.75   |  0.38 0.71 0.92 1.00 0.92   |  0.38 0.71 0.92 1.00 0.92
ω  0.00 0.25 0.51 0.75 1.00   |  0.00 0.38 0.71 0.92 1.00   |  0.00 0.38 0.71 0.92 1.00
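The values in Table 1 follow directly from the construction. As a notational assumption for this worked example, index the demarcators 0, ..., n with D(0) = D(α) and D(n) = D(ω), so that n = 4 for the five vectors of Table 1. In the continuous case adjacent demarcators are separated by an angle of π/(2n). In the binary case D(α) and D(ω) disagree on half of their bits, and D(i) and D(j) are expected to differ on a fraction |i − j|/n of those bits, giving an expected normalized Hamming distance of |i − j|/(2n):

```latex
\begin{align*}
\text{continuous:}\quad \mathrm{sim}\bigl(D(i), D(j)\bigr)
  &= \cos\!\Bigl(|i-j|\,\frac{\pi}{2n}\Bigr), \\
\text{binary:}\quad \mathrm{sim}\bigl(D(i), D(j)\bigr)
  &\approx 2\Bigl(0.5 - \frac{|i-j|}{2n}\Bigr) = 1 - \frac{|i-j|}{n}.
\end{align*}
```

With n = 4 the continuous expression gives cos(π/8) ≈ 0.92, cos(π/4) ≈ 0.71 and cos(3π/8) ≈ 0.38 for demarcators one, two and three positions apart, and the binary expression gives 0.75, 0.50 and 0.25, in agreement with Table 1 up to sampling noise.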
3.3 Encoding Orthography
Using controlled degrees of non-orthogonality, we can encode information about the
positions of letters in words into their vector representations. Like spatial encoding
[14] and the overlap model [13], our approach is based upon measuring the difference
in position between matching characters. This is accomplished by creating elemental
vectors for characters, and binding them to demarcator vectors representing positions.
For example, the orthographic vector for the term “word” is constructed as follows:
S(word) += E(w) ⊗ D(1) + E(o) ⊗ D(2) + E(r) ⊗ D(3) + E(d) ⊗ D(4)
As the vectors for characters are mutually orthogonal or near-orthogonal, bound products derived from different characters will be orthogonal or near-orthogonal also. For
example, in the complex vector space used to generate Table 1, sim(E(w)⊗D(1), E(q)⊗
D(1)) = 0. Furthermore, the distance between bound products containing the same
character will approximate the distance between their demarcator vectors. For example,
in the real vector space used to generate Table 1, sim(E(w) ⊗ D(1), E(w) ⊗ D(2)) =
0.92 = sim(D(1), D(2)). Ultimately, the similarity between a pair of terms is derived
from the distance between their matching characters. If this distance is generally low,
the orthographic similarity between these terms will be high⁵. Thus, the models so generated are innately tolerant to variations such as transposition, insertion and deletion of
sequence elements.
⁵ For terms of different lengths, we elected to construct a set of demarcator vectors for each
term. So while D(α) and D(ω) will be identical, the demarcator for a particular position may
differ. It would also be possible to use identical demarcator vectors (by generating a set large
enough to accommodate the longest term), which may be advantageous for some tasks.
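The construction of orthographic vectors can be sketched end to end using the permutation-based real model. Several details here are assumptions made for illustration rather than a description of the released implementation: dense Gaussian extremes, positions 1 to n spaced evenly strictly between D(α) and D(ω), a separate demarcator set per term length (as in footnote 5), and all function names.

```python
import math
import random

DIM = 1000

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def normalize(x):
    length = math.sqrt(dot(x, x))
    return [a / length for a in x]

_rng = random.Random(42)
ALPHA = normalize([_rng.gauss(0, 1) for _ in range(DIM)])
_raw = [_rng.gauss(0, 1) for _ in range(DIM)]
_overlap = dot(_raw, ALPHA)
OMEGA = normalize([r - _overlap * a for r, a in zip(_raw, ALPHA)])  # orthogonal to ALPHA

def demarcator(position, length):
    """D(position) for a term of `length` letters, with positions 1..length
    spaced evenly between the two extremes (an assumed placement)."""
    angle = (math.pi / 2) * position / (length + 1)
    return [math.cos(angle) * a + math.sin(angle) * w for a, w in zip(ALPHA, OMEGA)]

def bind(char, vec):
    """Permutation binding: rotate right by the character's ASCII value."""
    n = ord(char) % len(vec)
    return vec[-n:] + vec[:-n] if n else list(vec)

def orthographic_vector(term):
    """Superpose the bound products of each character and its position."""
    total = [0.0] * DIM
    for position, char in enumerate(term, start=1):
        bound = bind(char, demarcator(position, len(term)))
        total = [t + b for t, b in zip(total, bound)]
    return normalize(total)

def orthographic_similarity(first, second):
    return dot(orthographic_vector(first), orthographic_vector(second))
```

Under these assumptions a local transposition such as “word” versus “wrod” is penalized only slightly, a full reversal such as “drow” more heavily, and a term sharing no letters with the cue scores close to zero.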
Fig. 2. Orbital angular momentum vectors, as derived from quantized states along a given axis
(left), and a related strategy for encoding positions as vectors (right).
4 Demarcator Vectors and Orbital Angular Momentum
Evenly distributing normalized vectors between two orthogonal vectors is one of many
strategies we could adopt to generate demarcator vectors. To generalize this process,
we can describe it as follows:
1. Construct a line in the vector space with a given starting point and direction.
2. Place demarcators along this line using some dividing strategy.
3. (Optional) Project demarcator vectors onto the unit circle to normalize them.
Two examples of this strategy are illustrated in Figure 2. In the example on the left,
the generating line is the vertical axis, the dividing strategy is to mark points at even
intervals along this axis, and the projection strategy is to project orthogonally from the
vertical axis onto the unit sphere⁶. In the example on the right, we have chosen a generating line parallel to the vertical axis, marked points at even intervals, and projected
onto the unit sphere using standard vector normalization. The left-hand strategy will be
familiar to some readers: this is precisely the way orbital angular momentum states are
generated in quantum mechanics. We refer the reader to a text on quantum mechanics
for this derivation, e.g., [21, Ch 14]: the process includes solving a wave equation in
three dimensions, exploring the angular momentum operator and the commutator relations between its components, and noting that each point on the axis can be mapped to
many on the outer sphere (this ambiguity corresponding to the fact that measuring the
component of angular momentum along one axis must leave the component along the
other axes undetermined according to the Uncertainty Principle). The important point
for this discussion is that it is quite standard to derive non-orthogonal vectors for states
in this fashion, not only in generalized quantum structures but in quantum mechanics itself. The underlying spherical harmonic functions involved in angular momentum
are orthogonal to one another under pairwise integration, but lead to several possible
non-orthogonal angular momentum vectors.

⁶ In this example we have drawn negative and positive positions, though in practice we have
only experimented with nonnegative positions so far.
The ambiguity of the projection onto the unit sphere in the angular momentum
model in more than two dimensions is problematic, and we expect the strategy on the
right to be simpler in practice. Note also that neither strategy distributes vectors
evenly around the unit sphere: for example, in the right-hand strategy, the vectors representing positions 3 and 4 are closer to each other than the vectors representing positions
1 and 2. Such flexibility to vary these pairwise similarities between positions along a
string is a desirable property, because changes at the beginning of a word may be more
significant than changes in the middle [13]. We note also that in our current implementation in the Semantic Vectors package, this generalized strategy works well for
real-valued and complex-valued vectors, but is as yet underspecified for binary-valued
vectors because (for example) ‘the point half-way between A and B’ is multiply defined
using Hamming distance [22]. We are currently investigating appropriate strategies to
bring binary vectors into this generalized description.
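The right-hand strategy (a generating line, even spacing, and standard normalization) can be sketched as follows. The two-dimensional coordinates are chosen purely for illustration, and the function names are hypothetical rather than taken from the Semantic Vectors package.

```python
import math

def line_demarcators(start, direction, count, spacing=1.0, project=True):
    """Place `count` points at start + k * spacing * direction, for k = 0..count-1,
    and optionally project each onto the unit sphere by normalization."""
    points = []
    for k in range(count):
        point = [s + k * spacing * d for s, d in zip(start, direction)]
        if project:
            length = math.sqrt(sum(x * x for x in point))
            point = [x / length for x in point]
        points.append(point)
    return points

def cosine(x, y):
    numerator = sum(a * b for a, b in zip(x, y))
    return numerator / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

# A line parallel to the vertical axis, offset by one unit along the horizontal axis.
positions = line_demarcators(start=[1.0, 0.0], direction=[0.0, 1.0], count=5)
sim_1_2 = cosine(positions[1], positions[2])
sim_3_4 = cosine(positions[3], positions[4])
```

In this configuration `sim_3_4` exceeds `sim_1_2`, reproducing the observation that later positions crowd together; changing `start`, `direction` or `spacing` is one way to tune which positions are most sharply distinguished.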
5 Applications: Orthography, Morphology, Sentence Similarity
Table 2 provides examples of nearest-neighbor search based on orthographic similarity in real, complex and binary vector spaces derived from the widely-used Touchstone
Applied Science Associates, Inc. (TASA) corpus. Terms in the corpus that occurred between 5 and 15,000 times were represented as candidates for retrieval. The dimensionality of the real, complex and binary vector spaces concerned was 1000, 500 and 32,000
respectively. These dimensions were chosen so as to normalize the space requirements
of the stored vectors across models, and were retained in our subsequent experiments.
Our approach successfully recovers orthographically related terms, including terms
containing substrings of the original term (“dominic” vs. “condominium”); insertions
(“orthography” vs. “orthophotography”); substitutions (“angular” vs. “annular”)
and transposition of characters (“wahle” vs. “whale”).
Table 2. Orthographic Similarity (REAL, COMPLEX and BINARY spaces)

CUE             Nearest neighbors (similarity)
dominic         dominican 0.92; dominion 0.89; demonic 0.88; dominions 0.85; condominium 0.85
orthography     orthophotography 0.92; photography 0.93; chromatography 0.91; orthographic 0.90; choreography 0.90
orthogonality   orthogonally 0.86; orthogonal 0.85; orthodontia 0.84; ornithology 0.82; ornithologist 0.82
angular         agranular 0.73; annular 0.70; angularly 0.67; gabular 0.66; inaugural 0.66
wahle           whale 0.94; awhile 0.61; while 0.60; whales 0.60; whaley 0.60

While we would be hesitant to propose the simple model of orthographic representation we have developed as a cognitive model of lexical coding, it is interesting to
note that it does conform to the majority of a set of constraints abstracted from lexical
priming data by Hannagan and his colleagues [15]. We will describe these constraints
in brief here, but refer the interested reader to [15] for further details. The constraints
are as follows: (1) Stability: a string should be most similar to itself (sim >= 0.95); (2)
Edge Effects: substitutions at the edges of strings should be more disruptive; (3) Local
Translocations (TL): transposing adjacent characters should be less disruptive than
substituting both of them; (4) Global TL: transposing all adjacent characters should
be maximally disruptive; (5) Distal TL: transposing non-adjacent characters should be
more disruptive than substituting one, but less than substituting both; (6) Compound
TL: TL and substitution should be more disruptive than substitution alone; (7) Distinct
Relative Position (RP): removing some characters should preserve some similarity;
and (8) Repeated RP: removing a repeated or non-repeated letter should be equally
disruptive⁷. Each constraint is accompanied by a set of test cases, consisting of paired
strings, and the degree to which a model meets the constraint is determined from the
estimated similarities between these pairs, and the relationships between them.

Table 3. Comparison with benchmark conditions from [15]. 1X = original dimensionality (used
for Table 2). 2X = twice original dimensionality. nb = no binding. ED = 1 − (edit distance / combined length).
Columns: BINARY (1X, 2X, 2Xnb); COMPLEX (1X, 2X, 2Xnb); REAL (1X, 2X); ED.
Rows: Stability; Edge Effects; Local TL; Global TL; Distal TL; Compound TL; Distinct RP; Repeated RP.
[Stability and Local TL are marked as met in all nine columns; the positions of the remaining marks are not recoverable from this transcription.]
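The ED baseline named in the caption of Table 3 can be sketched as follows. We assume here that “edit distance” denotes Levenshtein distance and that “combined length” is the sum of the two string lengths; both readings, and the function names, are assumptions of this sketch.

```python
def edit_distance(first, second):
    """Levenshtein distance (insertions, deletions, substitutions),
    computed by dynamic programming over one row at a time."""
    previous = list(range(len(second) + 1))
    for i, char_a in enumerate(first, start=1):
        current = [i]
        for j, char_b in enumerate(second, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def ed_similarity(first, second):
    """ED = 1 - edit distance / combined length."""
    return 1 - edit_distance(first, second) / (len(first) + len(second))
```

Under this reading identical strings score exactly 1, so the measure meets the Stability constraint by construction, while an adjacent transposition such as “wahle” versus “whale” counts as two edits.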
The extent to which our models meet these constraints is shown in Table 3. Estimates based on all models consistently meet all constraints aside from those related
to Edge Effects, and the Global and Distal TL constraints (in the latter case this is due
to the fact that translocation of characters one position apart is less disruptive than substituting one of these characters). This represents a better fit to these constraints than
comparable models based on letter distribution only (labeled “nb”, or not bound). It also
represents a better fit than the majority of approaches evaluated against this benchmark
previously [15, 23], providing motivation for the further evaluation of a more developed
model in the future. While the real model appeared to meet the edge effect related constraint at its original dimensionality, this finding did not hold at higher dimensionality,
and was most likely produced by random overlap. This is not surprising given that our
current model does not address edge effects. As suggested earlier, one way to address
this issue would be to increase the distance between peripheral demarcator vectors, a
customization we plan to evaluate in future work.

⁷ As the randomization procedure makes it very unlikely that the estimates of similarity between
any two pairs will be identical, we have considered a difference of <= 0.05 to be approximately equal. This mirrors the relaxed constraint that >= 0.95 is approximately identical used
by Hannagan and his colleagues for the stability constraint.

Table 4. Combining Orthographic and Semantic Similarity (REAL, COMPLEX and BINARY spaces)

CUE     Nearest neighbors (similarity)
think   intend 0.57; know 0.56; thinks 0.51; thinking 0.51; want 0.51
bring   bringing 0.68; brings 0.62; brought 0.53; ring 0.52; burning 0.51
eat     eats 0.44; ate 0.43; meat 0.39; eaten 0.36; restate 0.36
catch   catching 0.191; caught 0.184; catches 0.181; watch 0.176; teach 0.176
write   writes 0.180; writer 0.173; rewritten 0.172; wrote 0.172; reread 0.171
The results in Table 4 were obtained by superposing the orthographic vector for each
term from the TASA corpus with a semantic vector for the same term generated using
the permutation-based approach described in [2], with a 2+2 sliding window. As anticipated by previous work combining orthographic and semantic relatedness [24, 25], the
examples suggest that this model is able to find associations between morphologically
related terms, including those between English verb roots and past tense forms that are related by non-affixal patterns such as “bring:brought”. The combination of semantics
and orthography is evident in other examples, such as “think:intend”. However, this
sensitivity to morphological similarity comes at a cost of introducing false similarity
when common letter patterns do not have a semantic significance.
Table 5. Retrieval of Sentences from the TASA Corpus in complex (first two examples) and
binary (second two examples) spaces

CUE     the greater the force of the air the louder the sounds
0.86    that is the smaller the wavelength the greater the energy of the radiation
0.86    the greater the amplitude the greater the amount of energy in the wave
0.85    the deeper the level of processing the stronger the trace and the better the memory
0.84    the darker the blue the deeper the water

CUE     these four quantum numbers are used in describing electron behavior
0.353   these numbers are important in chemistry
0.3433  these are usually used in the home
0.316   what punctuation marks are used in these four sentences
0.315   these are abbreviations that sometimes are used in written directions
Table 5 illustrates the application of the same approach at a different level of granularity to facilitate the retrieval of similar sentences. The set of superposed semantic and
orthographic vectors used to generate Table 4, as well as randomly generated elemental
vectors for high frequency terms that were not represented in this set, were combined
with sets of demarcator vectors derived from different D(α) and D(ω) vectors than
those used at the character level. This allows for the encoding of phrases and sentences.
For example, the phrase “these numbers” would be encoded as: S(these) ⊗ D(1) +
S(numbers) ⊗ D(2). When applied to sentences from the TASA corpus, this facilitates
the retrieval of sentences describing entities related to one another in similar ways to
those in the cue sentence, a facility that has proved useful for the automated extraction
of propositional information from narrative text [26].
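The sentence-level encoding can be sketched in a complex vector space as follows. This is an illustration under several assumptions: term vectors are random unit-modulus stand-ins for the superposed semantic and orthographic vectors S(term), demarcators are complex linear combinations of two extremes made orthogonal under the real part of the Hermitian inner product, word positions 1 to n are spaced evenly between the extremes, and all names are hypothetical.

```python
import cmath
import math
import random

DIM = 500

def phase_vector(seed):
    """Unit-modulus complex vector, seeded deterministically from a string."""
    rng = random.Random(seed)
    return [cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for _ in range(DIM)]

def inner(x, y):
    """Real part of the Hermitian inner product."""
    return sum((a * b.conjugate()).real for a, b in zip(x, y))

def unit(x):
    length = math.sqrt(inner(x, x))
    return [a / length for a in x]

ALPHA = unit(phase_vector("sentence-alpha"))
_raw = phase_vector("sentence-omega")
_overlap = inner(_raw, ALPHA)
OMEGA = unit([r - _overlap * a for r, a in zip(_raw, ALPHA)])  # orthogonal to ALPHA

def demarcator(position, length):
    """Word-level D(position), interpolated between the two extremes."""
    angle = (math.pi / 2) * position / (length + 1)
    return [math.cos(angle) * a + math.sin(angle) * w for a, w in zip(ALPHA, OMEGA)]

def sentence_vector(sentence):
    """Superpose S(word) bound to D(position) by pairwise multiplication."""
    words = sentence.split()
    total = [0j] * DIM
    for position, word in enumerate(words, start=1):
        term = phase_vector("term:" + word)  # random stand-in for S(word)
        marker = demarcator(position, len(words))
        total = [t + s * d for t, s, d in zip(total, term, marker)]
    return unit(total)

def sentence_similarity(first, second):
    return inner(sentence_vector(first), sentence_vector(second))
```

Because the stand-in term vectors carry no semantic or orthographic information, this sketch rewards only identical words in nearby positions; substituting the vectors used to generate Table 4 would additionally credit related words.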
6 Conclusion
In this paper we have introduced a novel approach through which the near-orthogonality
of elemental vectors is deliberately violated to introduce measured similarity into semantic space. While illustrated primarily through orthographic modeling, the approach
is general in nature and can be applied in any situation in which a representation of sequence that is tolerant to variation is desired. Furthermore, this approach may mediate
the generation of holistic representations combining distributional and spatial information, a direction we plan to explore in future work. To facilitate further experimentation,
our real, binary and complex orthographic vector implementations have been released
as components of the open source Semantic Vectors package [27, 28].
Acknowledgments: This research was supported by US National Library of Medicine
grant R21 LM010826. We would like to thank Lance DeVine, for the CHRR implementation used in this research, and Tom Landauer for providing the TASA corpus.
References
1. M. N. Jones and D. J. K. Mewhort, “Representing word meaning and order information in a
composite holographic lexicon,” Psychological Review, vol. 114, pp. 1–37, 2007.
2. M. Sahlgren, A. Holst, and P. Kanerva, “Permutations as a means to encode order in
word space.,” Proceedings of the 30th Annual Meeting of the Cognitive Science Society
(CogSci’08), July 23-26, Washington D.C., USA., 2008.
3. G. E. Cox, G. Kachergis, G. Recchia, and M. N. Jones, “Toward a scalable holographic
word-form representation,” Behavior Research Methods, vol. 43, pp. 602–615, Sept. 2011.
4. G. Kachergis, G. Cox, and M. Jones, “OrBEAGLE: integrating orthography into a holographic model of the lexicon,” Artificial Neural Networks and Machine Learning–ICANN
2011, pp. 307–314, 2011.
5. T. A. Plate, Holographic Reduced Representation: Distributed Representation for Cognitive
Structures. Stanford, CA: CSLI Publications, 2003.
6. T. Cohen and D. Widdows, “Empirical distributional semantics: methods and biomedical
applications,” Journal of Biomedical Informatics, vol. 42, pp. 390–405, Apr. 2009.
7. P. D. Turney and P. Pantel, “From frequency to meaning: Vector space models of semantics,”
Journal of Artificial Intelligence Research, vol. 37, no. 1, pp. 141–188, 2010.
8. T. Landauer and S. Dumais, “A solution to Plato’s problem: The latent semantic analysis
theory of acquisition, induction, and representation of knowledge,” Psychological Review, vol. 104,
pp. 211–240, 1997.
9. C. Burgess, K. Livesay, and K. Lund, “Explorations in context space: Words, sentences,
discourse,” Discourse Processes, vol. 25, pp. 211–257, 1998.
10. L. De Vine and P. Bruza, “Semantic oscillations: Encoding context and structure in complex
valued holographic vectors,” Proc AAAI Fall Symp on Quantum Informatics for Cognitive,
Social, and Semantic Processes, 2010.
11. P. Basile, A. Caputo, and G. Semeraro, “Encoding syntactic dependencies by vector permutation,” in Proceedings of the EMNLP 2011 Workshop on GEometrical Models of Natural
Language Semantics, GEMS, vol. 11, pp. 43–51, 2011.
12. T. Cohen, D. Widdows, R. Schvaneveldt, and T. Rindflesch, “Finding schizophrenia’s Prozac:
Emergent relational similarity in predication space,” in Proc 5th International Symposium on
Quantum Interactions, Aberdeen, Scotland. Springer-Verlag Berlin, Heidelberg, 2011.
13. P. Gomez, R. Ratcliff, and M. Perea, “The Overlap Model: A Model of Letter Position Coding,” Psychological Review, vol. 115, pp. 577–600, July 2008.
14. C. J. Davis, “The spatial coding model of visual word identification,” Psychological Review,
vol. 117, no. 3, p. 713, 2010.
15. T. Hannagan, E. Dupoux, and A. Christophe, “Holographic string encoding,” Cognitive Science, vol. 35, no. 1, pp. 79–118, 2011.
16. R. W. Gayler, “Vector symbolic architectures answer Jackendoff’s challenges for cognitive
neuroscience,” in Peter Slezak (Ed.), ICCS/ASCS International Conference on Cognitive
Science, (Sydney, Australia. University of New South Wales), pp. 133–138, 2004.
17. P. Kanerva, “Binary spatter-coding of ordered k-tuples,” Artificial Neural Networks—ICANN
96, pp. 869–873, 1996.
18. M. Wahle, D. Widdows, J. R. Herskovic, E. V. Bernstam, and T. Cohen, “Deterministic
Binary Vectors for Efficient Automated Indexing of MEDLINE/PubMed Abstracts,” AMIA
Annual Symposium Proceedings, vol. 2012, pp. 940–949, Nov. 2012.
19. D. Widdows and S. Peters, “Word vectors and quantum logic experiments with negation and
disjunction,” in Proc 8th Math. of Language Conference, (Bloomington, Indiana), 2003.
20. T. Cohen, D. Widdows, L. De Vine, R. Schvaneveldt, and T. C. Rindflesch, “Many Paths
Lead to Discovery: Analogical Retrieval of Cancer Therapies,” in Quantum Interaction (J. R.
Busemeyer, F. Dubois, A. Lambert-Mogiliansky, and M. Melucci, eds.), no. 7620 in Lecture
Notes in Computer Science, pp. 90–101, Springer Berlin Heidelberg, Jan. 2012.
21. D. Bohm, Quantum Theory. Prentice-Hall, 1951. Republished by Dover, 1989.
22. D. Widdows and T. Cohen, “Real, complex, and binary semantic vectors,” in Sixth International Symposium on Quantum Interaction, (Paris, France), 2012.
23. T. Hannagan and J. Grainger, “Protein Analysis Meets Visual Word Recognition: A Case for
String Kernels in the Brain,” Cognitive Science, vol. 36, no. 4, pp. 575–606, 2012.
24. D. Yarowsky and R. Wicentowski, “Minimally supervised morphological analysis by multimodal alignment,” in Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL ’00, (Stroudsburg, PA, USA), pp. 207–216, Association for Computational Linguistics, 2000.
25. M. Baroni, J. Matiasek, and H. Trost, “Unsupervised discovery of morphologically related
words based on orthographic and semantic similarity,” in Proceedings of the ACL-02 workshop on Morphological and phonological learning - Volume 6, MPL ’02, (Stroudsburg, PA,
USA), pp. 48–57, Association for Computational Linguistics, 2002.
26. S. Dennis, “Introducing word order within the LSA framework,” Handbook of Latent Semantic Analysis, 2007.
27. D. Widdows and K. Ferraro, “Semantic vectors: A scalable open source package and online technology management application,” in Sixth International Conference on Language Resources and Evaluation (LREC 2008), 2008.
28. D. Widdows and T. Cohen, “The semantic vectors package: New algorithms and public tools
for distributional semantics,” in Fourth IEEE International Conference on Semantic Computing (ICSC), pp. 9–15, 2010.