Running head: Repeats as Cues in CMC
Letter and Punctuation Mark Repeats as Cues in Computer-Mediated Communication
Yoram M Kalman
Darren Gergle
The Center for Technology and Social Behavior
Northwestern University
Abstract
Analysis of an extensive corpus of half a million e-mail messages reveals abundant use of
letter repeats and of punctuation mark repeats. The role of these repeats as paralinguistic cues in
computer-mediated communication (CMC) is explored through quantitative and qualitative
analysis of the repeats. The findings of this study strengthen the claim that CMC has additional
and important cues, beyond chronemics and emoticons. We show that letter and punctuation
mark repeats appear throughout the Enron Corpus, and that users apply them creatively to
achieve a host of effects which are often analogous to those achieved through paralinguistic cues
in spoken conversation. The quantity and diversity of character repeats in the Corpus point to the
importance of letter and punctuation mark repeats as a CMC cue, and suggest further research to
explore this richness in detail. The paper discusses the way repeats are used as a cue in CMC, the
relation between this cue and paralinguistic cues in spoken conversation, and the manner in
which these cues are constructed. The implications of the findings on hyperpersonal
communication and social information processing in CMC are presented.
Despite earlier claims that text-based Computer-Mediated Communication (CMC) has a
reduced capacity for supporting interpersonal and emotional interactions due to its lack of social
cues, later research has revealed extensive use of CMC to convey subtle messages that support
personal interactions (for a detailed review of cues in CMC see: Walther & Parks, 2002). Yet,
more than a decade after the emergence of the concepts of hyperpersonal communication and
social information processing, which explore the extensive richness afforded by CMC (Walther,
1992, 1996), we are still working to reveal the particular mechanisms by which users verbalize
their relational content or achieve hyperpersonal communication in actual communication. The
majority of research to date is dominated by methodologies that rely upon word-level analyses.
For example, studies have examined emotional expression by exploring word-counts within
particular semantic categories (Hancock, Landrigan, & Silver, 2007), or higher-level structural
measures such as the use of assertions or qualifiers (Guiller & Durndell, 2006). The common
theme underlying these approaches is that when communicating in text-based CMC, personal
expression can be exhibited through word choice.
However, as has been demonstrated in the research of face-to-face communication, word
choice is only one of the components of the message, namely the verbal component. The
nonverbal component includes cues such as gestures, inflection, and pitch. This component is
used, in conjunction with the verbal component, to convey complex and subtle messages
(Burgoon & Hoobler, 2002). In CMC, most of the research that has gone beyond word
choice has focused on emoticons and on chronemic (time-related) cues (Walther & Parks, 2002).
Reference to other cues is sparse, anecdotal in nature, and usually based on relatively small
samples. In his extensive analysis of online language, David Crystal describes paralinguistic and
prosodic cues in CMC, such as repeated letters, repeated punctuation marks, all capital letters,
letter spacing and emphasis using asterisks (Crystal, 2001, pp. 34-35). Fox and colleagues (Fox,
Bukatko, Hallahan, & Crawford, 2007) examined gender differences in the use of textual
conventions such as, among a number of other variables, exclamations, italics, and repeated
letters during Instant Messaging (IM) sessions. Our paper extends this work by taking an
in-depth look at one category of these cues, the use of letter repeats and punctuation mark repeats.
We attempt to understand how these repeats are used to enrich communication in a large dataset
of authentic and unobtrusively collected e-mails. This work is inspired by the analogy between
CMC cues and the nonverbal cues studied by nonverbal communication researchers.
Nonverbal Communication
Burgoon and Hoobler (2002) define nonverbal communication more generally as
“…those behaviors other than words themselves that form a socially shared coding system”
(p.244), and mention three general categories of codes: Visual and auditory codes (kinesics,
physical appearance and vocalics), contact codes (proxemics and haptics), and place and time codes
(environment, artifacts and chronemics). Specifically, vocalic, or paralinguistic cues include audible
behaviors that augment or modify the spoken word, such as pitch, loudness, tempo, pauses, and
inflection.
Does CMC also have the ability to convey the subtle social and interpersonal messages that
nonverbal cues convey in spoken conversation? Studies that have been carried out in the last 15 years
point to an affirmative answer to this question (e.g. Byron & Baldridge, 2007; Kalman & Rafaeli,
2008; Panteli, 2002; Walther & Tidwell, 1995). These studies also suggest some of the underlying
mechanisms that allow people to achieve hyperpersonal communication using CMC: the selection of
the words in the message, the integration of physical appearance cues (emoticons), and chronemic
cues. These findings on cues in CMC raise the question of whether emoticons and chronemic cues
represent the majority of cues that can be conveyed using CMC, or whether CMC has a significantly
richer repertoire of cues that have not yet been explored in depth. In this study, we aim to support the
latter assertion by conducting an in-depth exploration of one category of cues which until today has
only been mentioned anecdotally, for example in Crystal's work described above. Specifically, we
chose to focus on repeats of letters and of punctuation marks. By showing the richness and extent of
usage of one often ignored category of CMC cues, we hope to strengthen the assertion that CMC
cues are as deserving of attention as are nonverbal cues in spoken communication.
Repeats and Sequences
The study of letter repeats, or co-occurrence of letters, has traditionally been carried out
in the context of memory retrieval studies (Jones & Mewhort, 2004a, 2004b) or for computing
applications such as statistical parsing (Collins, 1996) and predictive text-entry methods. In these
cases, the goal of co-occurrence research is to accurately determine the probabilities for the most
common letter bigrams for correctly spelled words in a given language. However, in this paper
we are interested in examining how letter repeats, and violations of proper spelling, can serve a
communicative function such as the repeated letter o in the following excerpt:
Anyone there? Helloooooo!1
A co-occurrence analysis would typically disregard such repeats. Little research has been
published on the prevalence of specific characters and character combinations (namely bigrams)
in text in general, and in CMC in particular. We are aware of no study that went beyond bigrams
in any type of naturally occurring text.
The Research Question
This exploratory study describes the manner in which character repeats are employed in
e-mail messages written and received by employees of a large American corporation.
Specifically, the study asks how consecutive repeats of letters and of punctuation marks are used
in a large dataset of unobtrusively derived e-mail messages.
Method
The Enron Corpus
The Enron Corpus is based on the email archives of Enron Inc., which were confiscated
and published online as a part of the investigation which followed the Enron scandal (Berman,
2003). The original dataset was processed to accommodate the needs of researchers who wish to
explore the archive, and was republished online (Cohen, 2005). The resulting Corpus contains
about 500,000 e-mail messages in .txt format. The corpus, which was used in this study, is at
present the only publicly available dataset of naturally occurring and unobtrusively derived
e-mail messages, and it led to studies in diverse areas related to e-mail communication (e.g.
Chapanond, Krishnamoorthy, & Yener, 2005; Cohen, 2005; Kalman, Ravid, Raban, & Rafaeli,
2006). It provides researchers with a wide selection of naturally occurring e-mail messages that
include both professional and interpersonal communications. Some of the limitations of the
dataset are that the e-mails are at least five to ten years old and are focused on a single US-based
organization. Another downside is that the dataset contains many duplicates and corrupted
messages, as well as other sources of noise such as HTML code and large stretches of characters
which represent ASCII converted text of decompressed file attachments. For this reason, the
dataset still requires substantial cleansing, which we describe in the following sections.
Method of Analysis
A proprietary Python-based tool (“CorpusCruizer”)2 was developed to accommodate the
study of repeats in the Enron Corpus. We begin with a description of the capabilities of
CorpusCruizer, and then detail the way it was applied to the study of repeats in the Enron
Corpus.
CorpusCruizer
CorpusCruizer allows the efficient analysis of the hundreds of thousands of messages in
the Enron Corpus, using Python Regular Expressions (Python Community, 2008). It has three
primary functions which were utilized in this study. The first function is to use regular
expression pattern matching to identify every occurrence of a particular sub-string (e.g. a repeat
of exactly three lower case m's), in the message bodies. Message headers were not analyzed in
this study unless they were included in the body of a message due to forwarding, replying with
quotes, etc. The second function is to extract a random sample of resulting matches. This
sampling allows the researcher to view a more manageable subset of the hundreds or thousands
of messages which contain a specific sequence of characters. The third function is to generate a
concordance style presentation of all of the occurrences of a specific sequence of characters, in
the context of the original flanking text. This allows the efficient export and consequent
visualization and manipulation of hundreds of snippets of texts that include the requested
sequence of characters, as well as the surrounding context in which they appear. In conclusion,
CorpusCruizer is able to efficiently process the hundreds of thousands of files in the Enron
Corpus, and to produce output that permits a general overview of character string frequencies, as
well as an in-depth exploration of the usage of specific consecutive repeats in the context of the
e-mail messages they appear in.
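The three functions described above can be illustrated with a minimal sketch. This is not the CorpusCruizer code, which is not reproduced in this paper; the function names, the `width` and `seed` parameters, and the sample text are all assumptions made for illustration. The sketch relies only on Python's standard `re` and `random` modules.

```python
import random
import re


def find_exact_repeats(text, char, n):
    """Return match objects for runs of exactly n consecutive copies of char."""
    c = re.escape(char)
    # The lookbehind and lookahead reject runs that are longer than n.
    pattern = re.compile(rf"(?<!{c}){c}{{{n}}}(?!{c})")
    return list(pattern.finditer(text))


def sample_matches(matches, k, seed=0):
    """Return a reproducible random sample of at most k matches."""
    rng = random.Random(seed)
    return rng.sample(matches, min(k, len(matches)))


def concordance(text, matches, width=30):
    """Return each match with up to `width` characters of flanking text."""
    lines = []
    for m in matches:
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left}[{m.group()}]{right}")
    return lines


# Hypothetical message body, used only to exercise the functions.
body = "Hmmm, I think you're right. Mmmmm, maybe not. Hmmm."
hits = find_exact_repeats(body, "m", 3)
for line in concordance(body, sample_matches(hits, 2), width=10):
    print(line)
```

In this sketch the run of four lowercase m's in "Mmmmm" is skipped, because the pattern only accepts runs of exactly three.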
In this study, CorpusCruizer was used to identify the occurrence and location of repeats
of the 26 lowercase and uppercase letters, as well as of exclamation marks, question marks and
periods. A result set for each of the sequences was produced by CorpusCruizer, and the number
of occurrences in the dataset was tabulated.
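A tabulation of this kind, with run lengths from 1 to 10 and a cumulative bucket for longer runs, might be sketched as follows. The function name `tabulate_runs`, the bucket label, and the sample string are assumptions for illustration only. Matching is case-sensitive, so uppercase and lowercase runs are counted separately.

```python
import re
from collections import Counter

# Any single character followed by zero or more copies of itself.
RUN = re.compile(r"(.)\1*", re.DOTALL)


def tabulate_runs(text, chars, max_n=10):
    """Count runs of each tracked character by length.

    Runs longer than max_n are pooled under the key '>max_n'.
    """
    table = {c: Counter() for c in chars}
    for m in RUN.finditer(text):
        c = m.group(1)
        if c in table:
            n = len(m.group())
            table[c][n if n <= max_n else f">{max_n}"] += 1
    return table


# Hypothetical text, used only to exercise the function.
sample = "Have a great trip!!!!! Really??? ok... so sorry......"
counts = tabulate_runs(sample, "!?.")
print(counts["!"])
print(counts["?"])
print(counts["."])
```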
Index construction and item classification
The first task was to reduce the thousands of occurrences of repeats in the Corpus to a
manageable list that describes the way repeats were used by the authors of the e-mail messages.
This list is an index of entries. Each entry represents a word in which a repeat was used. Thus, if
the word so appears in different places in the Corpus as so, sooo, soooo and Soooooo, the index
will include one entry, the word so, in which the repeat is marked by an asterisk (i.e. so*). To
facilitate the creation of the index, all cases of three or more lower case letter repeats were
exported and sorted. Duplicate messages were identified, and sources of noise (such as URLs,
e-mail addresses and random strings which included repeats) were removed. At the end of the
process, each entry in the index represents the aggregate of possible repeats (permutations) that
formed variations on the standard (normative) spelling of the entry. For a sample of entries from
the index see Table 1.
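The mapping from stretched spellings to a single asterisk-marked entry can be sketched in code. The text above does not describe an automated normalization step, so the following is a hypothetical approach rather than the procedure used in the study: each run of three or more identical lowercase letters is reduced to one or two letters, and the candidate that appears in an assumed word list (`vocabulary`) is kept.

```python
import re

# Three or more consecutive copies of the same lowercase letter.
RUN3 = re.compile(r"([a-z])\1{2,}")


def index_entry(token, vocabulary):
    """Map a stretched token to an index entry such as 'so*'.

    Each long run is reduced to one or two letters, whichever combination
    yields a word in `vocabulary`; an asterisk marks the repeated letter.
    Returns None if there is no long run or no candidate is a known word.
    """
    token = token.lower()
    runs = list(RUN3.finditer(token))
    if not runs:
        return None
    # Try every combination of keeping 1 or 2 letters per run.
    for mask in range(2 ** len(runs)):
        plain, marked, pos = [], [], 0
        for i, m in enumerate(runs):
            keep = m.group(1) * (1 + ((mask >> i) & 1))
            plain += [token[pos:m.start()], keep]
            marked += [token[pos:m.start()], keep + "*"]
            pos = m.end()
        plain.append(token[pos:])
        marked.append(token[pos:])
        if "".join(plain) in vocabulary:
            return "".join(marked)
    return None


# Hypothetical word list; a real one would be far larger.
vocab = {"so", "sweet", "long"}
for word in ["so", "sooo", "Soooooo", "sweeeet", "loooong"]:
    print(word, "->", index_entry(word, vocab))
```

Trying the single-letter reduction first means that a token such as sooo resolves to so* rather than to a double-letter form, while sweeeet falls through to swee*t because only sweet is in the word list.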
The entries in the index were reviewed and classified by two coders. The coding was
inspired by two observations: The first was that a disproportionately large fraction of the repeats
seemed to be in words which replicate audible sounds (like mmm or boom). The second
observation was that some repeats were much more abundant than other repeats, and that the rare
repeats seemed to be repeats which are difficult to articulate or to “speak out”. Accordingly, the
first classification was concerned with the role of each entry in the index: Is this entry a function
or content word (a “lexical word”), or is it an entry that reproduces a sound? Thus, the entry
lo*ng would be classified as a lexical word, while the entry agh* would be classified as a sound.
The second classification was concerned with whether the repeat could be interpreted as the
written representation of a spoken elongation, or not. The coder looked at the specific letter that
was repeated, in the context of the entry, and classified whether the repeat can or cannot be
articulated as an audible elongation of the sound created by the repeated letter. For example,
swee*t would be classified as articulable if the coder feels that words such as sweeeet are
articulable as the word sweet with an extended e sound in its middle, while the entry help* would
be classified as an un-articulable entry if the coder would think that the repeated p in a word like
helppppp is difficult to articulate since the repeated letter p stops the airflow, not allowing the
articulation of an elongated sound. In the case of a sound created by more than one letter (e.g. ck
or sh), the repeat of even just one of the letters was interpreted as an elongation of the sound (e.g.
russssshhhh, rushhhhh and russsssh are all equivalent elongations of the same terminal sound).
The categories in the first classification (lexical word or sound) and in the second (articulable or
un-articulable) are not mutually exclusive and, because the judgments are subjective, both were carried out by
two independent coders. A small minority of the entries was labelled as unclassifiable by one or
both of the raters. These items were removed, and an inter-rater reliability statistic was
computed. The average of the percentages reported by both raters was used in the reporting.
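For readers unfamiliar with agreement statistics, the following sketch computes Cohen's kappa, one common choice for two raters assigning categorical labels. Its use here is an assumption: the text does not specify how the reliability statistic was computed. The function name and the example labels are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each rater's marginal label rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for eight index entries (not data from the study).
rater_1 = ["word", "word", "sound", "word", "both", "sound", "word", "word"]
rater_2 = ["word", "word", "sound", "both", "both", "sound", "word", "sound"]
print(round(cohens_kappa(rater_1, rater_2), 3))
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters here because one category (lexical words) is much more frequent than the others.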
Usage analysis
CorpusCruizer was used to extract and examine samples of repeat usage in the Corpus.
This included the extraction of various subsets of the corpus using Python regular expressions,
the export of these subsets into word processor files and spreadsheet files, and the qualitative and
quantitative exploration of these files.
Results
Letter repeats
Appendix A details the frequency of occurrences of sequences of n consecutive identical
letters, by case (upper and lower), for n=1-10, as well as cumulatively for n>10. The relative
frequencies of lower case letters and bigrams were compared to those reported by Jones and
Mewhort (2004b), to verify that the distribution of character sequences in the Enron Corpus did
not differ substantially from other linguistic corpora of news, literature, online content, etc. The
exploration of the relative frequencies assisted in the identification of abnormalities in letter
frequency distributions which were apparently not the result of an intentional alteration by the
creators of the e-mail messages.3
The index of words which included repeats comprised 236 entries. Table 1 lists 20
selected examples of words from the index, alongside an illustrative example of each entry, from
the Enron Corpus.
The entries in the index were reviewed and classified, as described in the Method section.
The inter-rater agreement calculated for the first and second classifications was .701 and
.695, respectively.
Seventy two percent of the entries were classified as lexical words, for example, the
words way, too and long in:
please call soon. i must speak with you. it's been waaaaaay toooooooo loooong
Nineteen percent of the entries were classified as sounds, such as shhhh or vrrrrrm:
Shhhh.... it's a SURPRISE
when sure enough off in the distance we hear this vrrrrrm getting steadily louder
and louder.
------------------------
Table 1 about here
------------------------
Nine percent of the entries were classified as both words and sounds, such as the
onomatopoeic buzz:
Buzz. Buzz. Buzzzzzzz. No sound could be more soothing,
In the classification of the entries into articulable and un-articulable repeats, 86% of the
entries were classified as articulable repeats. For example, the repeated o in:
NOOOO! Tell me it isn't so!!!
Or the repeated e and s in:
yeeeeeeeeeeeeesssssssssssssss
Six percent were classified as un-articulable repeats, such as the repeated t in:
I don't know about this, buttttttt you never know
And, in eight percent of the cases, the same entry had both an articulable repeat and an
un-articulable one, for example the repeated i's and g's in the word big:
I love biiiiigggg breakfasts in NIICE restaurants.
Punctuation mark repeats
Appendix A details the frequency of occurrences of sequences of n consecutive identical
exclamation marks, question marks, and periods, for n=1-10, as well as for n>10.
Exclamation mark repeats
Repeats of exclamation marks were used extensively in the dataset. Some examples
include:
Have a great trip!!!!!
they may increase their capacity to take TW gas from 300,000 up to 400,000
MMBtu/d!!
A rather extreme version of usage of exclamation (and question) marks was:
neat! don't worry, i'm excited to cook!!!! i'll try to do it in advance!!!! but if i
can't, i'll need to cook at your place!!!! that ok?!?!?!!!!!! see you soon!!!!! Best!
As can be seen in Appendix A, the most common exclamation mark repeat was a triple
exclamation mark, the next most common repeat was a double exclamation mark, and then, in
declining frequency, four, five, six, seven, eight, nine, and ten consecutive repeats.
Question mark repeats
Repeats of question marks were abundant throughout the dataset4:
Didn't Pillsbury used to represent PG&E? I know they were virtually captive to
Chevron for many years, but I thought PG&E as well????
A nice buck on that side of our property limping in the front, hhmmm....... Jim's
deer?????!!!!!!????????
or simply:
Have you had any sleep lately????
As can be seen in Appendix A, the distribution of frequencies is somewhat inconsistent. The
most frequent consecutive repeat is of two question marks, and the next in frequency are five,
three, four, seven, six, eight, nine and ten repeats.
Period repeats
Repeats of periods were abundant throughout the dataset. Some examples include the
more common ellipsis:
So much wasted time and energy and the well is running dry at the same time the
animosity towards Enron and marketers seems only to be increasing. I like the
proposal to merge the discussions but... You will recall at the WSPP meeting in
Colorado I asked
Still doesn't work...I have even rebooted the whole system.
We also find abundant use of more than three points in a row, such as in:
was that last email the rudest thing i've ever said? so sorry......
No problem if you are too busy, believe me I understand.........or whatever the
reason might be.
An interesting combination of a larger number of repeated periods and an ellipsis is:
Dow jones headline 2day: Enron says no more layoffs at us operations..........I am
laying off 20 percent of my group next week...who will explain the headline to my
group?
Lastly, we see use of a double period, where it is not always clear whether it is a typo or a
shortened ellipsis:
THanks..I have no revisions.
A word of caution, it is remotely possible that some of these individuals may have
enrolled in the very recent past, or have attended and this data not yet reached
Dick's spreadsheet, so do double check that they haven't yet enrolled before
twisting their arms..
In other cases, it seems like the double period is used as a variation on an ellipsis, or a short
pause:
do you think we can get the draft of the CSA before you leave for your offsite..
Frank- want to discuss you coming up to discuss ISDAs.. in the interests of time,
its probably best to do a video confo.. I am out next week.. the following week??
As can be seen in Appendix A, the most frequently used period repeat was an ellipsis
(three consecutive periods), the next highest frequency was of four consecutive periods, and then
two, five, six, seven, eight, nine and ten.
Discussion
Usage of Repeats
Following a discussion of emoticons in e-mail, Naomi Baron asks “Are additional
paralinguistic cues really necessary for sending satisfactory email messages? Probably not”
(Baron, 2000, p. 242). While we agree that these cues might not be necessary for sending
satisfactory e-mail messages, our research reveals extensive use of repeats in e-mail messages.
Are users being redundant and inefficient, or could these repeats fulfill a role? Here is a partial
list of possible roles of repeats.
Repeats seem to indicate the stretching of a word, emulating a stretched out syllable in
spoken conversation:
I was in an electronics store the other night... Panasonic has 9" Portable DVD
player ( like your sony) with an 8 hour battery... $999.00 US. It is sweeeeeeet.
or, more playfully:
Whaaaassssupppp
To denote a change in pitch in:
Yeeeeeeeeehaaaw!!!!!!!!!!
To denote decreased volume in:
sshhhhhh......let's keep it between us
To denote or to fill a pause:
I'm on vacation so send your changes and new items to her by Friday, December
17th.....Thanks....and Merry Christmas everybody......!!!!!
Hmmmm, I think you're right. Looks like the more we can get done tonite, the
better.
Or, to express sounds (paralinguistic alternants; Poyatos, 2002b):
now that i have a 'temporary' plate for the
harley.......vvvvrrrrroooooommmmm..............vvvvvrrrrooooommmmmm!
To denote the musical intonation (of a parody on the song 'American Pie'):
I never worried on the whole way up
Buying dot coms from the back of a pickup truck
But Friday I ran out of luck
It was the day the NAAAASDAQ died
I started singin'
Bye-bye to my piece of the pie
Or, of a birthday song:
Happy birthday to youuuu
Happy birthday to youuuu
Happy birthday dear
To add intonation:
WOOOOOOOOOOHOOOOOOOOOOOOOOOO, Daddy's getting a new Blue
Wave Bay boat!!!! WOOOHOOOO
To express human-made sounds:
And pfffffff, he is away
Such as laughter:
Heeeeeheeee!
Or guttural sounds:
uggggghhhh!!! what a complete and utter pr--k!! i am SO annoyed reading
To denote the rising pitch in an emphasized question:
Interesting.....Is Lay going to retire????
Other repeats seem to focus on producing a visual cue:
lllllllllllllllllooooooooooooooooooovvvvvvvvvvvvvvvvvvvveeeeeeeeeeeeeeeee
Or, to emphasize a question:
Just wondering if you are at work today or not??? If so - we're going to Happy
Hour again
In summary, we see extensive and diverse usage of repeats in e-mails of various types.
The repeats seem to communicate tempo, pitch, prosody and other paralinguistic elements, as
well as to achieve visual emphasis. Despite the wide range of apparent usage patterns, it is
difficult to ascertain objectively which paralinguistic cue is intended. The general challenge of
interpreting nonverbal communication is compounded by the fact that the same text can receive a
different interpretation by different readers.
Given this subjectivity, what evidence do we have that these specific character sequences
attempt to convey a specific cue? One line of support comes from literature on nonverbal
communication in written literature in general (Poyatos, 2002b), and of the way punctuation is
used to convey paralinguistic cues in writing (Poyatos, 2002c). Poyatos's work suggests a
cultural consensus as to how specific paralinguistic cues are communicated in writing. This
cultural consensus is employed by authors and poets in general, and by playwrights in particular,
to more effectively convey subtle linguistic cues (Poyatos, 2002a).
A second line of support comes from our initial findings on the frequencies in the index
of Enron Corpus repeats. We see that close to one third of the items in the index were classified
as either sounds (e.g. shhhh), or as both a sound and a lexical word (e.g. boooom). In addition,
we see that a significant majority of the letter repeats seem to be articulable, and that about half
of those which are un-articulable, appear in a word that also includes at least one articulable
repeat. Thus, we see an abundance of repeats of vowels and of continuant consonants (e.g. m and
s), which, when spoken, allow the continued flow of air, and we see a relative scarcity of repeats
of plosive consonants (e.g. p and b), in which the flow of air in the vocal tract is stopped. The
claim that letter repeats try to emulate an audible paralinguistic cue is supported by the
circumstantial evidence for a link between letter repeats and the way words are articulated in
spoken conversation.
Co-occurrence of CMC Cues
In many of the quotations from the Corpus, it is evident that several cues are employed
simultaneously: we see multiple examples of letter and punctuation mark repeats, strategic use of
uppercase and lowercase letters, alternative spellings, and rich usage of commas, periods, dashes,
and other punctuation marks. This concurrent usage is reminiscent of the co-temporality and
simultaneous use of verbal and nonverbal cues in spoken communication, and should be
explored in this context of text-based CMC. For example, in the following short example we see
a repeated period, a repeated question mark, and a repeated letter:
Good Morning. Either you're extremely busy or ......???? Ummmmm.
This co-occurrence is apparent in the samples cited in this paper, as well as in blog
searches, where postings with letter repeats seem to be accompanied by an even richer variety of
CMC cues such as colored fonts, underlined and bold letters, etc. If this phenomenon of
co-occurrence of CMC cues is confirmed by future research, it might indicate that users who wish to
make their message more expressive through the usage of CMC cues, are more likely to use
more than a single CMC cue to emphasize or to fine-tune their message.
Distribution of punctuation mark repeats
A visual inspection of the distributions of the three punctuation mark repeats shows a
highly asymmetric distribution, which is reminiscent of heavy-tailed distributions such as the
power law distribution (Newman, 2005). In order to corroborate this observation, techniques
need to be developed to separate “real” punctuation mark repeats which were created by users
who were writing sentences in their emails, from repeats which are a result of other automated
mechanisms. For example, question mark repeats often represent a sequence of non-ASCII characters
(e.g. foreign language) which were not recognized by the system when the emails were
converted to .txt files. After this “cleansing” of the Corpus is carried out, it would be interesting
to investigate the distributions in depth: a long tail distribution might indicate more random
processes, whereas deviations from such a distribution might indicate special cases.
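As a first, rough check of the visual impression described above, one could fit a straight line to log frequency against log run length. A power law appears as a straight line on such a plot, and the slope estimates its exponent. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the counts are synthetic rather than taken from Appendix A, and a least-squares fit on log-log axes is only a crude diagnostic, not a rigorous test of power-law behaviour.

```python
import math


def loglog_slope(freq_by_length):
    """Least-squares slope of log(frequency) against log(run length)."""
    points = [
        (math.log(n), math.log(f))
        for n, f in freq_by_length.items()
        if f > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-zero frequencies")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Synthetic counts for runs of 2..10 marks (not Appendix A values),
# generated from a power law with exponent 2.5 and rounded to integers.
counts = {n: round(50000 * n ** -2.5) for n in range(2, 11)}
print(round(loglog_slope(counts), 2))
```

Run lengths whose observed frequency departs sharply from the fitted line would be candidates for the "special cases" mentioned above.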
Generalizability
The Enron Corpus contains emails written by a narrow segment of the population, around
the turn of the millennium. This raises the question of the extent to which the findings reported here
are generalizable beyond this very specific context. The finding that the proportions reported in
Appendix A are in line with the proportions reported by Jones and Mewhort (2004b) suggests that
the level of “noise” in the Corpus is acceptable, and that it is not significantly different from
other more traditional corpora. But this is only an indication, and until the findings reported here
are replicated in other media, their validity is limited. For example, do these cues appear in other
CMC media, such as blogs? Blogs represent the writing of a different segment of the population
(Huffaker & Calvert, 2005), and it is possible to search both contemporary and historical blog
postings from past years. For example, using the blog search engine Google Blog Search
(http://blogsearch.google.com), we can see that the (approximate) number of times the word cool
appeared in English language blogs in 2008 is 22,298,038, and that in the same time period, the
extended version with 3-10 repeats of the letter o appeared approximately 37,408 times. Or, to
take one of the words that combine an articulable and an un-articulable repeat, the word help, we
find that it appeared an estimated 4,099,758 times in the same time period, while the versions with
2-10 l's or with 2-10 p's appeared an estimated 4,217 and 6,817 times, respectively. One last
example from Table 1, the sound yum, appeared in the same time period an estimated 588,105
times, while the version of the same word with 2-10 repeats of the letter m appeared an estimated
49,085 times. These “back of an envelope” results suggest that the phenomena we report from
the Enron Corpus are generalizable to other CMC media and to contemporary times.
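The blog search estimates quoted above are easier to compare as rates. The short calculation below expresses each stretched form as occurrences per 1,000 occurrences of the base spelling. The counts are those given in the text; the dictionary labels and the function name are introduced here for illustration.

```python
# (base-form count, stretched-form count) as estimated in the text above.
blog_counts = {
    "cool, 3-10 o's": (22_298_038, 37_408),
    "help, 2-10 l's": (4_099_758, 4_217),
    "help, 2-10 p's": (4_099_758, 6_817),
    "yum, 2-10 m's": (588_105, 49_085),
}


def per_thousand(base, stretched):
    """Stretched occurrences per 1,000 occurrences of the base spelling."""
    return 1000 * stretched / base


rates = {label: per_thousand(*pair) for label, pair in blog_counts.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:.2f} per 1,000")
```

On these figures the stretched forms of cool and help occur fewer than two times per 1,000 uses of the base word, whereas stretched yum occurs roughly 83 times per 1,000, which is consistent with the observation that sound-like entries attract repeats disproportionately.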
Theoretical Implications: Cues in CMC
Do these findings improve our understanding of CMC cues, in the context of the concepts
of hyperpersonal communication and social information processing? Are character repeats an
element of CMC which can help in coding and decoding social information, and in making
online communication more personal? We believe that our findings that repeats are used in an
analogous manner to nonverbal cues in spoken communication, and that they exhibit many of the
traits of nonverbal communication, such as ambiguity and context dependency, strengthen the
claim that CMC can be a cue-rich communication medium. It is a medium that can draw on the
richness of both spoken and written communication, and in doing so, its more creative users can
employ numerous strategies to increase its richness and expressivity. The use of character and
punctuation mark repeats seems to be one of the mechanisms to achieve this goal, other
mechanisms being chronemic cues and the use of emoticons, as well as other, less studied cues
such as the use of uppercase letters, asterisks, bold and italic letters, colors, fonts, etc.
Limitations and Further Research
This preliminary foray into the realm of character repeats in e-mail messages is limited in
many ways, and most of the generalizations and interpretations presented here require substantial
additional research in order to support or refute them.
The most significant limitation of this paper is its descriptive nature. In order to establish
the role of repeats as CMC cues which have a role that is analogous to nonverbal cues in spoken
conversation, we need to move beyond descriptions and interpretations that are based on the
personal experience of the researchers as CMC users and as CMC scholars. Our claim will
remain tentative until samples of these putative cues are experimentally presented to their
creators as well as to potential addressees, and the linkage between the letter repeats and
paralinguistic cues is established. What was the intention of the creators of these repeats, and
how are the repeats used in interpreting the texts?
The Enron Corpus is extensive and diverse, and includes many examples of task-oriented
messages, as well as of messages of a more interpersonal nature. However, its size, its diversity, and the
haphazard manner in which the dataset was put together by federal officers who confiscated any
e-mail server they were able to find mean that it is also very noisy, full of duplications and
non-e-mail character sequences. These could bias the results in unpredictable ways.
The index of 236 items is only a subset of all items in which a repeat is used in the
Corpus. Further extensive work still needs to be carried out on the Corpus in order to isolate
items with repeats related to upper case letters, as well as repeats of two consecutive letters.
This work focused only on three types of punctuation marks, and did not look at repeats
of other punctuation marks (e.g. dashes) and of other non-letter characters (e.g. asterisks).
Further analysis of the Corpus will help in elucidating the role of these repeats in the e-mails.
This study did not explore the location of the repeated letters in the word, or the location
of words with repeats in the sentence or in the paragraph. It would be interesting to explore
whether these are distributed evenly. For example, it appears as if more of the un-articulable
repeats appear at the end of words, rather than at the beginning or the middle. It is possible that
the dynamics of typing make it easier to repeat a terminal letter than one that is at the beginning
or in the middle of a typed word.
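The positional question raised above could be explored with a short script. The following is only a minimal sketch, not the tool used in this study: the function name `repeat_positions`, the position labels, and the threshold of three or more identical letters (chosen so that ordinary doubled letters are not counted) are all assumptions made for illustration.

```python
import re

# Assumption: a "repeat" means three or more identical letters in a row,
# so that ordinary doubled letters (as in "cool") are not counted.
RUN = re.compile(r"([a-z])\1{2,}")

def repeat_positions(word):
    """Return (letter, run_length, position) for each run of 3+ letters."""
    word = word.lower()
    results = []
    for match in RUN.finditer(word):
        at_start = match.start() == 0
        at_end = match.end() == len(word)
        if at_start and at_end:
            position = "whole word"
        elif at_start:
            position = "beginning"
        elif at_end:
            position = "end"
        else:
            position = "middle"
        run_length = match.end() - match.start()
        results.append((match.group(1), run_length, position))
    return results
```

The function expects a single word with surrounding punctuation already removed. Tallying the position labels over all words in a corpus would show whether repeats are distributed evenly or cluster at the end of words.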
In several of the e-mail messages, letter repeats were used to express stuttering. The
manual sorting of the items that took place during the construction of the index seemed to
indicate that this is a rare usage. Consequently, this usage was not discussed further.
Conclusion
Letter repeats and punctuation mark repeats have been described by previous researchers
as paralinguistic cues used in text-based CMC, but have not been explored systematically. In this
paper we explore the concept of paralinguistic cues, as well as the more general concept of
nonverbal cues, in relation to text-based CMC, in an attempt to establish the theoretical relevance
of the question of CMC cues in relation to the concepts of social information processing and of
hyperpersonal communication. The role of chronemics, and to a lesser extent of emoticons, in
enriching CMC with cues has been established by previous studies, but are character repeats an
example of an additional category of CMC cues, not yet explored? In this paper we report on the
construction of a tool that facilitates the systematic exploration of letter and punctuation mark
repeats, and present findings on the prevalence and the usage of these repeats in a naturalistic
corpus of about half a million e-mail messages.
We find thousands of examples of repeats, which could fulfill paralinguistic roles such as
modified pitch, loudness, tempo, inflection, and more. These thousands of examples of letter
repeats were collapsed to an index of several hundred items, each of which describes a word
from the Corpus in which a repeat was found. An analysis of the properties of these items
showed that a disproportionally large percentage of the items describe sounds, and that many of
the repeats were of letters or sounds which are easily articulated aloud as extended syllables. On
the one hand, these findings strengthen the intuition that repeats are often an attempt to
emulate a vocal paralinguistic cue. On the other hand, the existence of a significant minority of
un-articulable repeats, as well as of repeated punctuation marks, reminds us that repeats can also
be used as purely visual emphasis tools, not necessarily linked to an audible counterpart. Taken
together, these findings should be treated as a confirmation of the suggestions in the literature
that repeats serve as paralinguistic cues. These findings also suggest that we are only beginning
to understand the role of CMC cues in online communication, and the interaction between these
cues and the verbal content of messages. A better understanding of CMC cues will improve our
understanding of the richness and power of text-based CMC.
Acknowledgements
We thank Alberto Gonzalez for his work on the development of CorpusCruizer, Amanda
Yentz for her work on the Enron Corpus, and Joe Walther for helpful and inspiring discussions.
References
Baron, N. S. (2000). Alphabet to Email. New York: Routledge.
Berman, D. K. (2003, 5 October). Online laundry: Government posts Enron's e-mail --- amid
power-market minutiae, many personal items; 'about Wednesday...'. The Wall Street
Journal, p. 1.
Burgoon, J. K., & Hoobler, G. D. (2002). Nonverbal signals. In M. Knapp & J. Daly (Eds.),
Handbook of interpersonal communication (pp. 240-299). Thousand Oaks, CA: Sage.
Byron, K., & Baldridge, D. C. (2007). E-mail recipients' impressions of senders' likability: The
interactive effect of nonverbal cues and recipients' personality. Journal of Business
Communication, 44(2), 137.
Chapanond, A., Krishnamoorthy, M. S., & Yener, B. (2005). Graph Theoretic and Spectral
Analysis of Enron Email Data. Computational & Mathematical Organization Theory,
11(3), 265-281.
Cohen, W. W. (2005). Enron Email Dataset. Retrieved October 27, 2008, from
http://www.cs.cmu.edu/~enron/
Collins, M. J. (1996). A new statistical parser based on bigram lexical dependencies. Paper
presented at the Proceedings of the 34th annual meeting on Association for
Computational Linguistics.
Crystal, D. (2001). Language and the Internet. Port Chester, NY: Cambridge University Press.
Fox, A. B., Bukatko, D., Hallahan, M., & Crawford, M. (2007). The Medium Makes a
Difference: Gender Similarities and Differences in Instant Messaging. Journal of
Language and Social Psychology, 26(4), 389-397.
Guiller, J., & Durndell, A. (2006). 'I totally agree with you': gender interactions in educational
online discussion groups. Journal of Computer Assisted Learning, 22(5), 368-381.
Hancock, J. T., Landrigan, C., & Silver, C. (2007). Expressing emotion in text-based
communication. Paper presented at the Proceedings of the SIGCHI conference on Human
factors in computing systems.
Huffaker, D. A., & Calvert, S. L. (2005). Gender, Identity, and Language Use in Teenage Blogs.
Journal of Computer-Mediated Communication, 10(2).
Jones, M. N., & Mewhort, D. J. K. (2004a). Case-sensitive letter and bigram frequency counts
from large-scale English corpora. Behavior Research Methods, Instruments, &
Computers, 36(3), 388-396.
Jones, M. N., & Mewhort, D. J. K. (2004b). Jones_BRMIC_2004.zip. Retrieved October 27,
2008, from http://www.psychonomic.org/ARCHIVE/
Kalman, Y. M., & Rafaeli, S. (2008). Chronemic nonverbal expectancy violations in written
computer mediated communication. Paper presented at the International Communication
Association.
Kalman, Y. M., Ravid, G., Raban, D. R., & Rafaeli, S. (2006). Pauses and response latencies: A
chronemic analysis of asynchronous CMC. Journal of Computer Mediated
Communication, 12(1), 1-23.
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary
Physics, 46(5), 323-351.
Panteli, N. (2002). Richness, power cues and email text. Information and Management, 40(2),
75-86.
Poyatos, F. (2002a). Functions of Nonverbal Communication in Literature. In F. Poyatos (Ed.),
Nonverbal Communication across Disciplines. Volume 3 : Narrative literature, theater,
cinema, translation (Vol. 3, pp. 153-182). Amsterdam: John Benjamins Publishing
Company.
Poyatos, F. (2002b). Nonverbal communication across disciplines: Paralanguage, kinesics,
silence, personal and environmental interaction (Vol. 2). Amsterdam: John Benjamins
Publishing Company.
Poyatos, F. (2002c). Punctuation as nonverbal communication. In F. Poyatos (Ed.), Nonverbal
Communication across Disciplines. Volume 3 : Narrative literature, theater, cinema,
translation (Vol. 3, pp. 125-151). Amsterdam: John Benjamins Publishing Company.
Python Community. (2008). Regular Expression Syntax. Retrieved October 27, 2008, from
http://www.python.org/doc/2.5.2/lib/re-syntax.html
Walther, J. B. (1992). Interpersonal Effects in Computer-Mediated Interaction - a Relational
Perspective. Communication Research, 19(1), 52-90.
Walther, J. B. (1996). Computer-mediated communication: Impersonal, interpersonal, and
hyperpersonal interaction. Communication Research, 23(1), 3-43.
Walther, J. B., & Parks, M. R. (2002). Cues filtered out, cues filtered in. In M. Knapp & J. Daly
(Eds.), Handbook of interpersonal communication (pp. 529-563). Thousand Oaks, CA:
Sage.
Walther, J. B., & Tidwell, L. C. (1995). Nonverbal cues in computer-mediated communication,
and the effect of chronemics on relational communication. Journal of Organizational
Computing, 5, 355-378.
Table 1
Twenty examples of items from the Enron Corpus index. Letters with an asterisk represent
letter repeats that appear in email messages

Item            Example
ag*h*           Agghhh, it's so nice to get email from you
all*            he slept alllll afternoon!
ba*d            I'm a baaaad person.
brr*            I bet you love the temperature today (BRRRR).
coo*l           That's soooooo coooool
cra*sh          Bang! Booommm! Craaash!...What was that???
fa*r            dude, that is faaaaar.
free*zing       it is freeeezing in here.
goo*d           As they say in Alabama, "It's all gooood."
hm*             Well, hmmm, lets see.
hel*p*          I forgot which one. Helllllpppp.
lo*ve           we looove the kiddush cup, how should we clean it??????
m*              Mmmmmm, stuffed mushrooms!!!!!
oo*ps           Ooops, forgot the redline for changes in Exhibit A.
p*l*e*a*s*e*    Thanks....and pllleeeeease forgive me and tell me your husband's first name
slow*           It is slowwwww.
so*             You are sooooooo crazy!!!!
swee*t          Check out the new Tyan S2466 motherboard. Sweeeet!.
was*u*p*        HEY WASSSUPPP?
yum*            Complete with chocolate covered strawberries! Yummm.
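
The asterisk notation in Table 1 can be illustrated with a small sketch. It is not the procedure used to build the index, which involved manual sorting; it only shows, under stated assumptions, how a single observed word could be compared with its standard spelling. The function names `run_lengths` and `index_item` are hypothetical, and the standard spelling is assumed to be supplied by the caller.

```python
from itertools import groupby

def run_lengths(word):
    """Run-length encode a word, ignoring case: 'cool' -> c1, o2, l1."""
    return [(ch, len(list(group))) for ch, group in groupby(word.lower())]

def index_item(standard, observed):
    """Mark with '*' each letter that the observed word stretches.

    Returns None when the observed word is not a stretched form of the
    standard spelling (different letters, or a shorter run).
    """
    base = run_lengths(standard)
    seen = run_lengths(observed)
    if [ch for ch, _ in base] != [ch for ch, _ in seen]:
        return None
    parts = []
    for (ch, n_base), (_, n_seen) in zip(base, seen):
        if n_seen < n_base:
            return None
        parts.append(ch * n_base + ("*" if n_seen > n_base else ""))
    return "".join(parts)
```

Note that an item in the table may combine several observed words: "was*u*p*" marks every letter that is stretched in at least one occurrence, whereas this sketch marks only the letters stretched in the one word it is given.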
Appendix A
Occurrences of Letter and punctuation mark Repeats in the Enron Corpus, by Character and by Number of Repeats
Char         1        2       3      4      5      6      7      8      9     10    11+
a     48423405    24410     273    100     52     14      9      4      5      2     41
b      8262795    67478      71      5      0      0      0      0      0      0      0
c     19500604   630631     157     50      3    153      0      0      3      0      0
d     20110786   323615      95      7      0     26      0      0      0      0      0
e     68439146  2186204     534     60     16    103      3      7      8      7     14
f      9458663   912662     150    511      5   2827      2      0      0      0      0
g     11703362   140204     147     10      4      0      1      0      0      0      0
h     22452327     8297     155     68     62     10     19     11      5      0     39
i     41851910    14322    2758     11      4     42      1      4     40      0      7
j      1200570     2500      30      2      0      1      0      0      0      0      0
k      5235647     7012      39     27      3      0      0      3      0      0      0
l     18345988  3676890     469     12     13      4      6      0      3      1     14
m     14616403   642590     562     99     30     27      7      1      9      2      5
n     44375225   617294     265     17      5      4      1      0      4      0      0
o     46973573   903036     513    242    170     74     26     37     11     23     96
p     11323667   691563     336     18      3      4      6      0      0      0      0
q       722228      699      26      0      0      0      0      0      0      0      0
r     39292214   723565     426     39     10      6     11      0      1      1      2
s     32093025  2219836     623     61     28     21      0      3     10      0     34
t     48065597  1407495     356     10      3      0      1      0      0      2      0
u     16914169     2433      35     44      2      0      6      3      2      0     16
v      6244554     1817      20      3      3      0      0      0      0      0      3
w      9222261     9089  221888     23     44      1      0      0      0      0      2
x      1536148     5865     164    211     25     43     45      5      4      0    107
y     11757553     1999      29     50      9      1      0      0      1      0     41
z       684745    18777     218     15      8      1      3      0      1      0      0

Char         1        2       3      4      5      6      7      8      9     10    11+
A      5323480   320437   81619  71385  22193  16119  12093  18198  13697   4528  41714
B      2389952    18561    2828     11      1      0      2      0      0      0      0
C      5653905    44843     392     81      0    343      0      0      0      0      0
D      3144278    13821     104      3      0     19      0      0      0      0      0
E      7004663   252425     589      8     23     37      0     12      1      0     27
F      2267466    20499      88    219      8   2310      0      0      0      0      0
G      1891967     8146      55      3      2      1      0      0      0      0      0
H      2229027     2245      32     38     24      9      3      3      3      0      4
I      4050987    18293    7574      6      2      0      0      0      0      0      7
J      1398623     1651      78     24      0      0      0      0      0      0      0
K       941623     2033      44      3      0      0      0      0      0      0      0
L      2031536    83349      39      9      3      2      1      0      0      0      0
M      4076985    43887     426     15      3      8      2      2      0      0      0
N      3704492    21271      58     35      2     12      0      0      0      0     11
O      3623388    28812      87     25     20     62      1      4      1      7     26
P      3330653    35525     346     40      0     24      0      0      0      0      1
Q       448785     4862     141      0      0      9      0      0      0      0      3
R      3316230    18979     212     10      3      0      2      0      0      4      3
S      5117713    70596     101     90     23      1      0     27      0      0      3
T      6113182    24257      95      6      2      0      0      0      0      0      0
U      1580966     1682      47      8      3      3      0      0      0      0      0
V       794723     1306      16      2      0      0      0      0      0      0      0
W      1717823     1421     478     30      1      0      0      4      0      0      0
X       383982     2478     329    942    151     28     12      4      7      0      7
Y       767325     2037      46    167      4      0      0      0      0      0      1
Z       284323     1373      31     15      0      0      1      0      1      0      1

Char         1        2       3      4      5      6      7      8      9     10    11+
!       235659    12413   33652   3071   1394    572    299    226    211    128    920
?       715624    44149   16551   6927  23394   1917   5489   1040    866    609  14408
.     10186044    12428   73409  18821   6693   3335   1822   1331    809    813   6449
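
A table of this shape could be tallied with a single regular expression. The sketch below is illustrative only and is not the tool built for this study. It assumes that each cell counts maximal runs of exactly that length, that letter case is kept distinct (as in the two letter tables), and that runs of eleven or more characters share the last column. The names `RUN`, `MAX_BUCKET`, and `count_runs` are hypothetical.

```python
import re
from collections import Counter

# One letter or one of the three punctuation marks, followed by zero or more
# copies of the same character. Matching is case-sensitive, so "a" and "A"
# are tallied separately.
RUN = re.compile(r"([A-Za-z!?.])\1*")

# Assumption: runs of 11 or more characters share the "11+" column.
MAX_BUCKET = 11

def count_runs(text):
    """Count maximal runs, keyed by (character, run length capped at 11)."""
    counts = Counter()
    for match in RUN.finditer(text):
        length = min(len(match.group(0)), MAX_BUCKET)
        counts[(match.group(1), length)] += 1
    return counts
```

Calling `count_runs` on each message and summing the resulting counters gives one row per character and one column per run length. Such a raw tally still contains the spurious repeats described in the endnotes.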
Endnotes
1. This is a direct quote from the Enron Corpus. For a description of the Corpus, see the Method
section. All excerpts from the Enron Corpus in this paper will appear verbatim, as indented and
italicized texts.
2. Available from the authors.
3. An examination of a random sample of each category of repeats revealed some letter
sequences that result from repetitions which are less relevant to this study. Examples include: (i)
Letter bigrams that are common in normal text (such as ee or ll) tend to appear more often as letter
triplets too (eee, lll), apparently as a result of typos. For example, “Don't worry about tellling her my
travel schedule.” or “Thanks for taking time out of your busy schedules to meeet with me on the 28th
of December - especially during the holiday season”. (ii) Specific repeats seem to be a result of
computer code, especially HTML color codes such as #ffffff or #ffff00, or of other components
which appear in e-mails. Examples are the sequence www, which is common in links, and sequences of
many upper case A's, which appear frequently in e-mail attachments that have been converted into
ASCII text.
4. Many of the question mark repeats were not created intentionally by writers, but are the result
of the conversion into text (.txt) files of characters not included in the character set. Such unidentified
characters are replaced by question marks. These repeats are included in the table in Appendix A.
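
Some of the spurious repeats described in the third endnote could be reduced by removing code-like sequences before counting. The following is a minimal sketch under stated assumptions and was not part of this study: it handles only six-digit HTML color codes and link-like tokens, and the patterns and the name `strip_code_noise` are hypothetical.

```python
import re

# Six-digit HTML color codes such as #ffffff or #ffff00.
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}\b")

# Link-like tokens that start with http://, https:// or www.
URL = re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE)

def strip_code_noise(text):
    """Replace color codes and link-like tokens with a single space."""
    text = HEX_COLOR.sub(" ", text)
    return URL.sub(" ", text)
```

Typos such as "tellling" and long runs of A's from converted attachments are not addressed by this filter and would need separate treatment.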