Repeats in CMC 1 Running head: Repeats as Cues in CMC
Letter and Punctuation Mark Repeats as Cues in Computer-Mediated Communication

Yoram M Kalman
Darren Gergle
The Center for Technology and Social Behavior
Northwestern University

Abstract

Analysis of an extensive corpus of half a million e-mail messages reveals abundant use of letter repeats and of punctuation mark repeats. The role of these repeats as paralinguistic cues in computer-mediated communication (CMC) is explored through quantitative and qualitative analysis of the repeats. The findings of this study strengthen the claim that CMC has additional and important cues, beyond chronemics and emoticons. We show that letter and punctuation mark repeats appear throughout the Enron Corpus, and that users apply them creatively to achieve a host of effects which are often analogous to those achieved through paralinguistic cues in spoken conversation. The quantity and diversity of character repeats in the Corpus point to the importance of letter and punctuation mark repeats as a CMC cue, and suggest further research to explore this richness in detail. The paper discusses the way repeats are used as a cue in CMC, the relation between this cue and paralinguistic cues in spoken conversation, and the manner in which these cues are constructed. The implications of the findings for hyperpersonal communication and social information processing in CMC are presented.

Despite earlier claims that text-based Computer-Mediated Communication (CMC) has a reduced capacity for supporting interpersonal and emotional interactions due to its lack of social cues, later research has revealed extensive use of CMC to convey subtle messages that support personal interactions (for a detailed review of cues in CMC see: Walther & Parks, 2002).
Yet, more than a decade after the emergence of the concepts of hyperpersonal communication and social information processing, which explore the extensive richness afforded by CMC (Walther, 1992, 1996), we are still working to reveal the particular mechanisms by which users verbalize their relational content or achieve hyperpersonal communication in actual communication. The majority of research to date is dominated by methodologies that rely upon word-level analyses. For example, studies have examined emotional expression by exploring word-counts within particular semantic categories (Hancock, Landrigan, & Silver, 2007), or higher-level structural measures such as the use of assertions or qualifiers (Guiller & Durndell, 2006). The common theme underlying these approaches is that when communicating in text-based CMC, personal expression can be exhibited through word choice. However, as has been demonstrated in research on face-to-face communication, word choice is only one of the components of the message, namely the verbal component. The nonverbal component includes cues such as gestures, inflection, and pitch. This component is used, in conjunction with the verbal component, to convey complex and subtle messages (Burgoon & Hoobler, 2002). In CMC, most of the research that went beyond word choice focused on emoticons and on chronemic (time-related) cues (Walther & Parks, 2002). Reference to other cues is sparse, anecdotal in nature, and usually based on relatively small samples. In his extensive analysis of online language, David Crystal describes paralinguistic and prosodic cues in CMC, such as repeated letters, repeated punctuation marks, all capital letters, letter spacing and emphasis using asterisks (Crystal, 2001, pp. 34-35).
Fox and colleagues (Fox, Bukatko, Hallahan, & Crawford, 2007) examined gender differences in the use of textual conventions during Instant Messaging (IM) sessions, including, among a number of other variables, exclamations, italics, and repeated letters. Our paper extends this work by taking an in-depth look at one category of these cues, the use of letter repeats and punctuation mark repeats. We attempt to understand how these repeats are used to enrich communication in a large dataset of authentic and unobtrusively collected e-mails. This work is inspired by the analogy between CMC cues and the nonverbal cues studied by nonverbal communication researchers.

Nonverbal Communication

Burgoon and Hoobler (2002) define nonverbal communication more generally as “…those behaviors other than words themselves that form a socially shared coding system” (p. 244), and mention three general categories of codes: visual and auditory codes (kinesics, physical appearance and vocalics), contact codes (proxemics and haptics), and place and time codes (environment, artifacts and chronemics). Specifically, vocalic, or paralinguistic, cues include audible behaviors that augment or modify the spoken word, such as pitch, loudness, tempo, pauses, and inflection. Does CMC also have the ability to convey the subtle social and interpersonal messages that nonverbal cues convey in spoken conversation? Studies that have been carried out in the last 15 years point to an affirmative answer to this question (e.g. Byron & Baldridge, 2007; Kalman & Rafaeli, 2008; Panteli, 2002; Walther & Tidwell, 1995). These studies also suggest some of the underlying mechanisms that allow people to achieve hyperpersonal communication using CMC: the selection of the words in the message, the integration of physical appearance cues (emoticons), and chronemic cues.
These findings on cues in CMC raise the question of whether emoticons and chronemic cues represent the majority of cues that can be conveyed using CMC, or whether CMC has a significantly richer repertoire of cues that have not yet been explored in depth. In this study, we aim to support the latter assertion by conducting an in-depth exploration of one category of cues which until today has only been mentioned anecdotally, for example in Crystal’s work described above. Specifically, we chose to focus on repeats of letters and of punctuation marks. By showing the richness and extent of usage of one often ignored category of CMC cues, we hope to strengthen the assertion that CMC cues are as deserving of attention as are nonverbal cues in spoken communication.

Repeats and Sequences

The study of letter repeats, or co-occurrence of letters, has traditionally been carried out in the context of memory retrieval studies (Jones & Mewhort, 2004a, 2004b) or for computing applications such as statistical parsing (Collins, 1996) and predictive text-entry methods. In these cases, the goal of co-occurrence research is to accurately determine the probabilities for the most common letter bigrams for correctly spelled words in a given language. However, in this paper we are interested in examining how letter repeats, and violations of proper spelling, can serve a communicative function such as the repeated letter o in the following excerpt:

Anyone there? Helloooooo!¹

A co-occurrence analysis would typically disregard such repeats. Little research has been published on the prevalence of specific characters and character combinations (namely bigrams) in text in general, and in CMC in particular. We are aware of no study that went beyond bigrams in any type of naturally occurring text.

The Research Question

This exploratory study describes the manner in which character repeats are employed in e-mail messages written and received by employees of a large American corporation.
Specifically, the study asks how consecutive repeats of letters and of punctuation marks are used in a large dataset of unobtrusively derived e-mail messages.

Method

The Enron Corpus

The Enron Corpus is based on the email archives of Enron Inc., which were confiscated and published online as a part of the investigation which followed the Enron scandal (Berman, 2003). The original dataset was processed to accommodate the needs of researchers who wish to explore the archive, and was republished online (Cohen, 2005). The resulting Corpus contains about 500,000 e-mail messages in .txt format. The corpus, which was used in this study, is at present the only publicly available dataset of naturally occurring and unobtrusively derived email messages, and it led to studies in diverse areas related to e-mail communication (e.g. Chapanond, Krishnamoorthy, & Yener, 2005; Cohen, 2005; Kalman, Ravid, Raban, & Rafaeli, 2006). It provides researchers with a wide selection of naturally occurring e-mail messages that include both professional and interpersonal communications. Some of the limitations of the dataset are that the e-mails are at least five to ten years old and are focused on a single US-based organization. Another downside is that the dataset contains many duplicates and corrupted messages, as well as other sources of noise such as HTML code and large stretches of characters which represent ASCII converted text of decompressed file attachments. For this reason, the dataset still requires substantial cleansing, which we describe in the following sections.

Method of Analysis

A proprietary Python-based tool (“CorpusCruizer”)² was developed to accommodate the study of repeats in the Enron Corpus. We begin with a description of the capabilities of CorpusCruizer, and then detail the way it was applied to the study of repeats in the Enron Corpus.
CorpusCruizer

CorpusCruizer allows the efficient analysis of the hundreds of thousands of messages in the Enron Corpus, using Python Regular Expressions (Python Community, 2008). It has three primary functions which were utilized in this study. The first function is to use regular expression pattern matching to identify every occurrence of a particular sub-string (e.g. a repeat of exactly three lower case m’s) in the message bodies. Message headers were not analyzed in this study unless they were included in the body of a message due to forwarding, replying with quotes, etc. The second function is to extract a random sample of resulting matches. This sampling allows the researcher to view a more manageable subset of the hundreds or thousands of messages which contain a specific sequence of characters. The third function is to generate a concordance style presentation of all of the occurrences of a specific sequence of characters, in the context of the original flanking text. This allows the efficient export and consequent visualization and manipulation of hundreds of snippets of texts that include the requested sequence of characters, as well as the surrounding context in which they appear. In conclusion, CorpusCruizer is able to efficiently process the hundreds of thousands of files in the Enron Corpus, and to produce output that permits a general overview of character string frequencies, as well as an in-depth exploration of the usage of specific consecutive repeats in the context of the e-mail messages they appear in.

In this study, CorpusCruizer was used to identify the occurrence and location of repeats of the 26 lowercase and uppercase letters, as well as of exclamation marks, question marks and periods. A result set for each of the sequences was produced by CorpusCruizer, and the number of occurrences in the dataset was tabulated.
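The three functions described above can be illustrated with a minimal sketch. It is not the CorpusCruizer code, which is not reproduced in this paper; the function names (`exact_repeat_pattern`, `find_repeats`, `sample_matches`, `concordance`) and the in-memory list of message bodies are assumptions made for illustration only.

```python
import random
import re


def exact_repeat_pattern(char, n):
    """Regex for a run of exactly n copies of char (not part of a longer run)."""
    c = re.escape(char)
    # The lookbehind and lookahead reject runs that continue on either side.
    return re.compile(rf"(?<!{c}){c}{{{n}}}(?!{c})")


def find_repeats(bodies, char, n):
    """Return (message_index, start, end) for every exact-length run found."""
    pattern = exact_repeat_pattern(char, n)
    return [(i, m.start(), m.end())
            for i, body in enumerate(bodies)
            for m in pattern.finditer(body)]


def sample_matches(matches, k, seed=0):
    """Draw a reproducible random sample of at most k matches."""
    rng = random.Random(seed)
    return rng.sample(matches, min(k, len(matches)))


def concordance(bodies, matches, width=30):
    """Show each match in brackets, flanked by up to `width` characters."""
    lines = []
    for i, start, end in matches:
        body = bodies[i]
        left = body[max(0, start - width):start]
        right = body[end:end + width]
        lines.append(f"{left}[{body[start:end]}]{right}")
    return lines
```

For instance, `find_repeats(bodies, "m", 3)` would locate runs of exactly three lower case m's, and passing the result to `concordance` would yield snippets suitable for export to a spreadsheet. Reading the message files from disk and excluding headers are left out of this sketch.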
Index construction and item classification

The first task was to reduce the thousands of occurrences of repeats in the Corpus to a manageable list that describes the way repeats were used by the authors of the e-mail messages. This list is an index of entries. Each entry represents a word in which a repeat was used. Thus, if the word so appears in different places in the Corpus as so, sooo, soooo and Soooooo, the index will include one entry, the word so, in which the repeat is marked by an asterisk (i.e. so*). To facilitate the creation of the index, all cases of three or more lower case letter repeats were exported and sorted. Duplicate messages were identified, and sources of noise (such as URLs, email addresses and random strings which included repeats) were removed. At the end of the process, each entry in the index represents the aggregate of possible repeats (permutations) that formed variations on the standard (normative) spelling of the entry. For a sample of entries from the index see Table 1.

The entries in the index were reviewed and classified by two coders. The coding was inspired by two observations. The first was that a disproportionally large fraction of the repeats seemed to be in words which replicate audible sounds (like mmm or boom). The second observation was that some repeats were much more abundant than other repeats, and that the rare repeats seemed to be repeats which are difficult to articulate or to “speak out”. Accordingly, the first classification was concerned with the role of each entry in the index: is this entry a function or content word (a “lexical word”), or is it an entry that reproduces a sound? Thus, the entry lo*ng would be classified as a lexical word, while the entry agh* would be classified as a sound. The second classification was concerned with whether the repeat could be interpreted as the written representation of a spoken elongation, or not.
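The collapsing of variants such as sooo, soooo and Soooooo into a single entry so*, described above, can be sketched as follows. This is an illustrative assumption rather than the procedure used in the study, where the sorting was partly manual: the helper names (`index_entry`, `build_index`) and the small `lexicon` of normative spellings are hypothetical, and sound entries such as mmm would need to be added to the lexicon by hand.

```python
import itertools
import re
from collections import defaultdict

LETTER_RUN = re.compile(r"([a-z])\1*")  # maximal run of one lowercase letter
MIN_REPEAT = 3  # the index was built from runs of three or more letters


def index_entry(token, lexicon):
    """Return an entry such as 'so*' or 'swee*t', or None if no entry applies."""
    runs = [m.group(0) for m in LETTER_RUN.finditer(token.lower())]
    if not any(len(run) >= MIN_REPEAT for run in runs):
        return None  # no repeat of three or more letters in this token
    options = []
    for run in runs:
        if len(run) >= MIN_REPEAT:
            # A long run may stand for one letter (so) or two (sweet).
            options.append([(run[0], "*"), (run[0] * 2, "*")])
        else:
            options.append([(run, "")])
    for combo in itertools.product(*options):
        if "".join(text for text, _ in combo) in lexicon:
            # Mark each repeated position with an asterisk.
            return "".join(text + mark for text, mark in combo)
    return None


def build_index(tokens, lexicon):
    """Group the observed variants under their index entry."""
    index = defaultdict(set)
    for token in tokens:
        entry = index_entry(token, lexicon)
        if entry is not None:
            index[entry].add(token)
    return dict(index)
```

With a lexicon containing so and sweet, the tokens sooo, soooo and Soooooo would all fall under so*, and sweeeet under swee*t. Tokens whose collapsed form is not in the lexicon return `None` and would be set aside for manual review in this sketch.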
The coder looked at the specific letter that was repeated, in the context of the entry, and classified whether the repeat can or cannot be articulated as an audible elongation of the sound created by the repeated letter. For example, swee*t would be classified as articulable if the coder feels that words such as sweeeet are articulable as the word sweet with an extended e sound in its middle, while the entry help* would be classified as an un-articulable entry if the coder thinks that the repeated p in a word like helppppp is difficult to articulate since the repeated letter p stops the airflow, not allowing the articulation of an elongated sound. In the case of a sound created by more than one letter (e.g. ck or sh), the repeat of even just one of the letters was interpreted as an elongation of the sound (e.g. russssshhhh, rushhhhh and russsssh are all equivalent elongations of the same terminal sound). The categories in both the first classification (lexical word or sound) and the second classification (articulable or un-articulable) are not mutually exclusive, and due to their subjective nature, the classifications were carried out by two independent coders. A small minority of the entries was labelled as unclassifiable by one or both of the raters. These items were removed, and an inter-rater reliability statistic was computed. The average of the percentages reported by both raters was used in the reporting.

Usage analysis

CorpusCruizer was used to extract and examine samples of repeat usage in the Corpus. This included the extraction of various subsets of the corpus using Python regular expressions, the export of these subsets into word processor files and spreadsheet files, and the qualitative and quantitative exploration of these files.

Results

Letter repeats

Appendix A details the frequency of occurrences of sequences of n consecutive identical letters, by case (upper and lower), for n=1-10, as well as cumulatively for n>10.
The relative frequencies of lower case letters and bigrams were compared to those reported by Jones and Mewhort (2004b), to verify that the distribution of character sequences in the Enron Corpus did not differ substantially from other linguistic corpora of news, literature, online content, etc. The exploration of the relative frequencies assisted in the identification of abnormalities in letter frequency distributions which were apparently not the result of an intentional alteration by the creators of the e-mail messages.³

The index of words which included repeats comprised 236 entries. Table 1 lists 20 selected examples of words from the index, alongside an illustrative example of each entry, from the Enron Corpus. The entries in the index were reviewed and classified, as described in the Method section. The inter-rater agreement statistic calculated for the first and second classifications was .701 and .695, respectively. Seventy-two percent of the entries were classified as lexical words, for example, the words way, too and long in:

please call soon. i must speak with you. it's been waaaaaay toooooooo loooong

Nineteen percent of the entries were classified as sounds, such as shhhh or vrrrrrm:

Shhhh.... it's a SURPRISE

when sure enough off in the distance we hear this vrrrrrm getting steadily louder and louder.

------------------------
Table 1 about here
------------------------

And, nine percent of the entries were classified as both words and sounds, such as the onomatopoeic buzz:

Buzz. Buzz. Buzzzzzzz. No sound could be more soothing,

In the classification of the entries into articulable and un-articulable repeats, 86% of the entries were classified as articulable repeats. For example, the repeated o in:

NOOOO! Tell me it isn't so!!!
Or the repeated e and s in:

yeeeeeeeeeeeeesssssssssssssss

Six percent were classified as un-articulable repeats, such as the repeated t in:

I don't know about this, buttttttt you never know

And, in eight percent of the cases, the same entry had both an articulable repeat and an un-articulable one, for example the repeated i’s and g’s in the word big:

I love biiiiigggg breakfasts in NIICE restaurants.

Punctuation mark repeats

Appendix A details the frequency of occurrences of sequences of n consecutive identical exclamation marks, question marks, and periods, for n=1-10, as well as for n>10.

Exclamation mark repeats

Repeats of exclamation marks were used extensively in the dataset. Some examples include:

Have a great trip!!!!!

they may increase their capacity to take TW gas from 300,000 up to 400,000 MMBtu/d!!

A rather extreme version of usage of exclamation (and question) marks was:

neat! don't worry, i'm excited to cook!!!! i'll try to do it in advance!!!! but if i can't, i'll need to cook at your place!!!! that ok?!?!?!!!!!! see you soon!!!!! Best!

As can be seen in Appendix A, the most common exclamation mark repeat was a triple exclamation mark, the next most common repeat was a double exclamation mark, and then, in declining frequency, four, five, six, seven, eight, nine, and ten consecutive repeats.

Question mark repeats

Repeats of question marks were abundant throughout the dataset⁴:

Didn't Pillsbury used to represent PG&E? I know they were virtually captive to Chevron for many years, but I thought PG&E as well????

A nice buck on that side of our property limping in the front, hhmmm....... Jim's deer?????!!!!!!????????

or simply:

Have you had any sleep lately????

As can be seen in Appendix A, the distribution of frequencies is somewhat inconsistent. The most frequent consecutive repeat is of two question marks, and the next in frequency are five, three, four, seven, six, eight, nine and ten repeats.
Period repeats

Repeats of periods were abundant throughout the dataset. Some examples include the more common ellipsis:

So much wasted time and energy and the well is running dry at the same time the animosity towards Enron and marketers seems only to be increasing. I like the proposal to merge the discussions but... You will recall at the WSPP meeting in Colorado I asked

Still doesn't work...I have even rebooted the whole system.

We also find abundant use of more than three points in a row, such as in:

was that last email the rudest thing i've ever said? so sorry......

No problem if you are too busy, believe me I understand.........or whatever the reason might be.

An interesting combination of a larger number of repeated periods and an ellipsis is:

Dow jones headline 2day: Enron says no more layoffs at us operations..........I am laying off 20 percent of my group next week...who will explain the headline to my group?

Lastly, we see use of a double period, where it is not always clear whether it is a typo or a shortened ellipsis:

THanks..I have no revisions.

A word of caution, it is remotely possible that some of these individuals may have enrolled in the very recent past, or have attended and this data not yet reached Dick's spreadsheet, so do double check that they haven't yet enrolled before twisting their arms..

In other cases, it seems like the double period is used as a variation on an ellipsis, or a short pause:

do you think we can get the draft of the CSA before you leave for your offsite..

Frank- want to discuss you coming up to discuss ISDAs.. in the interests of time, its probably best to do a video confo.. I am out next week.. the following week??

As can be seen in Appendix A, the most frequently used period repeat was an ellipsis (three consecutive periods), the next highest frequency was of four consecutive periods, and then two, five, six, seven, eight, nine and ten.
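The run-length tabulations reported for exclamation marks, question marks and periods (n=1-10, plus a cumulative bucket for n>10) can be sketched as below. This is an illustrative assumption about how such a table might be computed, not the code used in the study; the function name `run_length_table` and the `cap` parameter are hypothetical.

```python
import re
from collections import Counter


def run_length_table(bodies, char, cap=10):
    """Count maximal runs of `char` by length, pooling runs longer than `cap`."""
    pattern = re.compile(re.escape(char) + "+")  # one or more consecutive copies
    counts = Counter()
    for body in bodies:
        for match in pattern.finditer(body):
            n = len(match.group(0))
            # Keys are the integers 1..cap, plus the string ">cap" for the tail.
            counts[n if n <= cap else f">{cap}"] += 1
    return counts
```

Calling `run_length_table(bodies, "!")`, `run_length_table(bodies, "?")` and `run_length_table(bodies, ".")` on the message bodies would give one column each of such a table. The counts are only as clean as the input: runs produced by automated processes rather than by the authors of the messages are counted along with the rest.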
Discussion

Usage of Repeats

Following a discussion of emoticons in e-mail, Naomi Baron asks “Are additional paralinguistic cues really necessary for sending satisfactory email messages? Probably not” (Baron, 2000, p. 242). While we agree that these cues might not be necessary for sending satisfactory e-mail messages, our research reveals extensive use of repeats in email messages. Are users being redundant and inefficient, or could these repeats fulfill a role? Here is a partial list of possible roles of repeats.

Repeats seem to indicate the stretching of a word, emulating a stretched out syllable in spoken conversation:

I was in an electronics store the other night... Panasonic has 9" Portable DVD player ( like your sony) with an 8 hour battery... $999.00 US. It is sweeeeeeet.

or, more playfully:

Whaaaassssupppp

To denote a change in pitch in:

Yeeeeeeeeehaaaw!!!!!!!!!!

To denote decreased volume in:

sshhhhhh......let's keep it between us

To denote or to fill a pause:

I'm on vacation so send your changes and new items to her by Friday, December 17th.....Thanks....and Merry Christmas everybody......!!!!!

Hmmmm, I think you're right. Looks like the more we can get done tonite, the better.

Or, to express sounds (paralinguistic alternants (Poyatos, 2002b)):

now that i have a 'temporary' plate for the harley.......vvvvrrrrroooooommmmm..............vvvvvrrrrooooommmmmm!

To denote the musical intonation (of a parody on the song ‘American Pie’):

I never worried on the whole way up
Buying dot coms from the back of a pickup truck
But Friday I ran out of luck
It was the day the NAAAASDAQ died
I started singin'
Bye-bye to my piece of the pie

Or, of a birthday song:

Happy birthday to youuuu
Happy birthday to youuuu
Happy birthday dear

To add intonation:

WOOOOOOOOOOHOOOOOOOOOOOOOOOO, Daddy's getting a new Blue Wave Bay boat!!!! WOOOHOOOO

To express human-made sounds:

And pfffffff, he is away

Such as laughter:

Heeeeeheeee!
Or guttural sounds:

uggggghhhh!!! what a complete and utter pr--k!! i am SO annoyed reading

To denote the rising pitch in an emphasized question:

Interesting.....Is Lay going to retire????

Other repeats seem to focus on producing a visual cue:

lllllllllllllllllooooooooooooooooooovvvvvvvvvvvvvvvvvvvveeeeeeeeeeeeeeeee

Or, to emphasize a question:

Just wondering if you are at work today or not??? If so - we're going to Happy Hour again

In summary, we see extensive and diverse usage of repeats in e-mails of various types. The repeats seem to communicate tempo, pitch, prosody and other paralinguistic elements, as well as to achieve visual emphasis.

Despite the wide range of apparent usage patterns, it is difficult to ascertain objectively which paralinguistic cue is intended. The general challenge of interpreting nonverbal communication is compounded by the fact that the same text can receive a different interpretation by different readers. Given this subjectivity, what evidence do we have that these specific character sequences attempt to convey a specific cue? One line of support comes from literature on nonverbal communication in written literature in general (Poyatos, 2002b), and on the way punctuation is used to convey paralinguistic cues in writing (Poyatos, 2002c). Poyatos’s work suggests a cultural consensus as to how specific paralinguistic cues are communicated in writing. This cultural consensus is employed by authors and poets in general, and by playwrights in particular, to more effectively convey subtle linguistic cues (Poyatos, 2002a). A second line of support comes from our initial findings on the frequencies in the index of Enron Corpus repeats. We see that close to one third of the items in the index were classified as either sounds (e.g. shhhh), or as both a sound and a lexical word (e.g. boooom).
In addition, we see that a significant majority of the letter repeats seem to be articulable, and that about half of those which are un-articulable appear in a word that also includes at least one articulable repeat. Thus, we see an abundance of repeats of vowels and of continuant consonants (e.g. m and s), which, when spoken, allow the continued flow of air, and we see a relative scarcity of repeats of plosive consonants (e.g. p and b), in which the flow of air in the vocal tract is stopped. The claim that letter repeats try to emulate an audible paralinguistic cue is supported by the circumstantial evidence for a link between letter repeats and the way words are articulated in spoken conversation.

Co-occurrence of CMC Cues

In many of the quotations from the Corpus, it is evident that several cues are employed simultaneously: we see multiple examples of letter and punctuation mark repeats, strategic use of uppercase and lowercase letters, alternative spellings, and rich usage of commas, periods, dashes, and other punctuation marks. This concurrent usage is reminiscent of the co-temporality and simultaneous use of verbal and nonverbal cues in spoken communication, and should be explored in this context of text-based CMC. For example, in the following short example we see a repeated period, a repeated question mark, and a repeated letter:

Good Morning. Either you're extremely busy or ......???? Ummmmm.

This co-occurrence is apparent in the samples cited in this paper, as well as in blog searches, where postings with letter repeats seem to be accompanied by an even richer variety of CMC cues such as colored fonts, underlined and bold letters, etc. If this phenomenon of co-occurrence of CMC cues is confirmed by future research, it might indicate that users who wish to make their message more expressive through the usage of CMC cues are more likely to use more than a single CMC cue to emphasize or to fine-tune their message.
Distribution of punctuation mark repeats

A visual inspection of the distributions of the three punctuation mark repeats shows a highly asymmetric distribution, which is reminiscent of heavy-tailed distributions such as the power law distribution (Newman, 2005). In order to corroborate this observation, techniques need to be developed to separate “real” punctuation mark repeats which were created by users who were writing sentences in their emails, from repeats which are a result of other automated mechanisms. For example, question mark repeats often represent a sequence of non-ASCII characters (e.g. foreign language) which were not recognized by the system when the emails were converted to .txt files. After this “cleansing” of the Corpus is carried out, it would be interesting to investigate the distributions in depth: a long tail distribution might indicate more random processes, whereas deviations from such a distribution might indicate special cases.

Generalizability

The Enron Corpus contains emails written by a narrow segment of the population, around the turn of the millennium. This raises the question of the extent to which the findings reported here are generalizable beyond this very specific context. The finding that the proportions reported in Appendix A are in line with the proportions reported by Jones and Mewhort (2004b) suggests that the level of “noise” in the Corpus is acceptable, and that it is not significantly different from other more traditional corpora. But this is only an indication, and until the findings reported here are replicated in other media, their validity is limited. For example, do these cues appear in other CMC media, such as blogs? Blogs represent the writing of a different segment of the population (Huffaker & Calvert, 2005), and it is possible to search both contemporary and historical blog postings from past years.
For example, using the blog search engine Google Blog Search (http://blogsearch.google.com), we can see that the (approximate) number of times the word cool appeared in English language blogs in 2008 is 22,298,038, and that in the same time period, the extended version with 3-10 repeats of the letter o appeared approximately 37,408 times. Or, to take one of the words that combine an articulable and an un-articulable repeat, the word help, we find that it appeared an estimated 4,099,758 times in the same time period, while the versions with 2-10 l’s or with 2-10 p’s appeared an estimated 4,217 and 6,817 times, respectively. One last example from Table 1, the sound yum, appeared in the same time period an estimated 588,105 times, while the version of the same word with 2-10 repeats of the letter m appeared an estimated 49,085 times. These “back of an envelope” results suggest that the phenomena we report from the Enron Corpus are generalizable to other CMC media and to contemporary times.

Theoretical Implications: Cues in CMC

Do these findings improve our understanding of CMC cues, in the context of the concepts of hyperpersonal communication and social information processing? Are character repeats an element of CMC which can help in coding and decoding social information, and in making online communication more personal? We believe that our findings that repeats are used in an analogous manner to nonverbal cues in spoken communication, and that they exhibit many of the traits of nonverbal communication, such as ambiguity and context dependency, strengthen the claim that CMC can be a cue-rich communication medium. It is a medium that can draw on the richness of both spoken and written communication, and in doing so, its more creative users can employ numerous strategies to increase its richness and expressivity.
The use of character and punctuation mark repeats seems to be one of the mechanisms to achieve this goal, other mechanisms being chronemic cues and the use of emoticons, as well as other, less studied cues such as the use of uppercase letters, asterisks, bold and italic letters, colors, fonts, etc.

Limitations and Further Research

This preliminary foray into the realm of character repeats in e-mail messages is limited in many ways, and most of the generalizations and interpretations presented here require substantial additional research in order to support or refute them. The most significant limitation of this paper is its descriptive nature. In order to establish the role of repeats as CMC cues which have a role that is analogous to nonverbal cues in spoken conversation, we need to move beyond descriptions and interpretations that are based on the personal experience of the researchers as CMC users and as CMC scholars. Our claim will remain tentative until samples of these putative cues are experimentally presented to their creators as well as to potential addressees, and the linkage between the letter repeats and paralinguistic cues is established. What was the intention of the creators of these repeats, and how are the repeats used in interpreting the texts?

The Enron Corpus is extensive and diverse, and includes many examples of task oriented messages, as well as of messages of a more interpersonal nature. But its size, its diversity, and the haphazard manner in which the dataset was put together by federal officers who confiscated any e-mail server they were able to find mean that it is also very noisy, full of duplications and non-email character sequences. These could bias the results in unpredictable ways.

The index of 236 items is only a subset of all items in which a repeat is used in the Corpus.
Further extensive work still needs to be carried out on the Corpus in order to isolate items with repeats related to uppercase letters, as well as repeats of two consecutive letters. This work focused on only three types of punctuation marks, and did not look at repeats of other punctuation marks (e.g., dashes) or of other non-letter characters (e.g., asterisks). Further analysis of the Corpus will help in elucidating the role of these repeats in the e-mails.

This study did not explore the location of the repeated letters in the word, or the location of words with repeats in the sentence or in the paragraph. It would be interesting to explore whether these are distributed evenly. For example, it appears that more of the un-articulable repeats occur at the end of words than at the beginning or in the middle. It is possible that the dynamics of typing make it easier to repeat a terminal letter than one at the beginning or in the middle of a typed word.

In several of the e-mail messages, letter repeats were used to express stuttering. The manual sorting of the items that took place during the construction of the index seemed to indicate that this is a rare usage. Consequently, this usage was not discussed further.

Conclusion

Letter repeats and punctuation mark repeats have been described by previous researchers as paralinguistic cues used in text-based CMC, but have not been explored systematically. In this paper we explore the concept of paralinguistic cues, as well as the more general concept of nonverbal cues, in relation to text-based CMC, in an attempt to establish the theoretical relevance of the question of CMC cues in relation to the concepts of social information processing and of hyperpersonal communication. The role of chronemics, and to a lesser extent of emoticons, in enriching CMC with cues has been established by previous studies, but are character repeats an example of an additional category of CMC cues, not yet explored?
In this paper we report on the construction of a tool that facilitates the systematic exploration of letter and punctuation mark repeats, and present findings on the prevalence and the usage of these repeats in a naturalistic corpus of about half a million e-mail messages. We find thousands of examples of repeats, which could fulfill paralinguistic roles such as modified pitch, loudness, tempo, inflection, and more. These thousands of examples of letter repeats were collapsed into an index of several hundred items, each of which describes a word from the Corpus in which a repeat was found. An analysis of the properties of these items showed that a disproportionately large percentage of the items describe sounds, and that many of the repeats were of letters or sounds which are easily articulated aloud as extended syllables. On the one hand, these findings strengthen the intuition that repeats are often an attempt to emulate a vocal paralinguistic cue. On the other hand, the existence of a significant minority of un-articulable repeats, as well as of repeated punctuation marks, reminds us that repeats can also be used as purely visual emphasis tools, not necessarily linked to an audible counterpart.

Taken together, these findings should be treated as a confirmation of the suggestions in the literature that repeats serve as paralinguistic cues. These findings also suggest that we are only beginning to understand the role of CMC cues in online communication, and the interaction between these cues and the verbal content of messages. A better understanding of CMC cues will improve our understanding of the richness and power of text-based CMC.

Acknowledgements

We thank Alberto Gonzalez for his work on the development of CorpusCruizer, Amanda Yentz for her work on the Enron Corpus, and Joe Walther for helpful and inspiring discussions.

Table 1
Twenty examples of items from the Enron Corpus index. Letters with an asterisk represent letter repeats that appear in e-mail messages.

Item          Example
ag*h*         Agghhh, it's so nice to get email from you
all*          he slept alllll afternoon!
ba*d          I'm a baaaad person.
brr*          I bet you love the temperature today (BRRRR).
coo*l         That's soooooo coooool
cra*sh        Bang! Booommm! Craaash!...What was that???
fa*r          dude, that is faaaaar.
free*zing     it is freeeezing in here.
goo*d         As they say in Alabama, "It's all gooood."
hm*           Well, hmmm, lets see.
hel*p*        I forgot which one. Helllllpppp.
lo*ve         we looove the kiddush cup, how should we clean it??????
m*            Mmmmmm, stuffed mushrooms!!!!!
oo*ps         Ooops, forgot the redline for changes in Exhibit A.
p*l*e*a*s*e*  Thanks....and pllleeeeease forgive me and tell me your husband's first name
slow*         It is slowwwww.
so*           You are sooooooo crazy!!!!
swee*t        Check out the new Tyan S2466 motherboard. Sweeeet!.
was*u*p*      HEY WASSSUPPP?
yum*          Complete with chocolate covered strawberries! Yummm.
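The index notation in Table 1 can be illustrated with a short Python sketch. It assumes that the asterisk may be read with its standard regular-expression meaning (zero or more of the preceding letter) and that case is ignored; the table caption suggests this reading but does not spell it out. The helper name matches_item is hypothetical and is not part of the authors' tool.

```python
import re

# A few (index item, example word) pairs copied from Table 1.
EXAMPLES = [
    ("ag*h*", "Agghhh"),
    ("coo*l", "coooool"),
    ("hel*p*", "Helllllpppp"),
    ("p*l*e*a*s*e*", "pllleeeeease"),
    ("was*u*p*", "WASSSUPPP"),
    ("yum*", "Yummm"),
]


def matches_item(item: str, word: str) -> bool:
    """Return True if the whole word is covered by the index item.

    Assumption: the asterisk is the regex quantifier "zero or more of the
    preceding letter", and matching is case-insensitive.
    """
    return re.fullmatch(item, word, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    for item, word in EXAMPLES:
        print(f"{item:14} {word:14} {matches_item(item, word)}")
```

Under this reading an item also matches the unextended base form (coo*l matches cool, and even col), so a count restricted to extended forms would need an additional length check.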
Appendix A
Occurrences of Letter and Punctuation Mark Repeats in the Enron Corpus, by Character and by Number of Repeats

         1        2       3      4      5      6      7      8      9    10    11+
a 48423405    24410     273    100     52     14      9      4      5     2     41
b  8262795    67478      71      5      0      0      0      0      0     0      0
c 19500604   630631     157     50      3    153      0      0      3     0      0
d 20110786   323615      95      7      0     26      0      0      0     0      0
e 68439146  2186204     534     60     16    103      3      7      8     7     14
f  9458663   912662     150    511      5   2827      2      0      0     0      0
g 11703362   140204     147     10      4      0      1      0      0     0      0
h 22452327     8297     155     68     62     10     19     11      5     0     39
i 41851910    14322    2758     11      4     42      1      4     40     0      7
j  1200570     2500      30      2      0      1      0      0      0     0      0
k  5235647     7012      39     27      3      0      0      3      0     0      0
l 18345988  3676890     469     12     13      4      6      0      3     1     14
m 14616403   642590     562     99     30     27      7      1      9     2      5
n 44375225   617294     265     17      5      4      1      0      4     0      0
o 46973573   903036     513    242    170     74     26     37     11    23     96
p 11323667   691563     336     18      3      4      6      0      0     0      0
q   722228      699      26      0      0      0      0      0      0     0      0
r 39292214   723565     426     39     10      6     11      0      1     1      2
s 32093025  2219836     623     61     28     21      0      3     10     0     34
t 48065597  1407495     356     10      3      0      1      0      0     2      0
u 16914169     2433      35     44      2      0      6      3      2     0     16
v  6244554     1817      20      3      3      0      0      0      0     0      3
w  9222261     9089  221888     23     44      1      0      0      0     0      2
x  1536148     5865     164    211     25     43     45      5      4     0    107
y 11757553     1999      29     50      9      1      0      0      1     0     41
z   684745    18777     218     15      8      1      3      0      1     0      0

         1        2       3      4      5      6      7      8      9    10    11+
A  5323480   320437   81619  71385  22193  16119  12093  18198  13697  4528  41714
B  2389952    18561    2828     11      1      0      2      0      0     0      0
C  5653905    44843     392     81      0    343      0      0      0     0      0
D  3144278    13821     104      3      0     19      0      0      0     0      0
E  7004663   252425     589      8     23     37      0     12      1     0     27
F  2267466    20499      88    219      8   2310      0      0      0     0      0
G  1891967     8146      55      3      2      1      0      0      0     0      0
H  2229027     2245      32     38     24      9      3      3      3     0      4
I  4050987    18293    7574      6      2      0      0      0      0     0      7
J  1398623     1651      78     24      0      0      0      0      0     0      0
K   941623     2033      44      3      0      0      0      0      0     0      0
L  2031536    83349      39      9      3      2      1      0      0     0      0
M  4076985    43887     426     15      3      8      2      2      0     0      0
N  3704492    21271      58     35      2     12      0      0      0     0     11
O  3623388    28812      87     25     20     62      1      4      1     7     26
P  3330653    35525     346     40      0     24      0      0      0     0      1
Q   448785     4862     141      0      0      9      0      0      0     0      3
R  3316230    18979     212     10      3      0      2      0      0     4      3
S  5117713    70596     101     90     23      1      0     27      0     0      3
T  6113182    24257      95      6      2      0      0      0      0     0      0
U  1580966     1682      47      8      3      3      0      0      0     0      0
V   794723     1306      16      2      0      0      0      0      0     0      0
W  1717823     1421     478     30      1      0      0      4      0     0      0
X   383982     2478     329    942    151     28     12      4      7     0      7
Y   767325     2037      46    167      4      0      0      0      0     0      1
Z   284323     1373      31     15      0      0      1      0      1     0      1

         1        2       3      4      5      6      7      8      9    10    11+
!   235659    12413   33652   3071   1394    572    299    226    211   128    920
?   715624    44149   16551   6927  23394   1917   5489   1040    866   609  14408
. 10186044    12428   73409  18821   6693   3335   1822   1331    809   813   6449

Endnotes

1 This is a direct quote from the Enron Corpus. For a description of the Corpus, see the Method section. All excerpts from the Enron Corpus in this paper will appear verbatim, as indented and italicized texts.

2 Available from the authors.

3 An examination of a random sample of each category of repeats revealed some letter sequences that are a result of repetitions which are less relevant to this study. Examples include: (i) Letter bigrams that are common in normal text (such as ee or ll) tend to appear more often as letter triplets too (eee, lll), apparently as a result of typos. For example, “Don't worry about tellling her my travel schedule.” or “Thanks for taking time out of your busy schedules to meeet with me on the 28th of December - especially during the holiday season”. (ii) Specific repeats seem to be a result of computer code, especially HTML color codes such as #ffffff or #ffff00, or of other components which appear in e-mails. For example, the sequence www, which is common in links, or sequences of many uppercase A's, which appear frequently in e-mail attachments that have been converted into ASCII text.

4 Many of the question mark repeats were not created intentionally by writers, but are the result of the conversion into text (.txt) files of characters not included in the character set. Such unidentified characters are replaced by question marks. These repeats are included in the table in Appendix A.
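A tally of the kind reported in Appendix A can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' CorpusCruizer tool, whose implementation is not described here: it assumes that each cell counts maximal runs of exactly that length, that uppercase and lowercase letters are counted separately, and that runs of 11 or more characters are pooled. The names count_runs and MAX_BUCKET are hypothetical.

```python
from collections import Counter
from itertools import groupby

# Runs of this length or longer share one bucket (the "11+" column).
MAX_BUCKET = 11


def count_runs(text: str, chars: str) -> Counter:
    """Count maximal runs of each character in `chars`.

    Keys are (character, run length) pairs, with the run length capped at
    MAX_BUCKET. Matching is case-sensitive.
    """
    counts = Counter()
    # groupby yields one group per maximal run of identical characters.
    for char, group in groupby(text):
        if char in chars:
            length = sum(1 for _ in group)
            counts[(char, min(length, MAX_BUCKET))] += 1
    return counts


if __name__ == "__main__":
    sample = "Well, hmmm, lets see. Mmmmmm, stuffed mushrooms!!!!!"
    for (char, length), n in sorted(count_runs(sample, "m!").items()):
        print(char, length, n)
```

The sketch performs no noise filtering, so sequences of the kind described in endnotes 3 and 4 (www in links, HTML color codes, question marks produced by character conversion) would be counted along with intentional repeats.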