Proceedings FONETIK 2005
The XVIIIth Swedish Phonetics Conference, held at Göteborg University, May 25–27, 2005
Edited by Anders Eriksson and Jonas Lindh
Department of Linguistics, Göteborg University, Box 200, SE 405 30 Göteborg
ISBN 91-973895-9-5
© The Authors and the Department of Linguistics
Cover photo and design: Anders Eriksson
Printed by Reprocentralen, Humanisten, Göteborg University

Preface

This volume contains the contributions to FONETIK 2005, the Eighteenth Swedish Phonetics Conference, organized by the Phonetics group at Göteborg University on May 25–27, 2005. The papers appear in the order they were presented at the conference. Only a limited number of copies of this publication have been printed, for distribution among the authors and those attending the conference. For access to electronic versions of the contributions, please look under: http://www.ling.gu.se/konferenser/fonetik2005/

We would like to thank all contributors to the Proceedings. We are also indebted to Fonetikstiftelsen for financial support.

Göteborg in May 2005
On behalf of the Phonetics group
Anders Eriksson, Åsa Abelin, Jonas Lindh

Previous Swedish Phonetics Conferences (from 1986)

I      1986  Uppsala University
II     1988  Lund University
III    1989  KTH Stockholm
IV     1990  Umeå University (Lövånger)
V      1991  Stockholm University
VI     1992  Chalmers and Göteborg University
VII    1993  Uppsala University
VIII   1994  Lund University (Höör)
––     1995  (XIIIth ICPhS in Stockholm)
IX     1996  KTH Stockholm (Nässlingen)
X      1997  Umeå University
XI     1998  Stockholm University
XII    1999  Göteborg University
XIII   2000  Skövde University College
XIV    2001  Lund University
XV     2002  KTH Stockholm
XVI    2003  Umeå University (Lövånger)
XVII   2004  Stockholm University

Contents

Dialectal, regional and sociophonetic variation
Phonological quantity in Swedish dialects: A data-driven categorization (Felix Schaeffler) 1
Phonological variation and geographical orientation among students in a West Swedish small town school (Anna Gunnarsdotter Grönberg) 5
On the phonetics of unstressed /e/ in Stockholm Swedish and Finland Swedish (Yuni Kim) 9
The interaction of word accent and quantity in Gothenburg Swedish (My Segerup) 13

Speaker recognition and synthesis
Visual acoustic vs. aural perceptual speaker identification in a closed set of disguised voices (Jonas Lindh) 17
A model based experiment towards an emotional synthesis (Jonas Lindh) 21
Annotating speech data for pronunciation variation modelling (Per-Anders Jande) 25

Prosody – duration, quantity and rhythm
Estonian rhythm and the Pairwise Variability Index (Eva Liina Asu and Francis Nolan) 29
Duration of syllable-sized units in casual and elaborated Finnish: a comparison with Swedish and Spanish (Diana Krull) 33

Language contact, second language learning and foreign accent
The sound of 'Swedish on multilingual ground' (Petra Bodén) 37
The communicative function of "sì" in Italian and "ja" in Swedish: an acoustic analysis (Loredana Cerrato) 41
Presenting in English and Swedish (Rebecca Hincks) 45
Phonetic aspects in translation studies (Dieter Huber) 49
Scoring children's foreign language pronunciation (Linda Oppelstrup, Mats Blomberg and Daniel Elenius) 51

Speech development and acquisition
On linguistic and interactive aspects of infant-adult communication in a pathological perspective (Ulla Bjursäter, Francisco Lacerda and Ulla Sundberg) 55
Durational patterns produced by Swedish and American 18- and 24-month-olds: Implications for the acquisition of the quantity contrast (Lina Bonsdroff and Olle Engstrand) 59
/r/ realizations by Swedish two-year-olds: preliminary observations (Petra Eklund, Olle Engstrand, Kerstin Gustafsson, Ekaterina Ivachova and Åsa Karlsson) 63
Tonal word accents produced by Swedish 18- and 24-month-olds (Germund Kadin and Olle Engstrand) 67
Development of adult-like place and manner of articulation in initial sC clusters (Fredrik Karlsson) 71

Poster session
Phonological interferences in the third language learning of Swedish and German (FIST) (Robert Bannert) 75
Word accents over time: comparing present-day data with Meyer's accent contours (Linnéa Fransson and Eva Strangert) 79
Multi-sensory information as an improvement for communication systems efficiency (Francisco Lacerda, Eeva Klintfors and Lisa Gustavsson) 83
Effects of stimulus duration and type on perception of female and male speaker age (Susanne Schötz) 87
Effects of age of learning on VOT in voiceless stops produced by near-native L2 speakers (Katrin Stölten) 91

Prosody – F0, intonation and phrasing
Prosodic phrasing and focus productions in Greek (Antonis Botinis, Stella Ganetsou and Magda Griva) 95
Syntactic and tonal correlates of focus in Greek and Russian (Antonis Botinis, Yannis Kostopoulos, Olga Nikolaenkova and Charalabos Themistocleous) 99
Prosodic correlates of attitudinally-varied back channels in Japanese (Yasuko Nagano-Madsen and Takako Ayusawa) 103

Speech perception
Prosodic features in the perception of clarification ellipses (Jens Edlund, David House, and Gabriel Skantze) 107
Perceived prominence and scale types (Christian Jensen and John Tøndering) 111
The postvocalic consonant as a complementary cue to the perception of quantity in Swedish – a revisit (Bosse Thorén) 115
Gender differences in the ability to discriminate emotional content from speech (Juhani Toivanen, Eero Väyrynen and Tapio Seppänen) 119

Speech production
Vowel durations of normal and pathological speech (Antonis Botinis, Marios Fourakis and Ioanna Orfanidou) 123
Acoustic evidence of the prevalence of the emphatic feature over the word in Arabic (Zeki Majeed Hassan) 127

Closing discussion
Athens 2006 ISCA Workshop on Experimental Linguistics (Antonis Botinis, Christoforos Charalabakis, Marios Fourakis and Barbara Gawronska) 131
Additional paper submitted for the poster session
A positional analysis of quantity complementarity in Swedish with comparison to Arabic (Zeki Majeed Hassan and Barry Heselwood) 135

Author index 139


Phonological Quantity in Swedish dialects: A data-driven categorisation
Felix Schaeffler
Department of Philosophy and Linguistics, Umeå University

Abstract

This study presents a data-driven categorisation (cluster analysis) of 86 Swedish dialects, based on durational measurements of long and short vowels and consonants. The study reveals a clear geographic distribution that, for the most part, corresponds with dialectological descriptions. For a minor group of dialects, however, the results suggest mismatches between the quantity system and the observed segment durations. This phenomenon is discussed with reference to a theory of quantity change (Labov 1994).

Introduction

Phonological quantity in Standard Swedish is usually described as being complementary: long vowels in closed syllables are followed by short consonants, while short vowels are always followed by a long consonant or a consonant cluster. This modern system has developed from a quantity system with independent vowel and consonant quantity, where all four possible combinations of long and short segments (VC, V:C, VC: and V:C:) existed. The modern system evolved by shortening of V:C: and lengthening of VC structures. Not all dialects of Modern Swedish have completed this change. Some dialects kept the four-way distinctions.
This applies to a group of dialects in the Finland-Swedish region and in Dalarna in Western Sweden. Another group of dialects abandoned V:C: successions but kept VC structures. This has mainly been reported for large parts of Northern Sweden, but also for some places in Middle Sweden. There are, thus, today at least three different quantity systems in the dialects of Modern Swedish: 4-way systems (VC, V:C, VC: and V:C:), 3-way systems (VC, V:C and VC:) and 2-way systems with complementary quantity (V:C and VC:).

Aims and Objectives

In this study, a data-driven categorisation of Swedish dialects was performed, based on measurements of sound duration from 86 places in Sweden and Finland. This aimed at a categorisation of the dialects that was independent of traditional descriptions and categories, thus providing the possibility to discover new dialectal groups or typological categories. The study is an extension of Strangert & Wretling (2003), Schaeffler & Wretling (2003) and Schaeffler (2005), motivated by an extended data-set and methodological improvements.

Material and Subjects

The material used for this study was part of the SweDia corpus (www.swedia.nu). It comprised hand-segmented data from 976 speakers from 86 recording locations (usually 12 speakers per recording location). One word pair was investigated: tak vs. tack (English: roof – thanks). For historical reasons, these two words should have a V:C vs. VC: structure, even in those dialects where additional quantity patterns exist. In most dialects, the words consist of a voiceless dental or alveolar plosive, a low vowel and a voiceless velar plosive. Segmentation included the vocalic phase and the closure phase of the velar plosive. If present, preaspiration was marked as a separate segment, but was treated as a part of the consonantal closure. Four variables were measured: the durations of the long vowel, the short consonant, the short vowel and the long consonant. To arrive at a measure of central tendency for every recording location, the median of each variable was calculated for every speaker. The median of the speaker medians for every recording location then served as the value that was used for the data-driven categorisation. Medians were used instead of arithmetic means because, unlike the arithmetic mean, the median is not sensitive to outliers.

Data-driven categorisation

The method of choice in this study was a hierarchical cluster analysis, with Euclidean distances as dissimilarity measures and the Ward method as the linkage criterion (see e.g. Gordon, 1999). Hierarchical clustering treats each object initially as a single cluster. In the next steps, the objects and resulting clusters are combined according to the dissimilarity measure and the linkage criterion, until all objects are joined in a single cluster. The increment of the linkage criterion is recorded during the process. It is usually displayed in a so-called dendrogram, which visualises the increment of the linkage criterion as the length of vertical branches in a tree structure. This information can be used for the selection of an adequate number of clusters. In the present study, the recording locations constituted the objects. Each recording location was described by four variables: the median durations of the four segments (the long and short vowels and consonants). It is often recommended in the literature to verify a hierarchical cluster analysis with a non-hierarchical method (Bortz, 1999). This has been done in the present study but did not lead to strongly deviating results. The results of the non-hierarchical method are therefore not reported here.
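To make the procedure concrete, here is a minimal sketch of the median-of-medians preparation and the clustering step, using scipy's hierarchical clustering. It is an illustration under stated assumptions, not the author's actual script: the DataFrame and its column names are hypothetical, and the η² criterion (described under Results below) is included for comparing cluster solutions.

```python
# A minimal sketch, not the study's actual script. Assumes a hypothetical
# pandas DataFrame `df` with per-token durations (ms) in columns
# 'V_long', 'C_short', 'V_short', 'C_long', plus 'location' and 'speaker'.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Median per speaker, then median of the speaker medians per location.
speaker_medians = df.groupby(['location', 'speaker']).median(numeric_only=True)
location_medians = speaker_medians.groupby(level='location').median()

# Hierarchical clustering with Euclidean distances and Ward linkage;
# the dendrogram's branch lengths show the increments of the criterion.
X = location_medians.to_numpy()
Z = linkage(X, method='ward')                        # Ward uses Euclidean distances
dendrogram(Z, labels=list(location_medians.index))   # rendering requires matplotlib

# eta^2 = 1 - SS_within / SS_total, the explained-variance criterion
# used below to compare cluster solutions.
def eta_squared(X, labels):
    ss_total = ((X - X.mean(axis=0)) ** 2).sum()
    ss_within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                    for k in np.unique(labels))
    return 1.0 - ss_within / ss_total

for k in (3, 4, 10):
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(k, 'clusters: eta^2 =', round(eta_squared(X, labels), 2))
```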
Results

The visual inspection of the cluster dendrogram suggested a four-cluster solution as an appropriate number of clusters for the current analysis. Additionally, the parameter η² was calculated, which is usually used in analyses of variance to describe the amount of explained variance, but has also been suggested as a criterion for the evaluation of different cluster solutions (Timm, 2002). The four-cluster solution led to an η² value of 0.68. For comparison, the three-cluster solution showed an η² of 0.32, and a ten-cluster solution showed an η² of 0.86. The large increment of η² from the three- to the four-cluster solution, followed by comparatively low increments, supported the four-cluster solution.

Geographic distribution

Figure 1 shows a stylised map of Sweden and the Swedish parts of Finland. The colour-coding and the numbers show the geographic distribution of the four clusters. The clusters show a clear geographic distribution. Cluster (4), n=7, separates all dialects on the Finnish mainland from the rest of the dialects. Cluster (3), n=23, is mainly restricted to the Northern parts of Sweden, while the clusters (2), n=17, and (1), n=39, are restricted to Southern Sweden, with the exception of an area comprising Jämtland, Ångermanland and Medelpad. The clusters (1) and (2) show less geographic separation, but there is a tendency for cluster (2) to occur mainly in the Southwestern parts and coastal regions.

Figure 1: Geographic distribution of the 4 clusters obtained by cluster analysis (see text).

Durational characteristics of the clusters

Figures 2 and 3 show the distributions of the four segment durations across the four clusters in the form of box-plots. Figure 2 shows the distribution of V: and V durations, figure 3 of C: and C durations. The Finnish cluster (4) is separated from the other clusters by longer V: and V durations and shorter C durations. C: durations in the Finnish cluster (4) are close to the ones in the Northern cluster (3), but clearly longer than those in the Southern clusters (1) and (2). Consequently, the Northern cluster (3), as well, is separated from the Southern clusters (1) and (2) by longer C: durations. However, the Northern cluster (3) shows rather long C durations, which constitutes a clear difference to the short C durations in the Finnish cluster (4). The vowel durations show no clear separation between the Northern cluster (3) and the Southern clusters (1) and (2): the V and V: durations of cluster (3) lie in between the durations of (1) and (2). Cluster (1) shows a tendency to have longer V and V: durations than cluster (2). The same relationship exists between the consonants in clusters (1) and (2). Thus, all segments tend to be longer in cluster (1) than in cluster (2).

Figure 2: V and V: durations per cluster. Grey: short vowels, white: long vowels.
Figure 3: C and C: durations per cluster. Grey: short consonants, white: long consonants.

Relations between vowel and consonant duration

The different durational characteristics of the segments also result in different V:/C and V/C: ratios. Because of the very long V: and very short C, the Finnish cluster (4) shows a deviating V:/C ratio (see table 1). The differences in the V/C: ratios are less pronounced. However, clusters (3) and (4) show lower V/C: ratios than clusters (1) and (2), due to their longer C:. V:/C ratios as well as V/C: ratios are very similar in the Southern clusters (1) and (2).

Table 1: Median V:/C and V/C: ratios in the four clusters.

        Cl. (1)  Cl. (2)  Cl. (3)  Cl. (4)
V:/C    1.15     1.13     1.00     1.83
V/C:    0.53     0.57     0.41     0.44

Relations between long and short segments

The ratios between the durations of the long and the short segments are shown as box-plots in figure 4. The figure shows that the consonant ratios have a wider range than the vowel ratios, but are generally lower. Some dialects show consonant ratios close to 1.0, hence "long" and "short" consonants have approximately the same duration, while the highest values are around 2.0–2.2, showing long consonants that are twice as long as short consonants. The median of the distribution is 1.22, and 50% of the values lie between 1.14 and 1.32. The vowel ratios, on the other hand, are rarely lower than 1.4; the highest values are similar to the highest values for the consonant ratios, around 2.0–2.2. The median is 1.82, and 50% of the values lie between 1.7 and 1.9.

Figure 4: V:/V and C:/C ratios per cluster. Grey: vowel ratios, white: consonant ratios.

Cluster (4) is clearly separated from the rest of the dialects by the consonant ratios. All dialects in cluster (4) show values higher than 1.6, while all other dialects lie below this value. The dialects in cluster (3) show somewhat higher C:/C ratios than those in the clusters (1) and (2), although there is a lot of overlap. The median for cluster (3) is 1.29, for cluster (2) it is 1.14 and for cluster (1) it is 1.19.
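As a compact worked example of the ratio measures above (the duration values are invented for illustration and do not come from the paper):

```python
# One location's four median durations in ms (invented values).
v_long, c_short, v_short, c_long = 190.0, 105.0, 110.0, 135.0

print('V:/C =', round(v_long / c_short, 2))   # cf. Table 1, first row
print('V/C: =', round(v_short / c_long, 2))   # cf. Table 1, second row
print('V:/V =', round(v_long / v_short, 2))   # vowel quantity ratio (figure 4)
print('C:/C =', round(c_long / c_short, 2))   # consonant quantity ratio (figure 4)
```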
Discussion

The cluster analysis showed results similar to the analysis presented in Schaeffler (2005). There are three main groups with a clear geographic distribution: a Finnish cluster, which includes all dialects on the Finnish mainland, a Northern cluster, mainly concentrated in Northern Sweden from Lappland to Gästrikland, and two Southern clusters. The consideration of two Southern clusters instead of one was motivated by a major increase in the value of η². The durational characteristics, however, do not present much support for such a partitioning. All segments in cluster (1) are longer than in cluster (2), which leads to very similar segment ratios (see table 1 and figure 4). This, together with the lack of a clear geographic distribution, suggests that speech rate effects might be responsible for the difference.

The geographic distribution corresponds with the traditional descriptions of the Swedish dialects as outlined in the introduction. 4-way distinctions are mainly found in the Finnish region, 3-way distinctions frequently in the Northern regions and 2-way distinctions in the Southern Swedish areas (see e.g. Wessén 1969, Riad 1992). In Schaeffler (2005), the observed durational differences were attributed to these functional differences. In 4-way distinctions, where V:C and VC: sequences contrast with VC and V:C:, clear durational distinctions of vowels and consonants are expected. This corresponds with the observed durations for the Finnish region. All other dialects, however, show rather low consonant ratios, while a clear durational difference between the vowels is maintained.

A further aspect of the results deserves attention: the geographic distribution resulting from the cluster analysis is almost too clear-cut. According to dialectological descriptions, some dialects in the Finnish cluster (4) show 3-way systems (Ivars 1988), comparable to those in the Northern Swedish regions. In spite of these functional congruences between parts of Northern Swedish and Finland-Swedish, the segment durations differ. There is, however, strong evidence that the loss of the V:C: structure in the Finnish areas is a relatively recent phenomenon (Ivars, 1988). If this is correct, then 4-way durational values in 3-way dialects could be attributed to the recent loss of a richer quantity system. While the phonological system is a 3-way system, the phonetic durations still reflect a 4-way system.

Labov (1994) has argued that quantity changes spread via "lexical diffusion", i.e. phonetically abrupt but lexically gradual. Thus, when V:C: recently became VC: in certain Finnish dialects, the change was presumably phonetically "abrupt": a speaker adopting the change replaces V: in V:C: by the corresponding V, thus replacing a V:C: structure with a VC: structure that has the same durational characteristics as original VC: structures. The change does not necessarily affect all words with the matching environment at the same time ("lexically gradual"), but might gradually spread through the lexicon until all V:C: sequences have vanished from the dialect. This scenario would provide an explanation for the observed mismatches between phonological system and phonetic duration in some of the Finnish dialects.

References

Bortz J. (1999) Statistik für Sozialwissenschaftler. 5th edition. Berlin: Springer.
Gordon A.D. (1999) Classification. 2nd edition. Boca Raton: Chapman & Hall.
Ivars A.-M. (1988) Närpesdialekten på 1980-talet. Studier
i nordisk filologi, volume 70.
Labov W. (1994) Principles of Linguistic Change, Vol. I. Cambridge: Blackwell.
Riad T. (1992) Structures in Germanic Prosody. Stockholm: Univ.
Schaeffler F. and Wretling P. (2003) Towards a typology of phonological quantity in the dialects of Modern Swedish. Proc. 15th ICPhS Barcelona, p. 2697–2700.
Schaeffler F. (2005, forthcoming) Drawing the Swedish quantity map: from Bara to Vörå. Proc. Nordic Prosody IX. Lund.
Strangert E. and Wretling P. (2003) Complementary quantity in Swedish dialects. Proc. Fonetik 2003. Umeå.
Timm N.H. (2002) Applied Multivariate Analysis. New York: Springer.
Wessén E. (1969) Våra folkmål. Lund: C E Fritzes Bokförlag.


Phonological variation and geographical orientation among students in a West Swedish small town school
Anna Gunnarsdotter Grönberg
Institute for Dialectology, Onomastics and Folklore Research, Göteborg, Sweden

Abstract

This paper presents main results from a Ph.D. thesis on sociolinguistic variation among students in an upper secondary school in Alingsås, a town of 25,000, northeast of Göteborg. Phonological variants are found to be associated with traditional local dialect, regional and supraregional standard, Göteborg vernacular, and general and Göteborg youth language. Correlations with demogeographical areas generally show a pattern going from southwest to northeast (along the E20 highway and the railway from Göteborg). One area does not fit into the continuum, Sollebrunn (NW of Alingsås), where particularly female informants tend to use standard and innovations to a surprisingly high extent. Gender is the second most important social factor, but in different ways. There are major differences from one social group to another when it comes to expressing gendered identity through linguistic means.

Introduction

This article presents some of the core results from my Ph.D.
thesis (Grönberg 2004), a study of sociolinguistic variation among students from five municipalities, all attending an upper secondary school in Alingsås, a town of 25,000, northeast of Göteborg, Sweden.¹ The main aim of the thesis was to study covariation between linguistic variation and social identity, and to relate it to language and dialect change. A number of questions were raised, of which the following will be discussed here:
– To what extent does linguistic variation depend on the informant's orientation towards the place where (s)he lives or other places?
– How do the findings from the upper secondary school in Alingsås differ from results from comparable informants in Göteborg?
– Which social factors are most important for linguistic choices?

Material

The material studied consists of tape-recorded interviews with 97 students at the Alströmergymnasium, which at the time of recording in 1998 had a total of 1400 students in 14 different national study programmes. These students come from quite a large area surrounding Alingsås. The informants in the study represent five municipalities and ten different study programmes, and there is an even gender distribution. A sample of the results was also compared with the GSM corpus², consisting of recordings of group conversations with 105 upper secondary school students in Göteborg 1997–1998. In this corpus too, the informants are distributed evenly with respect to gender, study programmes and geographical areas in Göteborg and surroundings.

Method

The informants were categorized according to social variables representing different aspects of background and identity: gender, type of study programme (vocational, intermediate, preparatory for university), demogeographical area (based on the extent of urbanization in the five municipalities), Alingsås neighbourhood (divided on the basis of socio-economic factors), and lifestyle based on two-dimensional mapping (concerning taste, leisure, mobility, plans for the future, etc.). The lifestyle analysis both complements and includes traditional sociolinguistic variables.³

Eight linguistic variables were analyzed extensively: four phonological, two lexical and two morpho-phonological. Instances of variants in the recorded interviews were counted manually, and frequencies of variants were correlated statistically with social variables on a group level, for example as sketched below. Examples from analyses of three phonological variables will be used in the following discussion.
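A minimal sketch of that correlation step, under stated assumptions: the thesis does not specify its statistical tooling, and the counts below are invented placeholders, not the study's figures.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts of manually tallied U variants per area (rows: areas,
# columns: variants); the real study's numbers differ.
counts = pd.DataFrame(
    {'lowered U': [12, 30, 8],
     'standard U': [25, 10, 22],
     'diphthongized U': [18, 3, 15]},
    index=['Lerum', 'Herrljunga', 'Sollebrunn'])

# Chi-square test of association between area and variant choice.
chi2, p, dof, expected = chi2_contingency(counts)
print(counts)
print(f'chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}')  # significant if p < 0.05
```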
Geographical orientation and linguistic variation

One of the main issues was to find out to what extent linguistic variation depends on the informant's orientation towards the place where (s)he lives, towards Göteborg, Stockholm, or other places. Does the influence stem from the Göteborg dialect, ideas about standard Swedish, or from an ideal national youth language? The variants studied can be related to several layers of spoken Swedish: traditional local dialect (västgötska), regional and supraregional standard, traditional Göteborg dialect, Göteborg youth language, and general youth language. The question concerning the origin of a variant in a certain level or variety is not always easy to answer, as there are several cases of identical forms in different layers. One such example is the variable Ö (long /ö/, except when preceding /r/), where the variant 'closed Ö' [ø:] (grön [grø:n] 'green') can be found in both local dialect and standard language, while at the same time contrasting with the innovation 'open Ö' [œ:] (grön [grœ:n]), which can be found in both traditional Göteborg dialect and in general youth language.

The variation found can be interpreted as attributable to differences in geographical orientation in the various groups. Some groups seem quite locally rooted, while others are more oriented towards Göteborg, and some have more far-reaching aspirations – not necessarily towards Stockholm, but towards major urban areas in general. There is hardly any orientation towards other areas than these, except in the case of a few informants who are drawn towards other places in West Sweden. The linguistic influence as seen in the use of standard forms and innovations is associated with an orientation towards both youth and standard language, on both a regional and supraregional level.

When correlated with geographical areas, the linguistic variables generally show a pattern going from southwest to northeast. The frequency of both standard forms and innovations grows higher the closer the informants live to Göteborg, and the two peripheral areas Lerum (SW) and Herrljunga (NE) are, in almost every case, the extremes with the highest frequencies of non-local and local forms. Even in central Alingsås, a similar tendency can be found, with informants in the NE parts of town showing a high frequency of local forms, while those living in Centre and SW are more prone to using standard forms and innovations.

One area does not, however, fit into the dialect continuum discernible along the E20 highway and the railway from Göteborg through Lerum, Alingsås and Herrljunga: in Sollebrunn (NW of Alingsås), some distance away from both the highway and the railway, the nine female informants in particular tend to use standard forms and innovations to a much higher extent than the informants in Herrljunga, situated at the same distance from Göteborg. (Results are statistically significant at a five percent level.) One example is the variable U, which represents three variants of the pronunciation of long /u/. The variants are the local 'lowered U' [ʉ̞:] (hur [hʉ̞:r] 'how'), the 'standard U' [ʉ̟:] (hur [hʉ̟:r]), and the 'diphthongized U' [ʉ̟ᵝ] (hur [hʉ̟ᵝr]), which is considered an innovation from Göteborg in this study. As can be seen in figure 1, informants in Sollebrunn and Lerum have a similar frequency of the Göteborg innovation 'diphthongized U'.

Figure 1. Frequency of U variants (lowered, standard, diphthongized) in geographical areas.

Elements in the Sollebrunn informants' adherence to groups point towards a stronger need for identification with other geographical areas than their own. I hope in future research to be able to go into greater detail with regard to the attitudes and values of these informants, to find out why they differ from the overall pattern.

Comparing results from Alingsås and Göteborg

How do the findings from the upper secondary school in Alingsås differ from findings from comparable students in Göteborg? The answer to this question is in some ways already given above. The distribution pattern going from SW to NE, as seen between Lerum and Herrljunga, is supplemented and strengthened through a comparison with the GSM corpus. For three of the phonological variables, the results are unambiguous, with the frequency of innovations being substantially higher in the Göteborg informants than in the Alingsås informants. The curves, which show a strong slant between NE and SW, show an even steeper slant between the areas of Lerum and Göteborg. One example is the variable I/Y, as illustrated in figure 2. The variable I/Y represents three variants of the pronunciation of long /i/ and /y/. The local variant is the 'lowered I/Y' [ɪ:]/[ʏ:] (fin [fɪ:n] 'nice', typ [tʏ:p] 'like, sort of'), the standard variant is the 'standard I/Y' [i:]/[y:] (fin [fi:n], typ [ty:p]), and the Göteborg innovation is the 'fricativized I/Y' [i:ᶻ]/[y:ᶻ] (fin [fi:ᶻn], typ [ty:ᶻp]). The results of correlation with geographical areas form a steep, increasing curve for the Göteborg 'fricativized I/Y' from Sollebrunn via Lerum to Göteborg, and a steeply decreasing curve for the local 'lowered I/Y' between the same areas.

Figure 2. Frequency of I/Y variants (lowered, standard, fricativized) in geographical areas including Göteborg.

Further spreading of Göteborg features?

One interesting question is whether innovations that are spreading in Göteborg will spread even further in the region, which would change the spoken language of the Alingsås area even more towards that of Göteborg, as has already happened in, for instance, Kungälv and Kungsbacka, some 30 km to the north and south of Göteborg (Grönvall 2005), or whether the variants close to standard will take over. Thelander (1979) and Westerberg (2004) describe the rise of a West Bothnia regional standard, where forms which are common to dialects in a larger area survive, while more local forms disappear.
The question is whether a similar development might take place in relation to certain West Swedish forms. The variable ÖR, for instance, might suggest such a thing. ÖR represents the pronunciation of long /ö/ before /r/, with two possible variants: the traditional, local 'closed ÖR' [ø:r] (göra [jø:ra] 'do'), and the standard 'open ÖR' [œ:r] (göra [jœ:ra]). The 'closed ÖR' is present in a large area including the region of Västergötland (but not Göteborg or the coastal regions), and this feature shows a high frequency in central Alingsås and in all three of the areas to the north (Herrljunga, Gräfsnäs and Sollebrunn), as shown in figure 3. This points to the possibility of the 'closed ÖR' surviving as a part of a Västgöta regional standard, distinguishable from the Göteborg regional standard, in which the 'open ÖR' is standard.

Figure 3. Frequency of ÖR variants (closed, open) in geographical areas.

One of the lexical variables displays a somewhat different distribution, but when a study of adolescents from Stockholm is added to the comparison, the results point to a spreading of the innovation 'typ' ('like') from Stockholm to Göteborg in the first place, and then to Alingsås. Its use has stagnated in favour of other discourse markers in the two major cities, while upper secondary school students in Alingsås and the catchment area still use 'typ' to quite a large extent. Differing patterns of distribution for innovations (three phonological and one lexical) can be interpreted as two types of regionalization taking place at the same time. The first type consists of a gradual spread from the regional centre of Göteborg towards Alingsås and then further north, and the other type consists of a form of urban jumping, where forms spread from the capital to Göteborg, and then on to the town of Alingsås, and from there to surrounding areas (cf. Sandøy 1993:119).

Discussion

Which social factors are most important for linguistic choices? Is it possible to identify groups with common social identities in order to explain differences in linguistic behaviour? The social variables used in the study – gender, study programme, demogeography, and lifestyle – all show co-variation with linguistic variables as well as with each other, to some extent. The hypotheses formulated were not always verified, but this was not attributable to a lack of variation but to the fact that this variation was not the predicted variation. For the eight linguistic variables analyzed, different social factors are important, but the factor which is most often salient is demogeography. After that, gender can be seen as second most important, but in two different ways. On the one hand there are general differences between girls and boys seen as groups; on the other hand there are differences between different groups when gender is combined with study programme, and also in the lifestyle analysis. This proves that there are major differences from one social group to another when it comes to expressing gendered identity through linguistic means.
The most salient geographical variation can be found in the phonological variables, while the lexical variables co-vary to a higher extent with gender, programme type, and lifestyle. As was discussed above, a distribution pattern going from SW to NE is discernible, and this is not only related to distance in kilometres to Göteborg, but also to the dominant lifestyles and values in the adult population in the different areas. The ones who stand out most clearly are the girls in Sollebrunn, with respect to both the demogeographical and the socio-economic categorization of the informants. The lifestyle analysis is an attempt to supply extra information to combine with the traditional social variables, and there is good potential for developing this method further in studies of linguistic change. It provides a better understanding of the informants' social background and aspirations, both in that it takes into account more aspects and in that it makes it possible to move away from the hierarchical way of thinking which characterizes e.g. social indexation, and thus to capture more aspects of how social identities are constructed in contemporary society.

Notes

1. The Swedish upper secondary school, gymnasium, gives courses of three years' duration for students having completed nine years of school, thus having reached the age of 15-16 years. About 98 percent of Swedish 16-year-olds apply to the gymnasium.
2. Gymnasisters språk- och musikvärldar, The Language and Music Worlds of High School Students. See Norrby & Wirdenäs (1998).
3. The lifestyle analysis was based on Ungdomsbarometern (1999). Cf. Bourdieu (2002) and Dahl (1997).

References

Bourdieu, Pierre (2002) [1984]. Distinction. A Social Critique of the Judgement of Taste. Reprint. London: Routledge & Kegan Paul Ltd.
Dahl, Henrik (1997). Hvis din nabo var en bil. En bog om livsstil. København: Akademisk Forlag A/S.
Grönberg, Anna Gunnarsdotter (2004). Ungdomar och dialekt i Alingsås. (Nordistica Gothoburgensia 27.) Göteborg: Acta Universitatis Gothoburgensis.
Grönvall, Camilla (2005). Lättgöteborgska i Kungsbacka. En beskrivning av några gymnasisters språk 1997. Göteborg. Unpublished manuscript.
Norrby, Catrin & Karolina Wirdenäs (1998). The Language and Music Worlds of High School Students. In: Pedersen, Inge Lise & Jann Scheuer (eds.), Sprog, Køn – og Kommunikation. Rapport fra 3. Nordiske Konference om Sprog og Køn. København, 11.–13. Oktober 1997. København: C.A. Reitzels Forlag, 155–164.
Sandøy, Helge (1993). Talemål. Oslo: Novus.
Thelander, Mats (1979). Språkliga variationsmodeller tillämpade på nutida Burträsktal. Del 1 & 2. (Studia Philologiae Scandinavicae Upsaliensia 14:1 & 14:2.) Uppsala: Acta Universitatis Upsaliensis.
Ungdomsbarometern (1999). Livsstil & fritid. Stockholm: Ungdomsbarometern AB.
Westerberg, Anna (2004). Norsjömålet under 150 år. (Acta Academiae Regiae Gustavi Adolphi LXXXVI.) Uppsala: Kungl. Gustav Adolf Akademien för svensk folkkultur.


On the phonetics of unstressed /e/ in Stockholm Swedish and Finland Swedish
Yuni Kim
Department of Linguistics, University of California at Berkeley
Abstract

Dialects of Swedish vary in the pronunciation of unstressed /e/ in different phonological environments. In this pilot study, Stockholm Swedish is compared with several Finland Swedish dialects. Stockholm and one Åland dialect lower and back /e/ before [n], while Helsinki and most Nyland dialects lower and back /e/ before [r]. The data provide evidence for the sociolinguistic relevance of unstressed vowel pronunciation.

Introduction

Stressed short [e] and [æ] are in complementary distribution in most Swedish dialects: the allophone [æ] occurs before [r], and [e] occurs in all other environments. In Finland Swedish, transcription conventions (e.g. in Harling-Kranck 1998) and informal reports by native speakers suggest that the same distribution may hold in unstressed syllables as well. Since it is not clear how widespread this phenomenon is, a pilot study was conducted to investigate the phonetics of unstressed /e/ across several dialects. The following questions were addressed: 1) How is unstressed /e/ pronounced in Stockholm Swedish? 2) Are unstressed [e] and [æ] in fact in complementary distribution in standard Helsinki Swedish? 3) Do rural Finland Swedish dialects pattern with Helsinki, or with Stockholm – or do they show their own patterns? Finally, 4) Can the regional differences be explained?

Materials and methods

For the Stockholm and Helsinki speech samples, 5-minute news broadcasts from each city were recorded from the Internet into Praat at 22,050 Hz. One male and one female newscaster were recorded for each variety. The audio files had been compressed for Internet broadcast, but it was assumed that the compression would not have affected the lower frequencies where the vowel formants were located. The data for the rural Finland Swedish dialects consisted of the audio recordings in Harling-Kranck (1998), a transcribed collection of spontaneous narratives by speakers born around 1900. The scope of this study was limited to the southern dialects, and of these, 10 dialects (represented by one speaker each) had enough tokens of unstressed /e/ for consistent patterns to arise. From west to east, these were: Föglö and Kökar in eastern Åland; Houtskär in western Åboland; Tenala and Karis in western Nyland; Sjundeå and Helsinge in central Nyland; and Borgå, Lappträsk, and Pyttis in eastern Nyland.

F1 and F2 values were measured for unstressed tokens of the phoneme /e/ in word-final syllables. Using Praat, measurements were taken at a stable point at or near the midpoint of each vowel. Formants were calculated by LPC, and FFT spectra were also consulted in cases of inconsistency between the LPC value and visual inspection of the spectrogram. Excluded from the measurements were extremely reduced tokens with unclear formant structure, and tokens with dramatic formant movement throughout the course of the vowel (e.g. a linear drop of 300 Hz in F2). These criteria had the effect of excluding most tokens following velars, but due to the small total number of tokens it seemed safer for purposes of comparability to only include vowels with reasonably stable formant values.
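The measurement step can be illustrated with the praat-parselmouth package, which scripts Praat from Python. This is a sketch under stated assumptions, not the study's actual workflow (the measurements described above were taken manually in Praat); the file name and midpoint time are hypothetical placeholders.

```python
import parselmouth  # pip install praat-parselmouth

snd = parselmouth.Sound('dialect_recording.wav')   # hypothetical file
formant = snd.to_formant_burg()                    # Praat's Burg LPC analysis
t = 1.234                                          # midpoint of an unstressed /e/ token (s)
f1 = formant.get_value_at_time(1, t)               # F1 in Hz (nan if undefined)
f2 = formant.get_value_at_time(2, t)               # F2 in Hz
print(f'F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz')
```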
Results

Preliminary inspection of the data indicated three categories of environments relevant to the phonetic realization of unstressed /e/: preceding [n], preceding [r], and elsewhere (usually word-final, or preceding [t] or [s]). Below, tokens for these environments are graphed in each dialect. The ellipses are marked N, R, and E, respectively.

Stockholm

Unstressed /e/ in Stockholm Swedish was generally realized as schwa, but a pattern emerged for both Stockholm speakers that the schwa had higher F1 and lower F2 when preceding [n] than in other environments. There is little overlap between the [n]-environment tokens and the other tokens in the F2 vs. F1 plots in Figs. 1 and 2.

Figure 1. Stockholm newscaster. Female, rec. 2005.
Figure 2. Stockholm newscaster. Male, rec. 2005.

Helsinki

The Helsinki newscasters had a very different pattern from Stockholm. Both speakers categorically lowered and backed unstressed /e/ before [r], as in Fig. 3. This result seems to confirm the existence of [e]~[æ] allophony in unstressed syllables, at least on a phonetic level.

Figure 3. Helsinki newscaster. Female, rec. 2005.

Åland and the Åbo archipelago

The next question is which pattern we find in dialects of Åland and the Åbo archipelago, geographically located halfway between Stockholm and Helsinki. Previously part of Sweden, Åland became an autonomous part of Finland in 1921 and maintains contacts with both countries. Thus it is not immediately obvious whether Åland dialect speakers would orient themselves more toward a Central Swedish or Finland Swedish norm in unstressed vowel pronunciation. The speaker from Föglö in eastern Åland has the Stockholm pattern, as shown in Fig. 4. On the other hand, the speakers from Kökar and Houtskär (females, born 1900 resp. 1899) show a third type of pattern, where /e/ has lower F2 before [r], but with (apparently) less of a difference in F1 between the environments.

Figure 4. Föglö, eastern Åland. Male, b. 1901.
Figure 5. Kökar, eastern Åland. Female, b. 1900.

Nyland

In most rural villages of Nyland (the province where Helsinki is located), the Helsinki pattern obtains: before [r], unstressed /e/ approaches an [æ]-like pronunciation.

Figure 6. Tenala, western Nyland. Female, b. 1885.
Figure 7. Borgå, eastern Nyland. Male, b. 1900.

Easternmost Nyland, on the other hand, presents a bit of a mystery. The Lappträsk speaker (not shown here) has a tendency to lower and back /e/ before [r], but unlike in other parts of Nyland there is significant overlap with non-[r]-environment tokens. The Pyttis speaker has an even more divergent pattern, illustrated in Fig. 8. Since the easternmost dialects have undergone heavy phonetic influence from Finnish, it may be possible to relate these divergent patterns to Bergroth's (1917) observation that it is characteristic of Finnish-accented Swedish to pronounce unstressed -er as [er] instead of [ær]. The easternmost Nyland dialects should be investigated further.

Figure 8. Pyttis, western Kymmene (E. Nyland dialect group). Male, b. 1895.
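The Discussion below notes that normalization of the vowel formants would allow direct comparison among speakers. One standard option, not prescribed by the paper, is Lobanov z-score normalization; a minimal sketch with invented values:

```python
import numpy as np

# Hypothetical F1/F2 measurements (Hz) from one speaker's /e/ tokens.
f1 = np.array([450., 520., 600., 480.])
f2 = np.array([1800., 1650., 1500., 1750.])

# Lobanov normalization: z-score each formant within one speaker's system,
# so that speakers with different vocal tract sizes become comparable.
f1_norm = (f1 - f1.mean()) / f1.std()
f2_norm = (f2 - f2.mean()) / f2.std()
print(f1_norm.round(2), f2_norm.round(2))
```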
Discussion and conclusion

Although recordings of only one or two speakers per dialect were available, multiple speakers in each region showed approximately the same patterns. Thus the results, though preliminary, seem to point to robust regional differences in the quality and distribution of unstressed tokens of /e/.

It may be possible to explain some of this variation. As mentioned in the introduction, the Helsinki pattern, where [e] and [æ] are in complementary distribution, is parallel to an identical alternation in stressed syllables in many Swedish dialects. The fact that the alternation seems to have generalized to unstressed syllables precisely in Finland Swedish may perhaps be attributable to contact with Finnish, which tends not to reduce vowel quality in unstressed syllables. That is, speakers of Finland Swedish may have acquired a habit of articulating the full or nearly-full quality of unstressed vowels, which could have triggered the [e]~[æ] alternation. The [æ] of Finland Swedish is noticeably more open than in the Swedish of Sweden (Reuter 1971), which also seems to contribute to the salience of the allophony. This hypothesis must remain as speculation, however, pending further data on vowel reduction in Finland Swedish (as well as evidence that the Helsinki pattern really is an innovation and not archaic).

Once a wider range of dialects is studied, it may be possible to assemble a more coherent picture of cross-dialectal variation in unstressed vowel pronunciation. In future research, comparing measurements of unstressed /e/ to the rest of the vowel system, for example to stressed realizations of [e] and [æ], could shed further light on this topic. Normalization of the vowel formants would also allow direct comparison among speakers. Finally, these results have more general implications. Although sociophonetic research has often focused on stressed vowels (e.g. Labov 1994), the evidence presented here suggests that unstressed vowels can also have sociolinguistic relevance.

Acknowledgements

I would like to thank Leanne Hinton for valuable advice and discussion. This research has been supported by a Fulbright Grant and a Jacob K. Javits Graduate Fellowship.

References

Bergroth H. (1917) Finlandssvenska: handledning till undvikande av provinsialismer i tal och skrift. Helsinki: Holger Schildts.
Harling-Kranck G. (1998) Från Pyttis till Nedervetil: tjugonio dialektprov från Nyland, Åboland, Åland och Österbotten. Helsinki: Svenska Litteratursällskapet.
Kuronen M. and Leinonen K. (2000) Fonetiska skillnader mellan finlandssvenska och rikssvenska. Svenskans beskrivning 24, 125-138. Linköpings universitet.
Labov W. (1994) Principles of Linguistic Change: Internal Factors. Oxford: Blackwell.
Reuter M. (1971) Vokalerna i finlandssvenskan: en instrumentell analys och ett försök till systematisering enligt särdrag. Studier i nordisk filologi 58, 240-249. Helsinki: Svenska Litteratursällskapet.


The interaction of word accent and quantity in Gothenburg Swedish
My Segerup
Department of Linguistics and Phonetics, Lund University, Lund
E-mail: [email protected]

Abstract

According to conventional wisdom, the word accent distinction in Swedish (dialects) is maintained chiefly by a difference in the timing of the word accent gesture (Gårding, 1973). Gothenburg Swedish, however, does not obey the norm, since both pitch height and timing contribute to the word accent distinction in this dialect (Segerup, 2004). In Gothenburg Swedish both word accents have a fall on the stressed vowel, which makes the pitch contours strikingly similar (Segerup, 2004). Up until now the material investigated has consisted of contrastive words in which the stressed vowel is phonologically long. In the present production study we proceed with word-pairs where the stressed vowel is phonologically short, for comparison. The acoustic analysis involves measurements of fundamental frequency (F0) and segment durations in five speakers' production of seven word-pairs altogether. The results show a significant difference in the duration of the short stressed vowel between accent 1 and accent 2, and, further, that word accent has effects on vowel duration.

INTRODUCTION

The present paper investigates the interaction between word accent and quantity in Gothenburg Swedish. Minimal pairs with accent 1 and accent 2 and with either long or short stressed vowel are examined. How are pitch height and timing affected when the voiced portion of the syllable is minimized by having a short vowel followed by a voiceless consonant, as opposed to a more sonorant environment, i.e. a long vowel or sonorant consonant? This is related to the general question of truncation or compression of the f0 contour in an intonationally unfavourable environment (see e.g. Bannert & Bredvad-Jensen, 1975).

Background

According to the Swedish tonal typology (Gårding, 1973, Gårding & Lindblad, 1975, Bruce & Gårding, 1978) the timing/alignment of the word accent gesture is essential to the Swedish word accent distinction. The traditionally described word accent pattern of the West Swedish prosodic dialect type (see Bruce & Gårding, 1978) involves low pitch on the stressed vowel for accent 1 words and a peak on the stressed vowel for accent 2 words in focal position. Bruce (1998) has suggested that Gothenburg Swedish is characterized by two-peaked pitch contours for both word accents, with an earlier timing in accent 1. A previous production study (Segerup, 2004) confirms that Gothenburg Swedish accent 1 deviates from the generally accepted West Swedish accent 1 pattern through having a fall on the stressed vowel. Furthermore, the fall of accent 2 is only marginally later than that of accent 1, meaning that the expected timing difference between accent 1 and accent 2 is less than in other dialect types.
Consequently, the overall shape of the pitch contours is strikingly similar, yet they are perceptually distinct (Segerup, 2004, Segerup & Nolan, forthc.).

Pitch height and timing – collaborating cues

Perhaps the most unexpected finding of the production study summarized above is that accent 2 was shown often to involve higher pitch in the stressed vowel than accent 1. The result of the acoustic analysis shows that the word accent distinction is maintained by comparatively small differences in the timing and height of the fall, and further that speakers apparently use different strategies in order to maintain the distinction. Some speakers rely primarily on one cue or the other; other speakers rely on both. In order to find out whether listeners attend to pitch height or disregard it, most likely as an unintentional consequence of producing the alignment difference, a perception experiment was carried out (Segerup & Nolan, forthc.). The stimuli used in the experiment were resynthesized from natural utterances with alignment and pitch height varied systematically. Twenty-four native speakers of
The total number of tokens (including all 5 speakers’ repetitions) in the present analysis varies from approximately 15 to 28 tokens per word in each speaking style. Purpose The acoustic analysis includes segments’ duration and seven measurements of pitch values at specific preselected points. These are; the height of the preceding vowel (1), the start of the stressed vowel (2), the top corner of the fall (3), the bottom corner of the fall (4), the start of the rise (5), the phrase accent peak (6), and the end (7). In the case where the stressed vowel is followed by a voiced/voiceless consonant, the measurement point (5) is at the onset of the second vowel. In Figures 1-3 below the measurement points are marked by triangles for accent 1 and squares for accent 2. Acoustic analysis The purpose of the present study is to investigate the interaction between word accent and quantity, and, further, to investigate whether there is a difference between accent 1 and 2 as regards the duration of the short stressed vowel and long stressed vowel, respectively. INVESTIGATION Materials, subjects, method The speech materials comprise seven contrastive disyllabic word-pairs, all of which are listed pair-wise in Table 1 below. (Since the present investigation is part of a large-scale study, the word-pairs included here are not completely symmetric). The target words, in phrase-final focal position, were extracted from various sets of sentences (statements) spoken in two different speaking styles; normal and clear voice. RESULTS The results of the acoustic analysis are exemplified in Figures 1-4. Figures 1-3 show the average f0 curves for five speakers’ production of malen/malen, pollen/ pållen and tecken/täcken in clear style, respectively. The duration of the stressed vowel is indicated by a bar (at the top for accent 2 and at the bottom for accent 1). The pitch contours are aligned at the start of the stressed vowel and earlier points are shown as having negative times relative to the alignment point. In words with a long vowel the duration of the stressed vowel and the overall timing of the two word accents is very similar, meaning that a direct comparison of the timing of pitch events is possible, which is generally not the case in words with a short vowel. Figure 4 compares, for accent 1 and accent 2, the average duration of the stressed vowel for the word-pairs malen/malen, pollen/pållen and tecken/ täcken. Table 1. Contrastive word-pairs included in the present study. Polen (Poland) Judith malen (the moths) buren (the cage) Accent 2 V: pålen (the stake) ljudit (to have sounded) malen (ground) buren (carried) Accent 1 V pollen (pollen) tecken (signs) tjecker (Czechs) Accent 2 V pållen (horsey) täcken (quilts) checker (cheques) Accent 1 V: It is clear from the acoustic results that words with a short vowel behave differently from words with a long vowel. Words with a long vowel (Judith/ ljudit, Polen/pålen and buren/buren) behave nearly the same as malen/malen, which is shown in Figure 1. For both accents the f0 contour is falling throughout the vowel segment from an initial f0 maximum (defined as the top corner of the fall), which starts slightly later at a higher frequency level for accent 2 than for accent 1, to an f0 minimum (the bottom corner of the fall) at the end of the stressed vowel. Speakers were five elderly male native speakers of Gothenburg Swedish. The recordings were made using a portable DAT recorder in the subjects’ local environment. 
RESULTS

The results of the acoustic analysis are exemplified in Figures 1-4. Figures 1-3 show the average f0 curves for five speakers' production of malen/malen, pollen/pållen and tecken/täcken in clear style, respectively. The duration of the stressed vowel is indicated by a bar (at the top for accent 2 and at the bottom for accent 1). The pitch contours are aligned at the start of the stressed vowel, and earlier points are shown as having negative times relative to the alignment point. In words with a long vowel the duration of the stressed vowel and the overall timing of the two word accents are very similar, meaning that a direct comparison of the timing of pitch events is possible, which is generally not the case in words with a short vowel. Figure 4 compares, for accent 1 and accent 2, the average duration of the stressed vowel for the word-pairs malen/malen, pollen/pållen and tecken/täcken.

It is clear from the acoustic results that words with a short vowel behave differently from words with a long vowel. Words with a long vowel (Judith/ljudit, Polen/pålen and buren/buren) behave nearly the same as malen/malen, which is shown in Figure 1. For both accents the f0 contour is falling throughout the vowel segment from an initial f0 maximum (defined as the top corner of the fall), which starts slightly later at a higher frequency level for accent 2 than for accent 1, to an f0 minimum (the bottom corner of the fall) at the end of the stressed vowel.

Figure 1. Average fundamental frequency contours of malen (accent 1, triangles) and malen (accent 2, squares) for five speakers. The first point in the curves is in the preceding unstressed vowel. The bars show the duration of the stressed vowel for accent 1 (bottom) and accent 2 (top). Times are expressed relative to the onset of the stressed vowel.

There is no durational effect of accent evident in long vowel words, while this does seem to be the case in short vowel words. The difference seen for pollen/pållen (Figure 2) and tecken/täcken (Figure 3) was also seen for tjecker/checker. In the short vowel contours of pollen and pållen (Figure 2) the main pattern of the f0 contours of the long vowel words is preserved. Even if the fall of accent 1 pollen starts comparatively late into the vowel, the f0 contour falls rapidly through the second part of the vowel to a final Low, approximately at the VC-boundary.

Figure 2. Average fundamental frequency contours of pollen (accent 1, triangles) and pållen (accent 2, squares) for five speakers, displayed as in Figure 1.

The falling f0 contour of accent 2 starts at a higher frequency level and with slightly later timing than the fall of accent 1 and reaches Low in the following consonant. Figure 3 reveals a further effect. In tecken/täcken and tjecker/checker (not shown here), where the voiced part is very short, it appears that speakers compress the fall in accent 1, so that the Low is achieved at the end of the short vowel, whereas in accent 2 the pitch stays high at the end of the stressed vowel. The graph interpolates to the Low measured at the beginning of the second vowel, so that the true slope of the fall cannot be determined because of the voicelessness.

Figure 3. Average fundamental frequency contours of tecken (accent 1, triangles) and täcken (accent 2, squares) for five speakers, displayed as in Figure 1.

From the results it is obvious that the gradient of the fall in accent 1 differs between long vowel words and short vowel words. The results suggest that the gradient of the fall also differs between short vowel words, i.e. between words where the stressed vowel is followed by a voiced consonant versus a voiceless consonant. The gradient of the fall in tecken appears to be twice as steep as that of malen, while the gradient of the fall in pollen is approximately in between those of malen and tecken.

Figure 4 compares the duration of the stressed vowel of accent 1 and accent 2 for malen/malen, pollen/pållen and tecken/täcken. The difference in duration of the short stressed vowel between accent 1 and accent 2 is noticeable. A t-test showed this difference to be significant for each individual speaker in each speaking style (at the 5 % level).
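A sketch of that duration comparison for one speaker and one word pair, with invented values; the paper does not state which t-test variant was used, so scipy's independent-samples test stands in here.

```python
from scipy.stats import ttest_ind

# Hypothetical stressed-vowel durations (ms) for one speaker's tokens.
acc1_ms = [138, 142, 150, 145, 136]   # e.g. 'pollen' (accent 1)
acc2_ms = [112, 118, 109, 121, 115]   # e.g. 'pållen' (accent 2)

t, p = ttest_ind(acc1_ms, acc2_ms)
print(f't = {t:.2f}, p = {p:.4f}')    # significant at the 5 % level if p < 0.05
```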
[Figure 4: bar chart of average stressed-vowel duration (ms), scale 0-250, per word.]
Figure 4. Average duration (ms) of the stressed vowel for malen/malen, pollen/pållen and tecken/täcken for five speakers. Accent 1 is represented by the light bar and accent 2 by the dark bar.

DISCUSSION

In Gothenburg Swedish short vowel words, accent 2 seems to demonstrate truncation of the pitch fall, while accent 1 seems to demonstrate compression of the fall and also some lengthening of the stressed vowel. It appears that the Gothenburg Swedish speakers' strategy is to preserve the fall in accent 1, while the fall seems to be of less importance for accent 2. One interpretation of this is that the falling f0 contour in the stressed vowel of accent 1, and the height from which the fall takes place in accent 2, are sufficient cues to maintain the distinction between the word accents in words with a short stressed vowel. House (1990) has worked with a model of tonal feature perception which may be applied to these findings. In order to fully understand the interaction of these cues, a perceptual experiment with synthetic stimuli is in preparation, which will manipulate pitch height and slope in order to discover the relative importance of these factors.

References

Bannert R. and Bredvad-Jensen A-C. (1975) Temporal organization of Swedish tonal accents: The effect of vowel duration. Working Papers (Phonetics Laboratory, Department of General Linguistics, Lund University) 10, 1-26.
Bruce G. and Gårding E. (1978) A prosodic typology for Swedish dialects. In Gårding E., Bruce G. and Bannert R. (eds) Nordic Prosody: Papers from a Symposium (Department of Linguistics, Lund University), 219-229.
Bruce G. (1998) Allmän och svensk prosodi (Department of Linguistics, Lund University) 16.
Gårding E. (1973) The Scandinavian word accents. Working Papers (Phonetics Laboratory, Lund University) 8.
Gårding E. and Lindblad P. (1975) Constancy and variation in Swedish word accent patterns. Working Papers (Phonetics Laboratory, Lund University) 3, 36-100.
House D. (1990) Tonal Perception in Speech. (Travaux de l'Institut de Linguistique de Lund, Lund University) 24.
Segerup M. (2003) Word accent gestures in West Swedish. In Heldner M. (ed.) Proceedings from FONETIK 2003, Phonum (Department of Philosophy and Linguistics, Umeå University) 9, 25-28.
Segerup M. (2004) Gothenburg Swedish word accents - a fine distinction. In Branderud P. and Traunmüller H. (eds) Proceedings Fonetik 2004 (Department of Linguistics, Stockholm University), 28-31.
Segerup M. and Nolan F. (forthc.) Gothenburg Swedish word accents - a case of cue trading? Nordic Prosody (Department of Linguistics and Phonetics, Lund University) IX.
Visual Acoustic vs. Aural Perceptual Speaker Identification in a Closed Set of Disguised Voices
Jonas Lindh
Department of Linguistics, Göteborg University

Abstract

Many studies of automatic speaker recognition have investigated which parameters perform best. This paper presents an experiment where graphic representations of LTAS (Long Time Average Spectrum) were used to identify speakers from a closed set of disguised voices, and determines how well the graphic method performed compared to an aural approach. Nine different speakers were recorded uttering a fake threat. The speakers used different disguises such as dialect, accent, whisper, falsetto etc., as well as producing the verbatim "threat" in a normal voice. Using high quality recordings, visual comparison of the Praat "vocal tract" graphs of LTAS outperformed the aural approach in identifying the disguised voices. Performing speaker identification aurally does not mean analyzing a different sample than the one being analyzed acoustically. Studies of aural perception show a hypothesizing, top-down, active process, which raises interesting questions regarding aural speaker identification with bad quality recordings, noisy backgrounds etc. However, more tests on telephone quality recordings, authentic material and additional types of acoustic measurements must be performed before anything can be said about LTAS with implications for forensic purposes.

Background and Introduction

The so-called "voiceprint" approach introduced by Lawrence Kersta (1962) suggested a pattern matching procedure comparing broadband spectrograms for speaker identification purposes. It is within this context that an interest in studying visual vs. aural methods arose. Since complex visual pattern matching activates the right hemisphere of the brain while speech and language processing usually activates the left (Rose, 2002), it would be preferable to find a way to integrate both. There are many problems to be considered when using visual representations of acoustic data within the context of forensic speaker identification, especially considering the effects of low quality recordings. Generally, one can say that aural identification has primarily been the leading method when it comes to casework. Many studies have been carried out to see which parameters are most stable or where effects of low quality can be calculated, for example the telephone effect (Künzel, 2001).

Generally, LTAS becomes rather stable after 30-40 seconds of speech (Boves, 1984; Fritzell et al., 1974; Keller, 2004). LTAS reflects the energy highs and lows generated by the vocal tract filter on average, which means that it should be more difficult to alter than, for example, F0 or specific phones; this is why the measure is often chosen to visually represent the general energy distribution over frequency for the speech signal. Several studies have been conducted on energy ratios and level differences for LTAS (Löfqvist, 1986; Löfqvist & Mandersson, 1987; Gauffin & Sundberg, 1977; Kitzing, 1986). Kitzing (1986) recommended that patients should read at the same degree of vocal loudness to avoid the differences that occurred especially in the higher frequencies. Kitzing & Åkerlund (1993) pointed out the need for an investigation of the effect of vocal loudness on LTAS curves. Nordenberg & Sundberg (2003) performed such a test and showed that vocal loudness and varied f0 gave variations in Long Time Average Spectra. However, even though an expected variation has been shown, pattern matching on the graphs still seems to be possible. It has been observed that a slight difference in identification results between subjects depends on whether they consider distance more important than shape/pattern.

Hollien & Majewski (1977) tested long-term spectra as a means of speaker identification under three different speaking conditions, i.e. normal, under stress, and disguised speech. LTS for fifty American and fifty Polish male speakers were used under fullband as well as passband conditions. The results demonstrated high levels of correct identification (especially under fullband conditions) for normal speech, with degrading results for stress and disguise.
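As a rough illustration of the measure discussed above, the sketch below computes a long-time average spectrum with 100 Hz bins by averaging short-time power spectra over a whole recording. This is a generic formulation in Python, not Praat's exact implementation; the function name and the use of Welch averaging are assumptions.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import welch

    def ltas_db(path, bin_hz=100):
        """Long-time average spectrum in dB, averaged into bin_hz-wide bands."""
        fs, x = wavfile.read(path)
        x = x.astype(float)
        if x.ndim > 1:
            x = x.mean(axis=1)                      # mix down to mono
        f, pxx = welch(x, fs=fs, nperseg=2048)      # time-averaged power spectrum
        edges = np.arange(0, fs / 2, bin_hz)
        bands = [pxx[(f >= lo) & (f < lo + bin_hz)].mean() for lo in edges]
        return edges, 10 * np.log10(np.maximum(bands, 1e-12))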
Method

The sixteen disguised voices and the "suspects" (references) were recorded by six females and three males. The recordings were made with a high quality microphone in front of a personal computer, and the subjects recorded one "normal" voice and as many disguised voices as they wanted, repeating the same fake threat in Swedish. All recordings were between four and six seconds long and sampled at 16 kHz. Forced choice was applied in both the aural and the visual tests.

The Graphic Representations of LTAS

The "vocal tract" function in Praat draws the LTAS envelope (in decibels) as if it were vocal tract areas (in square metres). This gives a graph representing the LTAS. The graph does not give the axis values, which is reasonable since the overall absolute amplitude, as a parameter, has no real value (Nordenberg & Sundberg, 2003). The important information lies in the relative spectral envelope, represented by the line showing the energy distribution as a function of frequency. The graphic representations of LTAS were created from an LTAS object using 100 Hz frequency bins (Boersma & Weenink, 2005).

[Figure 1: two LTAS graphs shown side by side.]
Figure 1. A graph comparison sample (in the test the target is red and each reference blue).

The Visual Comparison Test

Graphs representing LTAS were created for the sixteen disguised voices and paired up with each of the reference samples, to be used in a visual identification test performed by ten subjects. The order in which they were presented was randomized. The subjects were all students or employees at the Department of Linguistics, Göteborg University. They had all, at some point, taken an undergraduate course in phonetics and/or speech technology. The subjects compared each disguised voice with all the suspects/references in pairs and then decided which one they thought was the most similar, comparing shape and/or distance. The subjects were also told that the graphs had no timeline and that they were supposed to perform pattern matching, answering which graphs were the most similar ones in each test sample. They were also asked to comment on how they reached each conclusion and whether distance or shape was most important in coming to a decision. This was done in order to interpret how subjects compared the visual input. They were allowed to inspect the graphs as many times and for as long as they wanted.

The Aural Identification Test

Seven subjects performed aural identification on the same set of samples, to allow easy comparison of the results. The recordings were put in a list in randomized order. Subjects used headphones and could listen to the samples as many times as they wanted before deciding which one of the references they thought sounded most like the target. All subjects were of the same category as in the visual test, and some test subjects were the same as in the visual test.
Results and Discussion

Even though there is a great difference in performance between subjects within each test, it is clear that visual identification outperforms aural identification.

The Visual Identification Results

The results of the visual tests show consistency.

Table 1. Inter-rater Reliability Analysis (Cronbach's alpha).

N of Disguised Voices   N of Subjects   Alpha
16                      10              0.91

The impression based on the comments is that subjects with a preference for pattern and shape rather than distance generally performed better in the visual test.

[Figure 2: bar chart of percent correct per disguised voice sample; values range from 0 to 100%.]
Figure 2. Percent correct visual identifications per sample (16) for 10 subjects.

Figure 2 shows how many correct identifications were made per disguised voice sample. Some graphs were obviously very difficult to identify. Why that is so, or how those graphs differ, has not yet been investigated.

[Figure 3: bar chart of percent correct per subject; values range from 31 to 56%.]
Figure 3. Percent correct visual identifications per subject (10) for 16 samples.

Figure 3 shows the identification results for each subject, which vary from nine correct identifications down to five. As mentioned above, performance was clearly related to whether the subject used pattern/shape matching more than distance. The average identification score for the visual test is 6.9, which could be considered rather low, but given the difficulties presented in the aural test results, it is merely the comparison between the two tests that is taken into consideration in this study.

The Aural Identification Results

The results in the aural test were less correlated. The reason is simply that subjects found the task much more difficult; most subjects thought that "no decision" should have been added as an alternative answer.

Table 2. Inter-rater Reliability Analysis (Cronbach's alpha).

N of Disguised Voices   N of Subjects   Alpha
16                      7               0.83

The reliability score is lower in this test than in the visual test. However, the correlation is high enough to be interpreted as a rather high correlation between subjects.

[Figure 4: bar chart of percent correct per disguised voice sample; values range from 0 to 100%.]
Figure 4. Percent correct aural identifications per sample (16) for 7 subjects.

Figure 4 gives a result overview which may be compared with the corresponding Figure 2 for the visual test. The number of correct identifications per sample is clearly lower, even though the maximum possible is also lower (seven subjects vs. ten).

[Figure 5: bar chart of percent correct per subject; values range from 19 to 44%.]
Figure 5. Percent correct aural identifications per subject (7) for 16 samples.

Figure 5 presents the figures corresponding to Figure 3 in the visual identification test. The subjects' results are clearly lower, even though the ranges overlap in that the lowest visual score (five) falls below the highest aural score (seven); there seems to be an individual strategy component to success. The results per subject in the aural test also show a higher degree of variation than in the visual test. This is probably due to the difficulties subjects showed in deciding on which reference to choose.
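The inter-rater reliability figures in Tables 1 and 2 can be reproduced with a standard Cronbach's alpha computation. A minimal sketch, assuming a subjects-by-samples matrix of scores (the exact scoring matrix used is not specified in the paper):

    import numpy as np

    def cronbach_alpha(scores):
        """scores: 2-D array, rows = raters (subjects), cols = samples
        (e.g. 1 = correct identification, 0 = incorrect)."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[0]                         # number of raters
        rater_vars = scores.var(axis=1, ddof=1)     # each rater across samples
        total_var = scores.sum(axis=0).var(ddof=1)  # summed score per sample
        return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)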
Conclusions

General advantages of graphic representations are:
• Intrasubjectively applicable (depending on the amount of data).
• Relatively simple fundamentals for calculation.
• Rather easy to visualize.

General disadvantages are:
• Difficult to quantify and substantiate comparisons.
• The visualization depends on F0 and vocal loudness variations.
• An average always ignores specific events in the speech signal.

Considering the categorical, top-down, active human speech perception process (Grosjean, 1980), it is interesting to find complementary visual acoustic information for aural methods in forensic speaker identification. When two voice samples are compared, the same input is judged no matter whether it is judged aurally or acoustically. The question is how it is analyzed, and how the acoustic visual and the aural perceptual information are processed. If a better understanding of the relation between the two is reached, objective methods can be used to judge similarities. Objective acoustic methods can also more easily be excluded on well-grounded arguments, as can subjective aural ones. This could also lead to better statistical data in forensic speaker identification, if computer-based methods can be used with more confident supervision. It is clear that aural mistakes are made, especially for disguised voices. The graphic representations used in this experiment are not claimed to be complete images reflecting the voice of a speaker. They are but examples showing that in some cases visual acoustic input is better at discriminating between speakers than are ears alone.

References

Boersma P. & Weenink D. (2005) Praat: doing phonetics by computer (Version 4.3.01) [Computer program]. Retrieved from http://www.praat.org/
Boves L. (1984) The phonetic basis of perceptual ratings of running speech. Foris Publications, Dordrecht.
Gauffin J. & Sundberg J. (1977) Clinical application of acoustic voice analysis. Part II: Acoustic analysis, results. 1977/2-3, 39-43.
Grosjean F. (1980) Spoken word recognition processes and the gating paradigm. Perception and Psychophysics 28, 267-283.
Hollien H. & Majewski W. (1977) Speaker identification by long-term spectra under normal and distorted speech conditions. Journal of the Acoustical Society of America 62, 975-980.
Keller E. (2004) The analysis of voice quality in speech processing. In Lecture Notes in Computer Science, Springer Verlag, Berlin.
Kersta L. G. (1962) Voiceprint identification. Nature 196, 1253-1257.
Kitzing P. (1986) LTAS criteria pertinent to the measurement of voice quality. Journal of Phonetics 14, 477-482.
Künzel H. J. (2001) Beware of the 'telephone effect': The influence of telephone transmission on the measurement of formant frequencies. Forensic Linguistics 8, 80-99.
Löfqvist A. (1986) The long-time-average spectrum as a tool in voice research. Journal of Phonetics 14, 471-475.
Löfqvist A. & Mandersson B. (1987) Long-time average spectrum of speech and voice analysis. Folia Phoniatrica 39, 221-229.
Nordenberg M. & Sundberg J. (2003) Effect on LTAS of vocal loudness variation. TMH-QPSR, KTH, 45, 93-100.
Rose P. (2002) Forensic Speaker Identification. New York: Taylor & Francis.
Stevens K. N. (1993) Lexical access from features. In Speech Communication Group Working Papers (Vol. VIII, 119-144). Research Laboratory of Electronics, Massachusetts Institute of Technology.
A Model-Based Experiment towards an Emotional Synthesis
Jonas Lindh
Department of Linguistics, Göteborg University

Abstract

The most successful methods for inducing emotions in state of the art unit selection speech synthesis have been built by switching speech database depending on the desired emotion. These methods require a substantial increase in memory compared to a single database and are computationally slow. The model-based approach is an attempt to reshape a neutrally recorded utterance (comparable to the desired output from a modern unit selection system) so that it simulates a recorded model of a desired emotion. Factors for the manipulation of duration, amplitude and formant shift ratio are calculated by comparing the recorded neutral utterance with three recorded, basic emotional models in accordance with discrete emotion theory: sadness, happiness and anger. F0 (regarded as the intonation) is copied from the model and imposed on the neutrally recorded utterance. The evaluation of the experiment shows that subjects easily categorize discrete emotions in a forced choice. For the male voice, they also grade the emotional quality of the utterances resynthesized from the neutral recording almost as high as that of the naturally recorded models. The female voice created more difficulties and contained more synthetic artifacts, i.e. it was judged to have a lower quality than the recorded models.
Background and Introduction

Creating emotional synthesis has been a research area for quite some time. Formant speech synthesis is easily distinguished from human speech not only because of its underdeveloped naturalness, but also due to its lack of expressiveness. Several attempts to implement emotions in formant synthesis have taken place (Cahn, 1988; 1989; 1990; Carlson et al., 1992). When dealing with emotional content in speech, the point of departure is almost always the neutral utterance. What is neutral speech, i.e. speech without emotions? Normally, neutral speech is thought of as a carrier being modulated to reveal the emotions being communicated. Such a concept is rather useful when it comes to synthesizing expressive speech. One simply treats the relationship as a hierarchy where the abstract underlying expression is neutral and the surface expressions are the emotions we want to induce, in this case the basic emotions from discrete emotion theory: anger, sadness and happiness (Levenson, 1994; Laukka, 2004; Tatham & Morton, 2004; Narayanan & Alwan, 2004).

A modern state of the art unit selection speech synthesis system normally produces a sentence as neutrally as possible in order to avoid undesired side effects or miscommunication. Neutral in this case means near monotone, or containing as few speech fluctuations as possible. This is not always desirable, for example in dialogue systems. To be able to judge whether a system succeeds in expressing a certain emotion or desire, it is obviously also important to study how well people in general succeed in communicating emotions. The development of conversational systems has increased, meaning that merely understandable, neutral synthetic speech is barely acceptable anymore. Some success has been reached, but the best systems still depend too much on stored data, including a separate emotional speech database (Bulut et al., 2002). The most successful attempts to synthesize emotions have been built using additional speech databases containing only recordings of specific emotions (this applies to concatenative/unit selection synthesis systems). The system has to be able to switch database when a specific emotion is desired. The system must perhaps also use different algorithms/analyses for the different databases, since the acoustic content might differ significantly. The databases needed for such a system also mean a substantial increase in the amount of data to choose from. A simpler and computationally more efficient method is to induce rules for expressive speech and resynthesize an utterance produced by the system.

Nowadays, most unit selection systems are created by recording a single professional speaker and then using specified parts (normally diphones as the basic element) of the utterances to concatenate new ones. This normally means that a professional speaker must be available to be recorded for emotional utterances of different lengths. If these recordings are used as models, they will hopefully not differ more from the utterances produced by the system than there are differences within a specific speaker. The desire for a simpler, rule-based way of inducing emotions in unit selection synthesis has been expressed by, for example, Murray et al. (2000) and Zovato et al. (2004). In this paper, an experiment using models to calculate differences between a neutral and an emotional utterance is presented and tested. The results show both difficulties and promising outcomes, which are discussed with respect to finding ways to induce emotions in synthesis. If emotions are to be created by a system, they cannot be expected to outperform the communication of emotional content from recorded models.

The model-based approach

The approach described and tested in this paper is similar to the rule-based idea described in Zovato et al. (2004) and Murray et al. (2000), except that the rules are based on interactive calculations in comparison with models. The model calculations are also applied to complete utterances and not to specific units (i.e. syllables or diphones etc.). A state of the art unit selection synthesis attempts to sound as natural and neutral as possible. If the voice used in the system is recorded to produce models of emotions, the neutrally produced output can be seen as the underlying neutral representation. This representation can then be compared to the produced models in order to calculate variations for specified parameters. The aim of the model-based approach is to approach the recognition rate for the models themselves while keeping naturalness. The limitations are obvious: when stretching and changing too much, PSOLA will create synthetic artifacts.

Method

Two speakers, one female and one male, were recorded uttering the same sentence ("Jag har inte varit där idag", 'I have not been there today') in four different expressive styles: neutral, sad, happy and angry. The recordings were made in a studio environment using a high quality microphone. The speakers were told to first consider how to express the emotions in speech with respect to duration, amplitude and intonation. They were then told to express the emotions as clearly as possible while being recorded, even though the semantic content did not suggest a specific emotion. Each recorded emotion was then used both as a model to induce the specific emotion in the neutrally recorded utterance and as a reference against which the resynthesized speech could be compared. If one uses the same speaker and the calculated differences from the same utterance with different emotions, one should be able to resynthesize at least the specific parameters correctly. Seven subjects finally evaluated the results by categorizing and grading the neutral recording, the recorded models and the three resynthesized objects for each of the two speakers, i.e. fourteen utterances of the same sentence.
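The factor calculation described above amounts to simple ratios and differences between model and neutral measurements. A minimal sketch, assuming per-utterance measurements like those in Tables 1 and 2 (the function and field names are illustrative, not from the original script):

    def manipulation_factors(neutral, model):
        """neutral, model: dicts of utterance measurements:
        'dur' (s), 'amp' (dB), 'f1', 'f2', 'f3' (mean formants, Hz)."""
        dur_factor = model['dur'] / neutral['dur']     # stretch/compress
        amp_shift = model['amp'] - neutral['amp']      # dB to add
        # One overall formant shift ratio, averaged over the first three
        # formants (as the paper describes):
        ratios = [model[f] / neutral[f] for f in ('f1', 'f2', 'f3')]
        formant_shift = sum(ratios) / len(ratios)
        return dur_factor, amp_shift, formant_shift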
[Figure 1: flow chart of the processing steps.]
Figure 1. Flow chart showing the script procedure for the model-based experiment.

Figure 1 shows how a neutral utterance and a model are compared and how the neutral utterance is finally resynthesized. First, the durations and average amplitudes of the neutral utterance and the model are calculated, and equal duration is computed for the two objects. Pitch tier objects are then created after point-processing the framed fundamental frequency values. A point-processed object is a sequence of points (ti) in time, defined on a domain [tmin, tmax]; the index (i) runs from 1 to the number of points, and the points are sorted by time (i.e. ti+1 > ti). Points are generated along the entire time domain of the pitch tier; since there is no voiced/unvoiced information, the F0 contour is linearly interpolated. This means that one can easily exchange the point-processed pitch tier of one object for that of another, thus cloning the intonation (Boersma & Weenink, 2005). The formant shift ratio is then calculated for the first three formants and the formants are manipulated. Finally the duration (relative to the model) and the average amplitude are modified and the utterance is resynthesized.
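The pitch-cloning and resynthesis step can be sketched with Praat's Manipulation machinery. The sketch below uses the Python interface Parselmouth rather than the original Praat script, so the exact sequence of calls is an assumption about how the described procedure could be realized, not the author's code:

    import parselmouth
    from parselmouth.praat import call

    neutral = parselmouth.Sound("neutral.wav")     # hypothetical file names
    model = parselmouth.Sound("angry_model.wav")

    # Manipulation objects hold pitch and duration tiers for resynthesis.
    manip = call(neutral, "To Manipulation", 0.01, 75, 600)
    model_manip = call(model, "To Manipulation", 0.01, 75, 600)

    # Clone the model's intonation onto the neutral utterance.
    model_pitch = call(model_manip, "Extract pitch tier")
    call([manip, model_pitch], "Replace pitch tier")

    # Stretch the neutral utterance to the model's duration.
    dur = call(manip, "Extract duration tier")
    call(dur, "Add point", neutral.duration / 2,
         model.duration / neutral.duration)
    call([manip, dur], "Replace duration tier")

    # PSOLA-style resynthesis (overlap-add).
    resynth = call(manip, "Get resynthesis (overlap-add)")
    resynth.save("resynth.wav", "WAV")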
Evaluation Test

Seven subjects with normal hearing and some previous experience of listening to synthesized speech (six employees and one student at the department of linguistics) performed an evaluation. In the evaluation the subjects listened to sixteen samples, eight male and eight female. The samples were the neutral utterance plus the three recorded models of the same sentence and the three resynthesized samples. When hearing a sample, the subjects had to categorize it as belonging to one of the four categories neutral, happy, sad and angry. After categorizing, they had to grade the confidence of their categorization from 1 to 5 (absolutely confident). Finally, they had to grade the naturalness, on a scale from very synthetic (1) to sounding like a recorded voice (5).

Results and Discussion

The result of the modulations was calculated by comparing averages and standard deviations for the resynthesized objects and the models.

Table 1. Model and modified parameter values for the male voice.

Male voice      F0 mean  F0 std  Ampl (dB)  F1 mean  F2 mean  F3 mean
Neutral         95       24      68         519      1482     2644
Sad Model       153      18      69         405      1300     2592
Sad Resynth     148      16      69         512      1508     2517
Happy Model     133      52      75         528      1464     2602
Happy Resynth   133      46      72         522      1451     2629
Angry Model     84       8       70         517      1367     2672
Angry Resynth   83       5       68         507      1452     2629

Table 2. Model and modified parameter values for the female voice.

Female voice    F0 mean  F0 std  Ampl (dB)  F1 mean  F2 mean  F3 mean
Neutral         172      17      70         573      1670     2687
Sad Model       328      73      67         587      1535     2783
Sad Resynth     311      25      68         610      1651     2681
Happy Model     358      119     77         707      1661     2709
Happy Resynth   349      107     73         608      1734     2767
Angry Model     250      53      77         638      1658     2649
Angry Resynth   236      52      74         614      1689     2686

As can be observed in Table 1, the F0 values are modified fairly well compared to the models. The formant shift ratio should be individualized for each formant and not changed according to the general average of the first three. For the female voice (Table 2), the neutral recording contained some traces of creakiness, which led to some failures in the F0 analysis and thereby also in the resynthesis. Generally, the values approach the model's. The average results calculated from all subjects are given in Tables 3 and 4 below.

Table 3. Average results from the categorization and grading (1-5) by seven subjects of the male voice.

Male voice      Neutral  Sad  Happy  Angry  Natural
Neutral         4.7      -    -      -      4.3
Sad Model       -        4.3  -      -      3.8
Sad Resynth     -        4.2  -      -      2.7
Happy Model     -        -    4.8    -      4.7
Happy Resynth   0.7      -    3      -      3.5
Angry Model     -        -    -      3.67   4.33
Angry Resynth   -        -    -      3.5    3.5

The average naturalness score for the resynthesized male samples is 3.37, while the overall average for the recorded models is 4.29. Whether this decrease in naturalness is acceptable has not been investigated; it means that there is a trade-off between naturalness and a computationally cheap method. Categorizing the male samples created no problems, with one uncertain exception (the 0.7 happy-neutral confusion in Table 3).
Table 4. Average results from the categorization and grading (1-5) by seven subjects of the female voice.

Female voice    Neutral  Sad  Happy  Angry  Natural
Neutral         0.7      2.8  -      -      4
Sad Model       -        4.5  -      -      3.5
Sad Resynth     0.8      1.8  -      -      1.7
Happy Model     -        -    4      -      3.5
Happy Resynth   0.3      1.2  1      -      2
Angry Model     -        -    -      5      4.7
Angry Resynth   0.5      -    -      2.8    2.8

The female voice created more difficulties. The samples contained more synthetic artifacts, which were detected by the listeners. The average naturalness score for the resynthesized samples is 2.62. Since the categorization as well as the grading was worse here, it is likely that the synthetically low quality of the output made categorizing more difficult. This might also be an example of what happens when bad models are created (see Table 2).

Conclusions and Further Developments

Categorizing discrete emotions does not seem to be a problem, but the difficulty certainly increases as quality degrades. Female voices turned out to be more difficult to resynthesize without degrading naturalness. More research is needed to make a well-grounded comparison possible; one problem may be individual voice variation. The purpose of a model-based approach is to be able to induce discrete emotions on a neutrally uttered sentence produced by a state of the art unit selection system. By comparison with well-formed models from one individual speaker, characteristics such as intonation, F0 changes, formant shift ratios and amplitude can be calculated and induced on a neutrally uttered sentence successfully. More research on the segment level (syllable, diphone etc.) at which the calculations and the inducing should be done is desirable for the future. There also remain several questions regarding what a model should look like and which parameters really should be modified to reach the model-based goal.

References

Boersma P. & Weenink D. (2005) Praat: doing phonetics by computer (Version 4.3.07) [Computer program]. Retrieved March 31, 2005, from http://www.praat.org/
Bulut M., Narayanan S. and Syrdal A. (2002) Expressive speech synthesis using a concatenative synthesizer. In Proc. ICSLP (Denver).
Cahn J. E. (1988) From sad to glad: Emotional computer voices. Proceedings of Speech Tech '88, Voice Input/Output Applications Conference and Exhibition, New York City, 35-37.
Cahn J. E. (1989) Generation of affect in synthesized speech. Proceedings of the 1989 Conference of the American Voice Input/Output Society, Newport Beach, California, 251-256.
Cahn J. E. (1990) The generation of affect in synthesized speech. Journal of the American Voice I/O Society 8, 1-19.
Carlson R., Granström B. and Nord L. (1992) Experiments with emotive speech - acted utterances and synthesized replicas. Proc. ICSLP'92, 671-674.
Laukka P. (2004) Vocal Expression of Emotion. Discrete-emotions and Dimensional Accounts. Acta Universitatis Upsaliensis, Comprehensive Summaries of Uppsala Dissertations from the Faculty of Social Sciences 141. Uppsala.
Levenson R. W. (1994) Human emotion: A functional view. In Ekman P. and Davidson R. J. (eds) The Nature of Emotion: Fundamental Questions, 123-126. New York: Oxford University Press.
Murray I. R., Edgington M. D., Campion D. and Lynn J. (2000) Rule-based emotion synthesis using concatenated speech. In SpeechEmotion-2000, 173-177.
Narayanan S. and Alwan A. (2004) Text to Speech Synthesis: New Paradigms and Advances. Prentice Hall PTR, IMSC Press Multimedia Series.
Tatham M. and Morton K. (2004) Expression in Speech: Analysis and Synthesis. Oxford: Oxford University Press.
Zovato E., Pachiotti A., Quazza S. and Sandri S. (2004) Towards emotional speech synthesis: a rule based approach. Workshop Proceedings, 5th ISCA Speech Synthesis Workshop, Carnegie Mellon University, Pittsburgh (US).

Annotating Speech Data for Pronunciation Variation Modelling
Per-Anders Jande
KTH: Department of Speech, Music and Hearing/CTT – Centre for Speech Technology

Abstract

This paper describes methods for annotating recorded speech with information hypothesised to be important for the pronunciation of words in discourse context. Annotation is structured into six hierarchically ordered tiers, each tier corresponding to a segmentally defined linguistic unit. Automatic methods are used to segment and annotate the respective annotation tiers. Decision tree models trained on annotation from elicited monologue showed a phoneme error rate of 9.91%, corresponding to a 55.25% error reduction compared to using a canonical pronunciation representation from a lexicon for estimating the phonetic realisation.

Introduction

The pronunciation of a word depends on the context in which the word is uttered. A model of pronunciation variation due to discourse context is interesting in a description of a language variety. Such a model can also be used to increase the naturalness of synthetic speech and to dynamically adapt synthetic speech to different areas of use and to different speaking styles. The pronunciation of words in context is affected by many variables in conjunction. The number of variables and their complex relations make data-driven methods appropriate for modelling.
Data-driven methods are methods used to create general models from examples, e.g. using machine learning. To use data-driven methods, data (examples) is of course a prerequisite. The method for acquiring data for variables hypothesised to be important for the pronunciation of words is to annotate recorded spoken language with information about the variables. The pronunciation and the set of context variables are thus used as an example, which can be used for finding general structures in the data. This article describes methods for annotating speech data with information hypothesised to be important for predicting the segment-level pronunciation of words in discourse context.

Background

Work on pronunciation variation in Swedish on the phonological level has been reported by several authors, e.g. Gårding (1974), Bruce (1986), Bannert and Czigler (1999) and Jande (2003a, 2003b, 2004). There is an extensive corpus of reports on research on the influence of different context variables on the pronunciation of words. Variables that have been found to influence the segmental realisation of words in context are foremost speech rate, word predictability (often estimated by global word frequency) and speaking style (cf. e.g. Fosler-Lussier and Morgan, 1999; Finke and Waibel, 1997; Jurafsky et al., 2001; van Bael et al., 2004).

Speech Data

The speech data used for pronunciation variation modelling was not recorded specifically for this project, but has been collected from various sources. The speech corpus includes data recorded or made available for research within the fields of phonetics, phonology and speech technology in different earlier research projects. The speech data has been selected to be dialectally homogeneous, to avoid dialectal pronunciation variation; the language variety used is central standard Swedish. The speech data has been recorded in different situations, and speaking style related variables are defined from the speaking situation. The collection of speech data includes radio news broadcasts and interviews, spontaneous dialogues, elicited monologues, acted readings of children's books, neutral readings of fact literature and recordings of dialogue system interaction. Methods and software for annotation have been developed using mainly the VAKOS corpus (Bannert and Czigler, 1999) as the target to be annotated. This corpus was originally recorded and annotated for the study of variation in consonant clusters in central standard Swedish. It consists of ~103 minutes of monologue from ten native speakers of central standard Swedish.
Speech rate is estimated by inverse segment duration. Segments were estimated by the canonical phonemes and segment boundaries by the automatically obtained alignment of the phoneme string to the signal. Speech rate estimates based on all segments and estimates based on vowel segments only are calculated. Duration normalised for inherent phoneme length and for speaker, respectively, is used as well as non-normalised duration. Both duration on a linear scale and on a logarithmic scale are used. All combinations of strategies are included in the annotation, resulting in 16 different speech rate measures for each unit. Annotation Structure All annotation is connected to some durationbased unit at one of six hierarchically ordered tires. The tiers correspond to 1) the discourse, 2) the utterance, 3) the phrase, 4) the word, 5) the syllable and 6) the phoneme. Each tier is strictly sequentially segmented into its respective type of units. Some non-word units can be introduced in the word tier annotation to ensure that parts of the signal that are not speech can be annotated, e.g. pauses and inhalations. A boundary on a higher tier is always also a boundary on a lower tier. An utterance boundary is thus also always a phrase boundary, a word boundary, a syllable boundary and a phoneme boundary. Thus, information can be unambiguously inherited from units on higher tiers to units on the tires below. Having the information stored at different tiers enables easy access to the sequential context information, i.e., properties of the units adjacent to the current unit at the respective tiers. Utterance Tier Annotation The utterance tier annotation includes the variables speaker sex, utterance type (statement, question/request response, answer/response) and a set of speech rate measures. Segmentation Each annotation tier is segmented into its corresponding units, beginning at the word tier. Based on the word tier segmentation and information derived from the word units, the tiers above and below the word tier are segmented. The phoneme tier is segmented word-by-word using the orthographic annotation, a canonical pronunciation lexicon and an HMM phoneme aligner, NALIGN (Sjölander, 2003). The phonemes are clustered into syllables with forced syllable boundaries at word boundaries and the syllable tier is segmented using this clustering and the durational boundaries from the phoneme segmentation. Utterance boundaries are Phrase Tier Annotation The phrase tier annotation includes the variables phrase type, phrase length (word, syllable and phoneme counts), prosodic weight (stress count, focal stress count), and measures of local speech rate over each phrase unit and of pitch dynamism and pitch range. A pitch extraction algorithm included in the SNACK sound toolkit (Sjölander and Beskow, 2000; Sjölander, 2004) is used to obtain information about the pitch contour of the speech 26 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University data. A slope tracking algorithm was used for localising minimum and maximum points or plateaus in the extracted pitch contour. The mean pitch is calculated over each segment of the signal corresponding to a unit over which pitch dynamism and range was to be computed. The sum of the absolute distance between the mean and each extreme value is the pitch dynamism. The difference between the largest extreme value and the smallest extreme value is the pitch range. 
Phrase Tier Annotation

The phrase tier annotation includes the variables phrase type, phrase length (word, syllable and phoneme counts), prosodic weight (stress count, focal stress count), and measures of local speech rate over each phrase unit and of pitch dynamism and pitch range. A pitch extraction algorithm included in the SNACK sound toolkit (Sjölander and Beskow, 2000; Sjölander, 2004) is used to obtain information about the pitch contour of the speech data. A slope tracking algorithm is used for localising minimum and maximum points or plateaus in the extracted pitch contour. The mean pitch is calculated over each segment of the signal corresponding to a unit over which pitch dynamism and range are to be computed. The sum of the absolute distances between the mean and each extreme value is the pitch dynamism. The difference between the largest extreme value and the smallest extreme value is the pitch range. In addition to the normal Hz frequency scale, pitch is also measured on the Mel, ERB (equivalent rectangular bandwidth) and semitone scales. The three latter scales are used to give estimates of pitch differences closer to the perceived frequency differences of human listeners.
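A minimal sketch of the two pitch measures as defined above, given a unit's mean pitch and the extreme values found by the slope tracker. The Mel conversion shown is the common O'Shaughnessy formula, an assumption about which variant was used:

    import math

    def mel(hz):
        """Hz -> Mel (O'Shaughnessy's formula)."""
        return 2595.0 * math.log10(1.0 + hz / 700.0)

    def pitch_dynamism(extremes, mean_pitch):
        """Sum of absolute distances between each extreme and the mean."""
        return sum(abs(e - mean_pitch) for e in extremes)

    def pitch_range(extremes):
        """Difference between the largest and smallest extreme value."""
        return max(extremes) - min(extremes)

    # Example on the Mel scale: convert first, then measure.
    extremes_hz = [96.0, 143.0, 121.0, 88.0]    # illustrative values
    mean_hz = 112.0
    print(pitch_dynamism([mel(f) for f in extremes_hz], mel(mean_hz)))
    print(pitch_range([mel(f) for f in extremes_hz]))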
Word Tier Annotation In addition to a reference orthographic representation, the variables included in the word tier annotation are word length (syllable and phoneme counts), part of speech, morphology (number, definiteness, case, pronoun form, tense/aspect, mood, voice and degree), word type (content word or function word), word repetitions (full-form and lexeme), word predictability (estimation based on trigram, bigram and unigram statistics from an orthographically transcribed version of the Göteborg Spoken Language Corpus, Allwood et al., 2000), global word probability (unigram probability), the position of the word in the phrase, focal stress, distance to preceding and succeeding foci (in number of words), pause context, filled pause context, interrupted word context, prosodic boundary context and measures of local speech rate over each word unit and of pitch dynamism and pitch range. Syllable Tier Annotation The syllable tier annotation includes the variables stress, accent, distance to preceding and succeeding stressed syllable (in number of syllables), syllable length (phoneme count), syllable nucleus, the position of the syllable in the word and measures of local speech rate over each syllable unit. Phoneme Tier Annotation On the phoneme level, the annotation includes the canonical phoneme and a set of articulatory features describing the canonical phoneme, the position of the phoneme in the syllable and in a 27 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University Aycock J. (1998) Compiling little languages in Python. Proc /th International Python Conference. Bannert R. and Czigler P. E. (1999) Variations in consonant clusters in standard Swedish. Phonum 7, Umeå University. Brants T. (2000) TnT – A statistical part-ofspeech tagger. Proc ANLP. Bruce G. (1986) Elliptical phonology. Papers from the Ninth Scandinavian Conference on Linguistics, 86–95. Finke M. and Waibel A. (1997) Speaking mode dependent pronunciation modeling in large vocabulary conversational speech recognition. Proc Eurospeech, 2379–2382. Fosler-Lussier E. and Morgan N. (1999). Effects of speaking rate and word frequency on pronunciations in conversational speech. Speech Communication, 29(2–4):137–158. Gårding E. (1974) Sandhiregler för svenska konsonanter. Svenskans beskrivning 8, 97– 106. Jande P-A (2003a) Evaluating rules for phonological reduction in Swedish. Proc Fonetik, 149–152. Jande P-A (2003b) Phonological reduction in Swedish. Proc 15th ICPhS, 2557–2560. Jande, P.-A. (2004). Pronunciation variation modelling using decision tree induction from multiple linguistic parameters. Proc Fonetik, 12–15. Jurafsky D., Bell A., Gregory M., and Raymond W. (2001). Probabilistic relations between words: Evidence from reduction in lexical production. In Bybee and Hopper (eds) Frequency and the emergence of linguistic structure, 229–254. John Benjamins. Megyesi B. (2002a) Data-driven syntactic analysis – Methods and applications for Swedish. Ph. D. Thesis. KTH, Stockholm. Megyesi B. (2002b). Shallow parsing with pos taggers and linguistic features. Journal of Machine Learning Research, 2, 639–668. Sjölander K. (2003) An HMM-based system for automatic segmentation and alignment of speech. Proc Fonetik, 193–196. Sjölander K. (2004) The snack sound toolkit. http://www.speech.kth.se/snack/ Sjölander K. and Beskow J. (2000) WaveSurfer - a public domain speech tool. Proc ICSLP, IV, 464–467. Van Bael C., van den Heuvel H., and Strik H. (2004). 
Conclusions

A system for the annotation of speech data with variables hypothesised to be important for the pronunciation of words in discourse context has been described, and the automatic methods used for obtaining or estimating the variables have been presented. The annotation has been used for creating pronunciation variation models in the form of decision trees. The models show a 55.25% decrease in phone error rate compared to using canonical phonemic word representations from a pronunciation lexicon.

References

Allwood J., Björnberg M., Grönqvist L., Ahlsén E. and Ottesjö C. (2000) The Spoken Language Corpus at the Linguistics Department, Göteborg University. Forum Qualitative Social Research 1.
Aycock J. (1998) Compiling little languages in Python. Proc. 7th International Python Conference.
Bannert R. and Czigler P. E. (1999) Variations in consonant clusters in standard Swedish. Phonum 7, Umeå University.
Brants T. (2000) TnT – A statistical part-of-speech tagger. Proc. ANLP.
Bruce G. (1986) Elliptical phonology. Papers from the Ninth Scandinavian Conference on Linguistics, 86–95.
Finke M. and Waibel A. (1997) Speaking mode dependent pronunciation modeling in large vocabulary conversational speech recognition. Proc. Eurospeech, 2379–2382.
Fosler-Lussier E. and Morgan N. (1999) Effects of speaking rate and word frequency on pronunciations in conversational speech. Speech Communication 29(2–4), 137–158.
Gårding E. (1974) Sandhiregler för svenska konsonanter. Svenskans beskrivning 8, 97–106.
Jande P.-A. (2003a) Evaluating rules for phonological reduction in Swedish. Proc. Fonetik, 149–152.
Jande P.-A. (2003b) Phonological reduction in Swedish. Proc. 15th ICPhS, 2557–2560.
Jande P.-A. (2004) Pronunciation variation modelling using decision tree induction from multiple linguistic parameters. Proc. Fonetik, 12–15.
Jurafsky D., Bell A., Gregory M. and Raymond W. (2001) Probabilistic relations between words: Evidence from reduction in lexical production. In Bybee and Hopper (eds) Frequency and the Emergence of Linguistic Structure, 229–254. John Benjamins.
Megyesi B. (2002a) Data-driven syntactic analysis – Methods and applications for Swedish. Ph.D. thesis. KTH, Stockholm.
Megyesi B. (2002b) Shallow parsing with pos taggers and linguistic features. Journal of Machine Learning Research 2, 639–668.
Sjölander K. (2003) An HMM-based system for automatic segmentation and alignment of speech. Proc. Fonetik, 193–196.
Sjölander K. (2004) The snack sound toolkit. http://www.speech.kth.se/snack/
Sjölander K. and Beskow J. (2000) WaveSurfer - a public domain speech tool. Proc. ICSLP, IV, 464–467.
Van Bael C., van den Heuvel H. and Strik H. (2004) Investigating speech style specific pronunciation variation in large spoken language corpora. Proc. ICSLP, 586–589.

Estonian rhythm and the Pairwise Variability Index
Eva Liina Asu¹ and Francis Nolan²
¹ Institute of the Estonian Language, Tallinn
² Department of Linguistics, University of Cambridge
Abstract

The Pairwise Variability Index (PVI), a measure of how much unit-to-unit variation there is in speech, has been used as a correlate of rhythmic impressions such as 'syllable-timed' and 'stress-timed'. Grabe and Low (2002) included Estonian among a number of languages for which they calculated the durational PVI for vowels and for intervocalic intervals, but other than that Estonian rhythm has not been studied within recent approaches to rhythm calculation. The pilot experiment reported in this paper compares the use of various speech units for the characterisation of Estonian speech rhythm. It is concluded that the durational PVIs of the syllable and of the foot are more appropriate for capturing the rhythmic complexity of Estonian, and might provide a subtle tool for characterising languages in general.

Introduction

The Pairwise Variability Index (PVI) is a quantitative measure of acoustic correlates of speech rhythm which calculates the patterning of successive vocalic and intervocalic (or consonantal) intervals, showing how one linguistic unit differs from its neighbour. It was first applied, at the suggestion of the second author of this paper, by Low (1998: 25) in her study of Singapore English rhythm. Among other measures, Low compared successive vowel durations and showed that Singapore English had a lower average PVI over utterances than British English. This fits in with the impressionistic observation that Singapore English is more 'syllable-timed' than British English. 'Syllable timing' (Abercrombie, 1967: 97) carries the implication that speakers make syllables the same length, and is opposed to 'stress timing', the tendency to compress syllables where necessary to yield isochronous feet (i.e. inter-stress intervals). Attempts to find isochrony of either kind have produced disappointing results, even for languages canonically perceived as syllable-timed (e.g. French) or stress-timed (e.g. British English). The PVI, however, shifted the emphasis from absolute isochrony to a scalar 'prominence gradient' between successive syllables. On average, British English alternates prominent and highly reduced syllables, while French syllables are more even. The research reported by Low, Grabe and Nolan (2000) and Grabe and Low (2002), focusing on the PVI, and that of Ramus, Nespor and Mehler (1999), using rather different measures, has shown that it is possible to achieve useful scalar characterisation (though not discrete categorisation) of the utterance rhythm of different languages.

According to the traditional rhythmic dichotomy, Eek and Help (1987) classify Estonian as a 'syllable-timed' language that has in its history undergone a Great Rhythm Shift from a rhythmically more complex type. The only study where Estonian has been included in PVI calculations is Grabe and Low's (2002) comparison of 18 languages, where Estonian showed a vocalic variability roughly similar to Romanian, French and Catalan. The present pilot study pursues the characterisation of Estonian utterance rhythm using the PVI. In particular, since the PVI is a completely general concept, its application is extended to alternative phonetic units such as syllables and feet.

Method

Materials

The materials used were the first four sentences of a read passage recorded for intonation analysis in Asu (2004). The four sentences comprise 62 syllables and (depending on the speaker) four to seven intonation phrases. This is less than half the 160 or so syllables on which Grabe and Low's (2002) Estonian results are based, but compensatorily the present analysis uses data from five speakers compared to only one in theirs. The speakers are all female speakers of Standard Estonian who were asked to read the passage at a normal tempo. Three subjects were recorded in a quiet environment in Tartu, Estonia, and two in the sound-treated booth of the phonetics laboratory of Cambridge University.
Additionally, the measurements allowed the derivation of ‘pseudo-syllable’ units, consisting of an intervocalic interval and the following vowel, as apparently used by Barry et al. (2003). This unit totally disrespects linguistic syllabification, so that for instance the words ‘hommikust sõi’ ((he/she) ate breakfast), which linguistically would be syllabified as ‘hom.mi.kust.sõi’, yielded the pseudo-syllables ‘ho.mmi.ku.stsõi’. The motivations for calculating the pseudo-syllable are merely that it is the natural corollary of the ‘vocalic-intervocalic’ PVI dichotomy, and, more interestingly, that it corresponds to the ‘Articulatory Syllable’ of Kozhevnikov and Chistovich (1965), which they proposed as the domain of coarticulation. Although many studies disconfirmed this hypothesis, the pseudo-syllable nevertheless has a heritage in research into the organisation of speech production which makes it worth considering. The second set of measurements took as its starting point a traditional phonological syllabification of the utterances. There is relatively little controversy over how Estonian syllabifies (unlike in the case of English): e.g. a long (Q2) or overlong (Q3) vowel or a diphthong forms one syllable but consonant clusters of two or more consonants are split so that the last consonant starts a new syllable. Acoustically some decisions had to be made; for instance long consonants, or sequences of identical vowels at the end of one word and the beginning of another, were simply divided at their mid-point; while sequences of two different vowels at word boundaries were divided at the point which best preserved their acoustic and auditory identity. Most cases, however, were unproblematic. The beginning of each syllable was recorded, and the syllable lengths calculated. A further set of durations was derived from the linguistic syllable, namely the phonological feet. These are considered to consist of a stressed (not necessarily accented) syllable and zero, one, or two following unstressed syllables. Trisyllabic words constitute one prosodic foot if there is no secondary stress on the second or third syllable of the word (Ross and Lehiste, 2001: 51). Phrase-initial unstressed syllables (an ‘anacrusis’) are left out as they do not participate in a well-formed foot. The calculation of foot durations merely involved adding together the durations of the syllables making up the foot. The sentences are listed below with syllable divisions indicated by points, feet enclosed by square brackets, sentence ends marked by ##, and word divisions by spaces: [Sel] . [hom.mi.kul] . [är.kas] . [Tuu.li] . [en.ne] . [ema] . ## Ta . [pani . end] . [ta.sa].[ke.si] . [rii].[des.se] . ja . [läks] . [al.la] . [köö.ki] .## Kui . ta . [ta.lu].[köö.gi] . [pi.ka] . [lau.a] . [ta.ga] . [hom.mi.kust] . [sõi] . [il.mus] . [uk.se.le] . [õ.de] . [A.nu] .## ‘Kas . sa . [lä.hed] . [tä.na] . [koo.li]?’ ## Among other units, Ramus (2002) suggests the foot as a possible alternative unit for the measure of speech rate but as far as the present authors are aware the foot PVI has not been calculated previously in any language. The rationale for calculating it will be discussed with the results. One further point must be mentioned. The PVI can be calculated ‘raw’ (rPVI), where the differences between successive pairs of units are averaged over the material, or ‘normalised’ (nPVI). Normalisation involves expressing each difference as a proportion of the average of the two units involved (e.g. their average duration). 
The original point of this (Low, 1998: 39) was to neutralise the effect of utterance-level rate variation, particularly between-speaker differences in rate and phrase-final lengthening or rallentando. There are arguments for and against this (e.g. Barry et al., 2003; Grabe and Low, 2002) as a matter of principle, but the fact that our units are of widely differing size (segment, syllable, foot) means that normalisation is essential. The magnitude of equivalent variation between feet will inevitably be greater in absolute terms than that between syllable-parts, but expressing the variation as a proportion of the two units involved neutralises this difference of magnitude. The resultant fractional value of each normalised PVI is multiplied by 100 to express it as a percentage.

Results
The results are summarised in Figure 1, which shows five normalised PVI measures, and five speakers within each measure. The first measure is the normalised vowel PVI (nVPVI). The values range between 39 and 52, averaging 44.6, compared to Grabe and Low's (2002) 45.4, which reassures us both that our sample is large enough and that there is reasonable consistency between speakers. Grabe and Low's second measure, the raw intervocalic PVI, is not shown on the graph as it is not comparably scaled (consonantal intervals are generally much shorter). Our group mean of 45.9 is a little higher than their 40.0, but still of a similar order; since this value is not normalised, it would be very sensitive to speech rate. The second measure plotted in Figure 1 is the normalised version of this intervocalic PVI (nCPVI), the mean of the speakers being 57.5.

Figure 1. Estonian PVI measures for five prosodic units for five speakers (EP, KK, LL, LS, MU).

The next two PVI measures in Figure 1 have group means around forty: 40.5 for the PVI of the 'pseudo-syllable' comprising a vowel and all preceding consonantal material (nCVPVI), and 45.7 for the linguistic syllable (nSPVI). The foot PVI (nFPVI) has clearly the lowest group mean at 35.3. Although thorough statistical testing has not yet been carried out on this limited pilot data, a t-test shows that nFPVI is different from nSPVI and nCVPVI at the p<0.01 level. The vocalic measure nVPVI is not significantly different from either of the two syllable measures nCVPVI and nSPVI.

Discussion
There are, on the face of it, two curious aspects to the PVI tradition. The first is why, when 'syllable-timing' is at issue, the pairwise variability not of the syllable but of components of the syllable (vowel and intervocalic interval) has been favoured (except by Barry et al., 2003). Low (1998: 25) attributes her choice of the vocalic PVI to Taylor (1981), who claims that vowel duration is the key to syllable timing; and, pragmatically, the choice also allows researchers to side-step controversies about English syllable-division; but there is little detailed justification in the literature. Subsequently, the intervocalic (CPVI) measure was adopted to capture, in particular, languages' permitted consonant sequences. The present data, however, give us reasons to question the desirability of splitting the syllable. For one thing, nVPVI and nCPVI have the least between-speaker stability (SD around 5.0, compared to 2.0 for the syllable measures and 1.5 for the nFPVI). This suggests that timing regularity below the syllable may not be controlled as tightly as higher up the prosodic hierarchy. For another, the nCPVI reflects two very different influences: the degree of phonotactic permissiveness, and the dynamics of individual consonants (in Estonian, for instance, the presence of a very short tapped /r/ at one extreme and 'overlong' (Q3) consonants at the other). The second curious aspect, given the opposing term 'stress-timing', is that little if any attention has been paid previously to the PVI of the foot.

Admittedly the values for nFPVI in the present paper need to be treated with caution, as they are based on only 28 feet per speaker, but the measure is consistent across speakers, and we suggest that there are good reasons in principle for including this measure in future studies. In Estonian, the nFPVI turns out to be significantly smaller than our syllable-based measures (nCVPVI, nSPVI, and, arguably, nVPVI), but we expect nFPVI to have a variable relationship with these measures across languages, which will define rhythmic type in a rather subtle way. Languages perhaps need not, as in the traditional dichotomy, either (like English) squash their unstressed syllables to achieve approximate foot-isochrony, or (like French) keep their syllables fairly even and not bother about foot timing. They could also equalise their feet to some degree, but share the 'squashing' more fairly in polysyllabic feet. Estonian, with its strong stress but near absence of vowel quality reduction in unstressed syllables, and despite its three-way quantity contrast which sporadically curtails syllable-equality, may be at base such a language.
Our prediction for English, therefore, is a higher nSPVI than for Estonian and a similar or lower nFPVI; and for French, a similar or lower nSPVI compared to Estonian and a higher nFPVI. The fourth logical possibility, a language with a steep prominence gradient between stressed and unstressed syllables but with no tendency to foot-isochrony, seems less likely. A pilot study on one speaker of English appears to confirm our prediction. In a sample consisting of 42 feet, the nSPVI was higher (53.5) than for Estonian (45.7) and the nFPVI lower (30.3) as compared to Estonian (35.3). This result is suggestive, although there are considerable problems in English in defining both syllables and feet, so further research will be required to confirm this characterisation of the two languages. However, we are confident that syllable and foot PVIs provide independent dimensions which will prove effective in capturing the rhythmic properties of languages.

Conclusion
This paper has presented a preliminary investigation of Estonian rhythm, comparing a number of measures, each of which expressed the fluctuation in duration between successive phonological units. It has been argued that the common practice of characterising languages in terms of the pairwise variability of vowels and intervocalic intervals may be less appropriate than using variability measures of phonological syllables and of stress feet. This is particularly so when the results are to be related to impressionistic characterisations in terms of 'syllable-timing' and 'stress-timing'. However, the point is made that these terms are not opposites ranged on a single continuum, but two independent parameters along which languages can vary.

References
Abercrombie D. (1967) Elements of General Phonetics. Edinburgh: Edinburgh University Press.
Asu E. L. (2004) The Phonetics and Phonology of Estonian Intonation. Unpublished doctoral dissertation. University of Cambridge.
Barry W. J., Andreeva B., Russo M., Dimitrova S. and Kostadinova T. (2003) Do rhythm measures tell us anything about language type? Proceedings of the 15th ICPhS, Barcelona, 2693-2696.
Eek A. and Help T. (1987) The interrelationship between phonological and phonetic sound changes: a great rhythm shift of Old Estonian. Proceedings of the 11th ICPhS, Tallinn, Estonia, 6, 218-233.
Grabe E. and Low E. L. (2002) Durational variability in speech and the rhythm class hypothesis. In Gussenhoven C. and Warner N. (eds) Laboratory Phonology 7, 515-546. Berlin: Mouton de Gruyter.
Kozhevnikov V. A. and Chistovich L. A. (1965) Speech: Articulation and Perception. Translation: Joint Publications Research Service 30-543, US Department of Commerce.
Low E. L. (1998) Prosodic Prominence in Singapore English. Unpublished doctoral dissertation. University of Cambridge.
Low E. L., Grabe E. and Nolan F. (2000) Quantitative characterisations of speech rhythm: syllable-timing in Singapore English. Language and Speech 43, 377-401.
Ramus F. (2002) Acoustic correlates of linguistic rhythm: perspectives. Proceedings of Speech Prosody 2002, Aix-en-Provence, 115-120.
Ramus F., Nespor M. and Mehler J. (1999) Correlates of linguistic rhythm in the speech signal. Cognition 72, 1-28.
Ross J. and Lehiste I. (2001) The Temporal Structure of Estonian Runic Songs. Berlin: Mouton de Gruyter.
Taylor D. S. (1981) Non-native speakers and the rhythm of English. International Review of Applied Linguistics in Language Teaching 19 (3).

Duration of syllable-sized units in casual and elaborated Finnish: a comparison with Swedish and Spanish
Diana Krull
Department of Linguistics, Stockholm University, Stockholm

Abstract
Recordings of Finnish casual dialogue and careful reading were analyzed auditorily and on spectrograms. Syllables on the phonological level were compared to syllable-sized units (C´V´s) on the phonetic level. Comparisons with existing Swedish and Spanish data revealed several differences: Finnish had much less temporal equalization of syllable-sized units in casual speech than Swedish, and even slightly less than Spanish. Instead, there was a greater decrease in the number of C´V´s. In all three languages, the duration of a C´V´ was partly dependent on its size. In Finnish, however (in contrast to Swedish and, to a lesser degree, Spanish), C´V´ duration was only marginally affected by lexical stress. Finnish, like Spanish, had rhythmic patterns typical of syllable-timed languages in both speaking styles, while Swedish changed from a more stress-timed pattern in careful reading to a more syllable-timed one in casual speech.

Introduction
The complex Swedish phonotactics has been shown to be considerably simplified in casual speech (Engstrand and Krull, 2001). Syllables containing heavy consonant clusters on the phonological level were often represented by alternating simple contoid-vocoid sequences in casual conversation. As a consequence, the durations of syllable-sized units tended to be equalized, and the rhythmic pattern of Swedish came closer to that of a syllable-timed language such as Spanish (Engstrand and Krull, 2002). The present paper addresses the question: how would the durations of syllable-sized units in different Finnish speaking styles compare with Swedish and Spanish? On the one hand, Finnish resembles Spanish in the simplicity of its phonotactics, which would lead us to expect a similarity to the Spanish pattern. On the other hand, there is a segmental short-long contrast as in Swedish, although not limited to stressed syllables. Phonetically, the difference between short and long segments is larger in Finnish, and there is no interdependence between vowel and consonant quantity (Engstrand and Krull, 1994).

Methods
The Finnish speech material consisted of a lively dialogue between native speakers PJ and EK, and text reading (PJ). The dialogue was recorded in the early 1990s and was used for segment duration analyses (Engstrand and Krull, 1994). The text reading was recorded in 2005. The Swedish and Spanish materials cited for comparison come from Engstrand and Krull (2001, 2002).
All recordings were made in sound-treated recording studios using high-quality professional equipment. The digitized Finnish material was segmented into syllable-sized units and labeled using the Soundswell Signal Workstation (http://www.hitech.se/development/products/soundswell.htm). Since casual speech is characterized by numerous coarticulation and reduction phenomena, a conventional morpho-phonetically based syllabification was not possible. Reliable identification and segmentation of units posed problems; e.g. the Swedish word behandla could be pronounced as [beãla]. Therefore, contoid-vocoid(-contoid) sequences mirroring opening-closing movements were chosen as units (see Engstrand and Krull, 2002). For simplicity, they will be referred to as C´V´ units, where C´ may be a single contoid or a cluster and V´ a single vocoid or a diphthong. The term 'unit' is used in a strictly phonetic sense; it may sometimes contain traces of 'underlying' segments. The segmentation was carried out auditorily and visually on spectrograms in the same manner as with the Swedish and Spanish material (Engstrand and Krull 2001, 2002). Onsets consisted where possible of a single contoid or a contoid cluster, and a unit was considered well-formed if it agreed with Jespersen's sonority hierarchy (Jespersen 1926). No consideration was given to the phonotactics of a given language or to word and morpheme boundaries. For example, the Finnish words myös kielellisen would be segmented as myö-skie-le-lli-sen, resulting in an initial cluster which is not normal in Finnish.
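The grouping of segments into C´V´ units can be illustrated in code. The sketch below is our interpretation of the procedure just described, not the authors' script; in particular, attaching an utterance-final contoid to the last unit is our assumption, and the segment types and durations are invented:

# Group a contoid ('C') / vocoid ('V') sequence into C´V´ units: each unit is
# a maximal contoid run plus the following maximal vocoid run; a final
# contoid tail is attached to the last unit (our assumption). Durations in ms.
def group_cv_units(segments):
    units, current, seen_vocoid = [], 0.0, False
    for phone_type, dur in segments:
        if phone_type == 'C' and seen_vocoid:
            units.append(current)          # a new opening gesture starts here
            current, seen_vocoid = 0.0, False
        current += dur
        if phone_type == 'V':
            seen_vocoid = True
    if current:
        if seen_vocoid or not units:
            units.append(current)
        else:
            units[-1] += current           # utterance-final contoid tail
    return units

# 'myös kielellisen' -> myö-skie-le-lli-sen (invented durations):
segments = [('C', 60), ('V', 180), ('C', 70), ('C', 90), ('V', 150),
            ('C', 40), ('V', 70), ('C', 110), ('V', 60), ('C', 80),
            ('V', 90), ('C', 60)]
print(group_cv_units(segments))   # five unit durations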
Results
Table 1 shows the incidence of phonetic syllable-sized units vs. phonologically determined syllables in Finnish. Swedish data (Engstrand and Krull, 2001) were added for comparison. (The duration of units in prepausal position was not included.) It can be seen that in both languages, the total number of phonetic units was lower than the corresponding number of phonological syllables. The decrease was larger for the Finnish speakers: 12% (PJ) and 15% (EK) in casual speech, while the corresponding figures for the Swedish speakers were 9% (RL) and 10% (JS). In the reading condition, the decrease was 10% for Finnish and only 2% for Swedish. In addition, Table 1 shows that the share of open units (i.e. sequences ending in a vowel or vocoid) was larger in the phonetic representation of both languages. The increase was much larger for the Swedish speakers: from 27% to 73% (RL), 40% to 73% (JS), and 31% to 62% (GT, read text). For Finnish, the corresponding increase was from 58% to 80% (PJ), 61% to 79% (EK), and 57% to 77% (PJ, read).

Table 1. Phonetic syllable-sized units and phonological syllables in casual and read Finnish. Swedish data (Engstrand and Krull, 2001) are added for comparison.

Language, speaker, condition | Analysis | No. of syll. | % open syll. | % CV | % CCV and V
Finn. PJ casual | Phonet. | 960 | 80 | 73 | 7
Finn. PJ casual | Phonol. | 1097 | 58 | 58 | 0
Finn. EK casual | Phonet. | 584 | 79 | 67 | 11
Finn. EK casual | Phonol. | 647 | 61 | 61 | 0
Finn. PJ read | Phonet. | 876 | 77 | 71 | 7
Finn. PJ read | Phonol. | 954 | 57 | 57 | 0
Swed. RL casual | Phonet. | 822 | 73 | 61 | 12
Swed. RL casual | Phonol. | 900 | 27 | 25 | 2
Swed. JS casual | Phonet. | 543 | 81 | 67 | 13
Swed. JS casual | Phonol. | 491 | 40 | 37 | 2
Swed. GT read | Phonet. | 977 | 62 | 53 | 9
Swed. GT read | Phonol. | 997 | 31 | 29 | 2

Another difference between the two languages was the share of simple syllables (mainly CV, in a few cases V). In Finnish, such sequences amounted to about 60% of all phonological syllables, in Swedish to no more than around 30%. In the phonetic representation, however, the difference in the share of open units between the two languages was smaller, mainly due to the more radical simplification of syllables in Swedish. (Several spectrographic examples of such simplifications are given in Engstrand and Krull, 2001.)

Table 2. Mean durations and standard deviations (ms) for open syllable-sized units (C´V´s) in read and casual Finnish. Swedish and Spanish data (Engstrand and Krull, 2002) are added for comparison.

Language, speaker | Condition | Mean | Std | N
Finnish PJ | Read | 156 | 48 | 675
Finnish PJ | Casual | 154 | 42 | 761
Finnish EK | Casual | 166 | 48 | 429
Swedish | Read | 200 | 86 | 350
Swedish | Casual | 178 | 62 | 306
Spanish | Read | 156 | 59 | 167
Spanish | Casual | 155 | 51 | 287

Table 2 shows mean C´V´ durations and standard deviations for the three languages and two speaking conditions. Note the difference in standard deviations between Swedish on the one hand and Spanish and Finnish on the other. Of special interest is the decrease in standard deviation between the read and casual conditions. Figure 1 illustrates the distribution of C´V´ durations in Finnish, Swedish and Spanish graphically. In the Swedish read condition the frequency distribution was much broader and flatter than in casual speech. The difference between speaking conditions was smaller in Spanish and smallest in Finnish.

Figure 1. Distributions of C´V´ unit durations (ms) in Finnish, Swedish and Spanish in two speaking conditions: upper row read text, lower row casual speech. Data affected by prepausal lengthening are removed. (The Swedish and Spanish data are from Engstrand and Krull, 2002.)

The data in Figure 2 show a strong dependence of the duration of a C´V´ on the number of segments it contains. Moreover, they reveal an additional difference between Finnish and Swedish. In Swedish, there was a regular difference in duration between stressed and unstressed C´V´s in casual speech and a slightly larger difference in the reading condition. In Finnish, the difference was negligible. For all C´V´s, the difference between the mean duration of stressed and unstressed units was 107 ms for read Swedish and 76 ms for casual speech. For read Finnish, the difference was 10 ms and for casual speech 11 ms. According to Engstrand and Krull (2002), "The 107 ms stress effect in the Swedish read condition agrees with the expected durational effect of stress-timed languages (Eriksson 1991), whereas the durational quantum added by stress in the remaining conditions (unscripted Swedish as well as read and unscripted Spanish) would be more typical of syllable-timed languages". For Finnish, the same can be said as for Spanish.

Figure 2. Mean durations (ms) as a function of C´V´ unit size in read and unscripted Swedish and Finnish. Upper graphs show casual speech; lower graphs text reading. Filled circles represent stressed units and triangles unstressed units.
Discussion and conclusions
In casual Swedish, sound sequences have been shown to be greatly simplified in relation to the phonotactic structures. Specifically, Swedish tends to produce alternating contoid and vocoid articulations which relate to more complex structures on the phonological level (Engstrand and Krull, 2001). In Finnish, the phonotactic structure is much less complex. In the read text of the present study, for example, simple CV structures made up more than half of the syllables on the phonological level, while in Swedish the corresponding share was less than a third. Compared to Swedish, therefore, Finnish allows fewer possibilities for syllable simplification. It appears that instead of simplifying syllables, the Finnish speakers dropped them: there was a relatively large decrease in the number of syllables between the phonological representation and its phonetic counterpart in Finnish, both in casual speech and in text reading.

Another difference between Finnish and Swedish was found in the distribution of syllable durations. Although Finnish has a comparatively large difference in duration between short and long segments (see Engstrand and Krull, 1994, for data from speakers PJ and EK), the duration of (open) syllables tended to center in a narrow area around a peak (Figure 1). There was not much change in distribution breadth between reading and casual speech. Part of the explanation for this phenomenon may lie in the near-equality of the durations of stressed and unstressed syllables in Finnish.
In contrast to Swedish, Finnish has a quantity distinction also in unstressed syllables. To sum up, the relatively simple syllable structure of Finnish resembles that of Spanish. As a consequence, there was a strong resemblance between these languages in how phonological syllables were represented on the phonetic level. Moreover, there was little difference between the syllable structures of careful reading and casual speech. Swedish differs from Finnish and Spanish through its complex syllable structure, with heavy consonant clusters in all positions. On the phonetic level, especially in casual speech, these structures were extensively simplified. As a result, the Swedish speech rhythm became more similar to that of Finnish and Spanish. It could be said that with a change from careful reading to casual speech, Swedish changed from a stress-timed to a more syllable-timed language. Further investigation will show whether this phenomenon is specific to Swedish or whether it is generally valid for languages with elaborate syllable structures.

References
Engstrand, O. and Krull, D. (1994). Durational correlates of quantity in Swedish, Finnish and Estonian: cross-language evidence for a theory of adaptive dispersion. Phonetica Vol. 51, No. 1-3.
Engstrand, O. and Krull, D. (2001). Simplification of phonotactic structures in unscripted Swedish. J.I.P.A. 31, 41-50.
Engstrand, O. and Krull, D. (2002). Duration of syllable-sized units in casual and elaborated speech: cross-language observations on Swedish and Spanish. In: Fonetik 2002, TMH-QPSR Vol. 44.
Eriksson, A. (1991). Aspects of Swedish speech rhythm. Gothenburg Monographs in Linguistics 9. Department of Linguistics, University of Göteborg.
Jespersen, O. (1926). Lehrbuch der Phonetik. 4. Aufl., 190-91, Leipzig and Berlin: Teubner.

The sound of 'Swedish on Multilingual Ground'
Petra Bodén 1, 2
1 Department of Linguistics and Phonetics, Lund University, Lund
2 Department of Scandinavian Languages, Lund University, Lund

Abstract
In the present paper, recordings of 'Swedish on multilingual ground' from three different cities in Sweden are compared and discussed.

Introduction
In Sweden, an increasing number of adolescents speak Swedish in new, foreign-sounding ways. These new ways of speaking Swedish are primarily found in the cities. The overarching purpose of the research project Language and language use among young people in multilingual urban settings is to describe and analyze these new Swedish varieties (hereafter referred to as 'Swedish on multilingual ground', SMG) in Malmö, Gothenburg and Stockholm.
Most SMG varieties are known by names that reveal where they are spoken: "Rinkeby Swedish" in Rinkeby, Stockholm, "Gårdstenska" in Gårdsten, Gothenburg, and "Rosengård Swedish" in Rosengård, Malmö. However, if you discuss Rinkeby Swedish with young people in Malmö, they will instantly associate it with Rosengård Swedish (i.e. with the corresponding Malmö SMG variety); if you play examples of Rosengård Swedish to teenagers in Lund, they will associate them with the Lund SMG variety "Fladden" (named after Norra Fäladen), and so on. In other words, obvious similarities are perceived between different varieties of SMG.

Purpose
In the present paper, a first comparison between SMG materials recorded in Malmö, Gothenburg and Stockholm is undertaken with the object of searching for differences and similarities in the varieties' phonology and phonetics.

Previous research
Descriptions in the literature of so-called ethnic accents or (multi-)ethnolects of different languages reveal some similarities. One example of such a similarity is a staccato-like rhythm or syllable-timing. A staccato-like rhythm has been observed in e.g. Rinkeby Swedish (Kotsinas 1990), in the so-called Nuuk Danish spoken by monolingual Danish-speaking adolescents in Greenland (Jacobsen 2000) and in the so-called multi-ethnolect of adolescents in Copenhagen (Quist 2000). In some other language varieties that have developed through language contact, e.g. Nigerian English (Udofot 2003), Maori English (Holmes & Ainsworth 1996) and Singapore English (Low & Grabe 1995), the speech rhythm has been described as approaching syllable-timing. However, in the present paper, we will restrict ourselves to investigating the similarities (and differences) between three varieties of SMG.

Method
The material comes from the speech database collected by the research project Language and language use among young people in multilingual urban settings. During the academic year 2002-2003, the project collected a large amount of comparable data in schools in Malmö, Gothenburg and Stockholm. The speakers are young people (mainly 17-year-olds) who attended the second year of the upper secondary school's educational program in social science during 2002-2003. The recordings comprise both scripted and spontaneous speech. They include: (1) interviews between a project member and the participating pupils, (2) oral presentations given by the participating pupils, (3) classroom recordings, (4) pupil group discussions, and (5) recordings made by the pupils themselves (at home, during the lunch break, at cafés, etc.). The recordings were made with portable minidisc recorders (SHARP MD-MT190) and electret condenser microphones (SONY ECM717), and subsequently digitized. The naturalness of the speech material has been obtained at the expense of good sound quality. Acoustic analyses using the speech analysis programs WaveSurfer and Praat have been undertaken when possible; other parts of the material have primarily been examined auditorily.

Results
In the following, we will restrict ourselves to describing a small set of SMG features that demonstrate interesting differences and similarities between the cities.

Prosody
Word accents
It is a well-known fact that L2 learners of Swedish find it difficult to perceive and produce the word accent distinction. Given the close relation between foreign accent and SMG, one possible common feature of the SMG varieties is a lack of the word accent distinction. Phonetically, the difference between accent I and accent II is one of F0 peak timing: the F0 peak of accent I is aligned earlier with the stressed syllable than that of accent II.
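The peak-timing difference just described lends itself to a simple measurement: locate the F0 maximum within the stressed syllable and express its position as a fraction of the syllable's duration. The following Python sketch is our illustration, not the project's actual analysis procedure, and the contour values are invented:

# Relative position of the F0 peak in the stressed syllable:
# 0 = syllable onset, 1 = syllable offset. An early peak suggests accent I,
# a late peak accent II, in the Malmö pattern described in the text.
def peak_position(times, f0, syll_start, syll_end):
    voiced = [(t, f) for t, f in zip(times, f0)
              if syll_start <= t <= syll_end and f > 0]   # f == 0: unvoiced
    peak_time = max(voiced, key=lambda tf: tf[1])[0]
    return (peak_time - syll_start) / (syll_end - syll_start)

times = [0.40, 0.42, 0.44, 0.46, 0.48, 0.50]   # s (invented)
f0 = [180, 190, 200, 215, 230, 220]            # Hz (invented)
print(peak_position(times, f0, syll_start=0.40, syll_end=0.50))  # 0.8: late peak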
In the Malmö dialect, the F0 peak is found at the beginning of the stressed syllable in accent I words, and at the end in accent II words. The same pattern can be seen in examples of Rosengård Swedish, see Figure 1.

Figure 1. F0 contour of speaker C41's production of the accent I word bra 'good' and the accent II word hemsidor 'home pages'.

In Stockholm Swedish, the F0 peak is found at the end of the stressed syllable in focused accent I words. In focused accent II words, two F0 peaks are found: one at the beginning of the stressed vowel and another one later (midway between the preceding peak and the next accent or boundary tone or, in compounds, in association with the secondary stress). A first look at the Stockholm SMG data reveals that a word accent distinction is used, but it also reveals some deviating patterns. Perceptually prominent accent II words are, for example, not always assigned two F0 peaks (the focal rise is missing), see Figure 2.

Figure 2. F0 contour of speaker L31's production of the accent II word skitgod 'very good'.

The word accents in the Gothenburg SMG material still remain to be investigated.

Segmentals
/ɕ/ and [tʃ]
When we ran a perception experiment in Malmö with the object of investigating which of our informants spoke Rosengård Swedish (Hansson & Svensson 2004), we noted that one of the stimuli contained something typical of SMG at the very beginning of the recording. Instead of listening to the entire 30-second-long stimulus, the listeners (adolescents from Malmö) marked their answer after having heard only the first two prosodic phrases (approximately 6.5 seconds). The two phrases in question are given in (1).

(1) ja(g) ska gå å plugga lite nu asså hon checkar språket å sånt

Apart from the phrase å sånt 'and stuff', which adolescents in Malmö perceive as particularly common in Rosengård Swedish, the pronunciation of the word checkar 'checks' stands out as being non-representative of the Malmö dialect. The first sound in checkar, /ɕ/, is pronounced with the affricate [tʃ]. Although not a non-existent sound in Swedish dialects, it is perceived as foreign in the Malmö dialect and, by the listeners in the perception experiment, as a typical feature of SMG. The same sound can be heard in the materials recorded in Gothenburg and Stockholm, e.g. in words like chillar 'chill' (with initial [tʃ]) and other borrowings.

R sounds
If you ask a Scanian to imitate Rosengård Swedish, he or she will most likely use front r sounds. Indeed, among the SMG speakers in Malmö, the pronunciation of the r sound varies greatly. Out of the ten stimuli perceived as Rosengård Swedish, front r sounds can be heard in five. Among them, there are both fricative and approximant r sounds and the more perceptually salient trilled r sounds. Also in the Stockholm SMG material, the r sounds differ from the regional dialect in that trilled r sounds appear to be used more frequently. Finally, trilled r sounds can be heard in the Gothenburg SMG material, although here it is not evident that they are more numerous than in the Gothenburg dialect in general.
Intonation
An expanded F0 range can be observed in utterances recorded in all three cities. The pattern is found mainly in exclamations and rhetorical questions, see Figures 3, 4 and 5.

Figure 3. F0 contour of speaker P11's production of ve- vem e smart 'wh- who is clever' with an expanded F0 range and, for comparison, ingen av dom kan nåt 'none of them knows anything' (male speaker from Gothenburg).

Figure 4. F0 contour of speaker L31's production of jag e hungrig 'I'm hungry' with an expanded F0 range and, for comparison, jag ba men okej 'I just okay' (female speaker from Stockholm).

Figure 5. F0 contour of speaker D40's production of det är tjugosju procent som vill ha kvar kungen 'it's twenty-seven percent that want to keep the king' with an expanded F0 range (female speaker from Malmö).

In summary, the SMG varieties have both features in common and regional features.

Discussion
How come there are similarities?
How come the different SMG varieties share the above-mentioned features? The relation to learner language and foreign accent is of course obvious in Malmö, Gothenburg and Stockholm alike, but a foreign accent can sound in a multitude of different ways. One possible explanation is, of course, that all SMG varieties are influenced by the same language or language family. On the other hand, SMG does not sound like one particular foreign accent. Another possible explanation is that the varieties are characterized by features that are typologically unmarked and frequent in the world's languages. This is either related to the fact that many of those features exist in the teenagers' first languages, or to the fact that simplification and usage of unmarked features is generally favored in language contact situations (regardless of what the languages in contact are). A third explanation is that it is features of the Swedish language itself that give rise to the varieties' similar 'sound', e.g. the difficulties encountered when learning Swedish. All three alternatives probably have some explicative power, although none of them completely accounts for why the varieties sound like they do. Word accents are unusual in the speakers' first languages, tend to disappear in language contact situations (as in Finland Swedish), and are difficult for second language learners to learn. A word accent distinction is, nevertheless, maintained in SMG.

A fourth explanation is given by the 'gravity model of diffusion' (Trudgill 1974) or the 'cascade model' (Labov 2003): language change spreads from the largest to the next largest city, and so progressively downwards (i.e. by so-called city-hopping). In other words, the similarities among the SMG varieties can be explained as the result of a spreading of SMG from city to city (i.e. from Stockholm to Gothenburg, from Gothenburg to Malmö, and so on). A spreading from city to city rather than in a more wave-like pattern does not assume acceptance of the spreading features in the rural areas between the cities. The model thereby explains why SMG cannot be found among young people everywhere between Stockholm and Malmö. What mechanism produces sufficient contact among speakers from different cities for the spreading to occur? Labov (2003) discusses two possibilities: 1) that people from the smaller city come to the larger city (for employment, shopping, entertainment, education, etc.) and 2) that representatives of the larger city may travel outwards to the smaller city, and bring with them the dialect features being diffused (e.g. in connection with the distribution of goods).
In the case of SMG, the first explanation is the most likely. Spreading through music (like that of e.g. Latin Kings) is also a plausible explanation.

Differences
Despite the similarities perceived between Rinkeby Swedish and Rosengård Swedish by adolescents in Malmö, many are surprised to hear that the Malmö adolescents perceive the soccer player Zlatan Ibrahimovic as a speaker of Rosengård Swedish (and not simply a speaker of the Malmö dialect). How large is the difference between SMG and the regional dialect? How large is the difference between e.g. Rosengård Swedish and Scanian? Although Rosengård Swedish clearly contains a number of non-regional features, not all speakers of Rosengård Swedish use all of those features, and many features of Rosengård Swedish are not distinct from the regional dialect (e.g. the word accents). 'Swedish on Multilingual Ground' should, therefore, only be seen as an overarching term for a number of related but distinct varieties. SMG in Malmö appears to be 'Scanian on Multilingual Ground' (which incidentally is reflected in the Advance Patrol member's artist name Gonza Blatteskånska).

In the present paper, we have presented a number of segmental and prosodic features that are common to all SMG varieties, but also discussed a feature that distinguishes them from each other (the word accent realization). Future research will reveal more similarities and differences and, thereby, hopefully shed some light on the relationship between the different SMG varieties on the one hand (e.g. whether city-hopping has occurred), and on the relationship between SMG and foreign accent on the other.

Acknowledgements
The research reported in this paper has been financed by the Bank of Sweden Tercentenary Foundation.

References
Hansson P. & Svensson G. (2004) Listening for "Rosengård Swedish". Proceedings FONETIK 2004, The Swedish Phonetics Conference, May 26-28 2004, 24-27.
Holmes J. & Ainsworth H. (1996) Syllable-timing and Maori English. Te Reo 39, 75-84.
Jacobsen B. (2000) Sprog i kontakt. Er der opstået en ny dansk dialekt i Grønland? En pilotundersøgelse. Grønlandsk kultur- og samfundsforskning 98/99, 37-50.
Kotsinas U-B. (1990) Svensk, invandrarsvensk eller invandrare? Om bedömning av "främmande" drag i "ungdomsspråk". Andra symposiet om svenska som andraspråk i Göteborg 1989, 244-274.
Labov W. (2003) Pursuing the cascade model. In Britain D. & Cheshire J. (eds) Social Dialectology: In Honor of Peter Trudgill. Amsterdam: John Benjamins.
Low E. & Grabe E. (1995) Prosodic patterns in Singapore English. Proceedings of the International Congress of Phonetic Sciences, Stockholm, 13-19 August 1995, 636-639.
Quist P. (2000) Ny københavnsk 'multietnolekt'. Om sprogbrug blandt unge i sprogligt og kulturelt heterogene miljøer. Danske Talesprog, 143-212. Copenhagen: C.A. Reitzels Forlag.
Trudgill P. (1974) Linguistic Change and Diffusion: Description and Explanation in Sociolinguistic Dialect Geography. Language in Society 2, 215-246.
Udofot I. (2003) Stress and rhythm in the Nigerian accent of English: A preliminary investigation. English World-Wide 24:2, 201-220.

The communicative function of "sì" in Italian and "ja" in Swedish: an acoustic analysis
Loredana Cerrato
TMH-CTT, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden

Abstract
The results of an acoustic analysis and a perceptual evaluation of the role of prosody in spontaneously produced "ja" and "sì" in Swedish and Italian are reported and discussed in this paper. The hypothesis is that pitch contour, duration cues and relative intensity can be useful in the identification of the different communicative functions of these short expressions taken out of their context. The results of the perceptual tests run to verify whether the acoustic cues alone can be used to distinguish different functions of the same lexical items are encouraging only for Italian "sì", while for Swedish "ja" they show some confusions among the different categories.

Introduction
Short expressions such as "mh", "ah" and "yes" or "no" are widely produced during spontaneous conversation and seem to carry a variety of communicative functions. For instance, Gardner (2001) reports having recognized eight main types of "mm" in corpora of spoken English. Cerrato (2003), however, reports that one of the most common functions these short expressions carry out is that of feedback. Feedback can have different nuances of meaning; therefore the same expression can be produced in several contexts, to convey different communicative functions. It seems possible that the specific dialogue function of short utterances is reflected in their suprasegmental characteristics. The focus of this paper is on the role of prosody in signaling specific dialogue functions for "ja" in Swedish and "sì" in Italian (i.e. "yes"), which are frequently used in natural conversational interaction and which are essential for a smooth progression of the communication process. "Ja" and "sì" are used by dialogue participants to indicate that the current listener is following and willing to go on listening, or that he/she is following but willing to take the turn, or to signal that the listener has understood what has been said so far and is still paying attention to the dialogue, prompting the speaker to go on. Moreover, "ja" and "sì" can be used to answer yes-no questions. Investigation of the acoustic characteristics of the function of these short utterances is not only interesting from the phonetic point of view, but may also be useful in technical applications; for example, it may simplify the interpretation of ambiguous utterances produced by humans talking with computers.

Materials and Method
Four dialogues elicited with the map task technique have been used to study the realization of "sì" in Italian and "ja" in Swedish. A total of 48 instances of "sì" uttered by two speakers (a female and a male speaker from the area of Naples) were analysed in the two Italian dialogues¹. A total of 40 instances of "ja" uttered by two female speakers (from the area of Stockholm) were analysed in the two Swedish dialogues². All the items were first annotated in their context, then segmented from their context. A series of acoustic measurements was taken on the items in isolation to find out whether there are systematic differences in prosodic implementation among the different functions that each item can carry out. Annotation, segmentation and acoustic measurement were carried out with the help of WaveSurfer (Sjölander & Beskow 2000). Finally, a perceptual test was run to check whether listeners are able to identify the communicative functions carried out by the different items taken out of their context.

¹ The two Italian dialogues were recorded in a sound-treated room at the University of Naples; they are part of the Italian corpus called CLIPS. More information about the CLIPS corpus is available on the web page: http://www.cirass.unina.it/ingresso.htm
² The two Swedish dialogues are not part of a big corpus. They were recorded in a sound-treated room at Stockholm University for the special purpose of analysing pre-aspiration phenomena in Swedish stop consonants. More information on the Swedish map-task dialogues can be found in Helgason, P. Preaspiration in the Nordic Languages: Synchronic and Diachronic Aspects. Doctoral thesis, Stockholm University, 2002.
Analyses
A functional categorization of Italian "sì" and Swedish "ja" was first carried out using the labels reported in Table 1. The categorization was carried out by listening to the short expressions and considering the explicit function that they were carrying in the given context. Short expressions were categorized either as feedback (F) or as answers (RP). The difference between a positive feedback and an affirmative response is quite subtle; however, the criterion followed to assign the label of affirmative response (RP) was that of looking for a positive answer to a polar question. The function of feedback expressions can be further specified as: Continuer, when the interlocutor shows that s/he is willing and able to pay attention, perceive the message and continue the interaction, either by letting the other speaker talk (FCY: you go on) or by getting the turn (FCI: I go on); and Acceptance (FA), when the interlocutor acknowledges comprehension and acceptance of the message. The labels for feedback are part of a complete coding scheme developed to code feedback phenomena (Cerrato 2004).

Table 1. Labels used to code the communicative functions of Italian "sì" and Swedish "ja".

Function | Label | Comment
Feedback Continuation | FCI | I want to go on
Feedback Continuation | FCY | you go on
Feedback Acceptance | FA | understanding, agreement, acceptance
Answer Positive | RP | positive answer to a polar question

Acoustical measurements included pitch contour, relative intensity and duration. Pitch contour was analysed in terms of rising, flat and falling contours. A summary of the pitch contour characteristics for each category in the two languages is reported in Table 2.
Feedback expressions showing continuation of attention, perception and understanding, without showing the intention of taking the turn (FCY), are characterized, both in Italian and Swedish, by a rising pitch contour, which is a typical continuation contour. A raised F0 is considered a marker of non-assertiveness (Ohala 1983), and in fact the feedback category FCY signals continuation of attention, perception and understanding, but not acceptance or agreement, which is instead signaled by the category FA, in Swedish characterized by a rising F0 and in Italian by a falling F0. The falling contour is typical of assertiveness and categoricalness (Kohler 2004), which is quite consistent with the realization of Italian "sì" having a falling F0 even as a positive answer (RP). In Swedish, instead, the categories FA and RP are realized with a rising F0. A rising F0 can signal feedback continuation of contact, but also non-assertiveness and uncertainty; it is therefore typical of question intonation. Moreover, Kohler (2004) and House (2005), analysing German and Swedish material respectively, found that final rises can pragmatically signal intended social interaction and friendliness. This pragmatic-attitudinal explanation might be the key to understanding the different contours shown by Italian and Swedish for the categories FA and RP. The difference is also strictly linked to cultural difference, which depicts Italians as more assertive, categorical and self-confident in expressing their points of view (acceptance, agreement) and in giving their responses, while Swedes are described as oriented towards seeking consensus, not showing self-confidence and categoricalness, and hence as using a rising pitch contour, which denotes uncertainty, openness towards the addressee and friendliness.

Table 2. Summary of the acoustic characteristics of each category in the two languages.

Function label | Swedish | Italian
FCY | rising F0, lengthening | rising F0, lengthening
FCI | flat F0 (lengthening for speaker 1) | mainly falling F0, short duration
FA | rising F0 | falling F0
RP | rising and long F0 | falling F0 (in context); falling and short F0 (in isolation)

The intensity was measured relatively, within speaker, per category. The results show that FCY is produced with the lowest relative intensity by all the speakers, while RP, in isolation, is produced with the highest intensity.

The duration of all the occurrences of "sì" and "ja" was measured both from spectrograms and waveforms. Most of the items were produced in an utterance of their own, which made the segmentation easier: the onset of the word-initial segment was set at the appearance of energy, while the offset was marked at the disappearance of energy. In those cases where the short expressions were coarticulated with preceding or following items, the transitions were included in the segmentation and in the measurement of the duration.

Figure 1 plots the duration of Italian "sì" in milliseconds, per speaker and per category. When Italian "sì" is produced as FCY it has a longer duration than when it is produced with any other function, and this difference is significant. This is not surprising, since the typical pitch contour for FCY is rising and the typical phonological tonal contour is a continuation rise, usually coupled with a longer duration. The most evident difference in the Italian stimuli duration is between the "sì" as Continuer you go on (FCY) and as Continuer I go on (FCI). The "sì" FCI is usually produced at the beginning of a longer utterance, often in coarticulation with what follows; this "sì" is uttered very quickly since the speaker wishes to go on speaking. This might explain why in the Italian dialogues "sì" with the FCI function, which also signals the intention to get the turn, does not undergo the typical continuation lengthening; however, in the Italian dialogues there were very few "sì" with the FCI function.

Figure 1. Duration of Italian "sì" in milliseconds, per speaker and per category.

The finding that FA and RP items have the same pitch contour and the same duration is consistent with a previous phonological tonal analysis carried out on these materials (Cerrato & D'Imperio 2003), which showed that items classified as FA and RP have the same tonal contour. The results of the duration analysis for Swedish "ja" are plotted in Figure 2. When Swedish "ja" is produced with the FCI function by speaker 1, it has a longer duration than when it is produced with any other function. This might depend on the fact that this speaker, by uttering "ja" at the beginning of a longer utterance, differently from the Italian speakers, prolongs the short expression with a typical continuation rise to signal the intention to keep the floor and go on speaking. Apart from the lengthening phenomenon in FCI for speaker 1, there are no evident differences in the duration of Swedish "ja" across categories.

Figure 2. Duration of Swedish "ja" in milliseconds, per speaker and per category.
Perceptual test
The test consisted of two sub-tests, one with the Italian stimuli submitted to 8 Italian listeners and one with the Swedish stimuli submitted to 8 Swedish listeners. The test material consisted of 8 stimuli for each category, organized in two blocks of 34 stimuli for Italian and 34 stimuli for Swedish (the first two stimuli in each block being dummies). No manipulations were performed on the stimuli, in order to preserve their naturalness; however, for the categories RP and FCI there were not enough instances of stimuli per speaker, hence some of them were played twice. Before the experimental session the participants were given written instructions and were involved in a short training session to familiarise themselves with the task. During the experimental session, the stimuli were presented individually over headphones, in randomized order, and after each presentation the listener chose, on the answering sheet, one of the 4 available labels (reported in Table 1) for the function that they believed the stimulus carried out in the conversation.

The results, in the form of confusion matrices, for the Italian listeners judging the "sì" of the 2 Italian speakers are reported in Tables 4a and 4b. For the Italian stimuli all the recognition rates appear to be above chance level. In Italian, FA and RP are confused with each other. This might depend on the fact that they have similar acoustic characteristics, in particular similar pitch contour and duration, the only difference consisting in the higher intensity of the RP stimuli (+4 dB). FCY for Italian speaker 1 gets high recognition rates; this may be due to lengthening. Table 3 shows the average duration in milliseconds of Italian "sì" for the 4 functions.

Table 3. Average duration in milliseconds of Italian "sì" according to the 4 different functions.

Function | FCY | RP | FA | FCI
Duration (ms) | 275 | 260 | 232 | 209

The results for the Swedish listeners judging the "ja" of the 2 Swedish speakers are reported in Tables 5a and 5b. For the Swedish stimuli not all the recognition rates appear to be above chance level. RP is in fact not distinguished. This might depend on the fact that in Swedish positive answers are very seldom expressed by using only "ja", but rather by a sequence of "ja" and other words, such as ja visst, ja precis. FA and FCY are confused with each other, since they both show a rising F0. FCI for speaker 1 gets good recognition rates, probably because it is characterized by a flat F0 and lengthening.
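The confusion matrices reported below (Tables 4a-5b) are row-normalised percentages: each row corresponds to the annotated category of the stimuli and each column to the label chosen by the listeners. A sketch of the tallying, with invented responses (our illustration, not the scripts actually used), might look as follows:

from collections import Counter

CATEGORIES = ["FA", "FCI", "FCY", "RP"]

def confusion_matrix(responses):
    # responses: list of (stimulus category, label chosen by a listener)
    counts = Counter(responses)
    matrix = {}
    for true in CATEGORIES:
        row_total = sum(counts[(true, c)] for c in CATEGORIES)
        matrix[true] = {c: 100.0 * counts[(true, c)] / row_total
                        for c in CATEGORIES}
    return matrix

answers = [("FCY", "FCY"), ("FCY", "FCY"), ("FCY", "FA"),   # invented data
           ("RP", "FA"), ("RP", "RP"), ("FA", "RP"), ("FA", "FA"),
           ("FCI", "FCI"), ("FCI", "FCI"), ("FCI", "RP")]
for true, row in confusion_matrix(answers).items():
    print(true, {c: round(p) for c, p in row.items()})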
Table 4a. Confusion matrix for the identification test for Italian speaker 1.

 | FA | FCI | FCY | RP
FA | 48% | 8% | 2% | 42%
FCI | 16% | 59% | 5% | 20%
FCY | 2% | 5% | 90% | 3%
RP | 38% | 11% | 6% | 45%

Table 4b. Confusion matrix for the identification test for Italian speaker 2.

 | FA | FCI | FCY | RP
FA | 45% | 13% | 5% | 37%
FCI | 14% | 59% | 5% | 22%
FCY | 9% | 8% | 69% | 14%
RP | 37% | 14% | 11% | 38%

Table 5a. Confusion matrix for the identification test for Swedish speaker 1.

 | FA | FCI | FCY | RP
FA | 48% | 2% | 30% | 20%
FCI | 2% | 83% | 14% | 0%
FCY | 33% | 2% | 55% | 11%
RP | 38% | 0% | 38% | 23%

Table 5b. Confusion matrix for the identification test for Swedish speaker 2.

 | FA | FCI | FCY | RP
FA | 41% | 9% | 40% | 9%
FCI | 29% | 52% | 2% | 17%
FCY | 33% | 5% | 59% | 3%
RP | 23% | 31% | 19% | 27%

Conclusions
The aim of this study was to investigate the acoustic and perceptual characteristics of spontaneously produced "ja" and "sì" in Swedish and Italian, in order to find out whether acoustic cues can be used to render and to recognize different communicative functions. Even if the analysis was limited to a particular kind of communicative situation, namely map task dialogues, and to only 4 speakers, it is evident from the results that duration cues correlate with some dialogue functions; in particular, the most evident difference in duration is between the Continuer you go on (FCY) and all the other categories in Italian, while in Swedish the most evident duration difference is between the Continuer I go on (FCI) and the other categories.

Acknowledgements
A special thanks to my supervisor David House for inspiring discussions about the results of this study.

References
Cerrato L. & D'Imperio M. (2003) Duration and tonal characteristics of short expressions in Italian. In Proceedings of the ICPhS, Barcelona 2003.
Cerrato L. (2003) A comparative study of verbal feedback in Italian and Swedish map-task dialogues. In Proceedings of the Nordic Symposium on the Comparison of Spoken Languages, 2003, 99-126.
Cerrato L. (2004) A coding scheme for the annotation of feedback phenomena in conversational speech. LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces, Lisboa, 25-28.
Gardner R. (2001) When Listeners Talk. John Benjamins Publishing Company.
House D. (2005) Phrase-final rises as a prosodic feature in wh-questions in Swedish human-machine dialogues. Accepted in Speech Communication.
Kohler K.J. (2004) Pragmatic and attitudinal meanings of pitch patterns in German syntactically marked questions. In Fant G. et al. (eds) From Traditional Phonology to Modern Speech Processing, 205-214. Beijing Foreign Language Teaching and Research Press.
Ohala J.J. (1983) Cross-language use of pitch: an ethological view. Phonetica 40, 1-18.
Sjölander K. & Beskow J. (2000) WaveSurfer - an open source speech tool. Proceedings ICSLP 2000, Beijing, China.

Presenting in English and Swedish
Rebecca Hincks
Department of Speech, Music and Hearing, KTH, Stockholm
Unit for Language and Communication, KTH, Stockholm
Abstract
This paper reports on a comparison of prosodic variables from oral presentations in a first and second language. Five Swedish natives who speak English at the advanced-intermediate level were recorded as they made the same presentation twice, once in English and once in Swedish. Though it was expected that speakers would use more pitch variation when they spoke Swedish, three of the five speakers showed no significant difference between the two languages. All speakers spoke more quickly in Swedish, the mean being 20% faster.

Introduction
Two earlier contributions to the Annual Swedish Phonetics Conference have outlined ideas for a feedback mechanism for public speaking. Briefly, Hincks (2003) proposed that speech technology be used to support the practice of oral presentations. Speech recognition could give feedback on repeated segmental errors produced by non-natives as well as provide a transcript of the presentation, which could then be processed for lexical and syntactic appropriateness. Speech analysis could give feedback on the speaker's prosodic variability and speaking rate. Hincks (2004) presented an analysis of pitch variation in a corpus of second language student presentation speech. Pitch variation was measured as the standard deviation of F0 for 10-second-long segments of speech, normalized by dividing by the mean F0 for that segment. This value was termed PVQ, for pitch variation quotient. Hincks (forthcoming) reports on the results of a perception test of speaker liveliness, where a strong correlation (r = .83, n = 18, p < .01) was found between speaker PVQ and perceptions of liveliness from a panel of eight judges.

Though automatic feedback on the prosody of public speaking could be useful for both first and second language users, the above-mentioned studies were done on a corpus of L2 English, where native Swedish students of Technical English were recorded as they made oral presentations as part of their course requirements. The aim of the small study described in this paper was to gather data to shed light on the question of how individual speakers might differ in speaking characteristics when presenting in a first or second language. Other research has suggested that a narrowed pitch range is a characteristic of second language speech (Mennen 1998; Pickering 2004), at the same time as it has been shown that using pitch effectively is an important means of structuring instructional discourse. In situations such as exist in Sweden, where students are increasingly judged on tasks performed in a second language, it is of interest to know the extent to which that requirement constrains them. This paper investigates pitch variation levels and speaking rates in the English and Swedish versions of the same presentations. If speakers were found to use less pitch variation when speaking English than Swedish, then second language users could be seen as primary users of a system for encouraging more pitch variation. It was expected that speaking rates would be faster for Swedish than for English; this examination could quantify the differences.

Method
The goal for the data collection used in this paper was to obtain a corpus where the same speaker used both English and Swedish to make the same presentation, with the same visual material. Because class time could not be wasted on having students hear the same presentation twice, the Swedish recordings needed to be made outside the classroom. All students studying English at KTH in the fall of 2004 (nearly 100 students) were contacted and asked whether they would like to participate. They were told that they would first be recorded in the classroom as they made their presentations in English, and that they would then meet in groups and make the same presentations in Swedish to each other. They were offered 150 SEK as compensation for the extra time it would take. Unfortunately, only five students were able to participate. These five, three males and two females, were all intermediate students. They were first recorded in their English classroom, and then met at the end of term to be recorded in Swedish. The audience for the second recording consisted of the other four students, their English teacher, and me.
Four of the five students used computer-based visual support for their presentations, and were instructed to use their English slides for the Swedish presentation. This assured that the content of the presentations would be the same. One student, M3, did not use extensive visual support.

With WaveSurfer's (Sjölander and Beskow 2000) ESPS pitch extraction boundaries set at 65-325 Hz for male speakers and 110-550 Hz for female speakers, pitch extraction was performed for up to 10 minutes of speech for the five presentations in each language. All pitch contours were visually inspected for evidence of extraction errors and the locations of the errors noted. The F0 values were exported to a spreadsheet program, where the erroneous values were deleted, and the means and standard deviations of 10-second long segments were calculated. The standard deviation of each segment was divided by the mean of each segment to determine the PVQ, the pitch variation quotient.

Speaking rate was calculated by manually dividing the transcripts of the presentations into syllables and dividing by the total time spent speaking. Because pause time is included in the calculation, the values achieved are lower than what might otherwise be found in studies of spontaneous speech. Another temporal value of interest is the mean length of runs (MLR), which is the amount of speech, in syllables, a speaker utters between pauses. This measure has been found to correlate highly with language proficiency (Kormos and Dénes 2004). The minimum pause length was defined as 250 ms.
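The three measures just defined (PVQ, speaking rate, and mean length of runs) can be stated compactly in code. The following is a minimal sketch under the assumptions given above (10-second windows, pauses included in the total time, 250 ms pause threshold); it illustrates the definitions only and is not the analysis pipeline actually used, which relied on a spreadsheet program.

```python
import statistics

def pvq(f0_track, frame_rate_hz=100, window_s=10):
    """Pitch variation quotient: standard deviation of F0 over
    10-second segments, normalized by the mean F0 of each segment.
    Unvoiced or deleted frames are assumed to be 0/None and skipped."""
    n = int(frame_rate_hz * window_s)
    quotients = []
    for i in range(0, len(f0_track) - n + 1, n):
        voiced = [f for f in f0_track[i:i + n] if f]
        if len(voiced) > 1:
            quotients.append(statistics.stdev(voiced) / statistics.mean(voiced))
    return statistics.mean(quotients)

def speaking_rate(n_syllables, total_time_s):
    """Syllables per second over the total time spent speaking,
    pauses included (hence lower values than pause-free measures)."""
    return n_syllables / total_time_s

def mean_length_of_runs(run_lengths):
    """MLR: mean number of syllables uttered between pauses longer
    than 250 ms; run_lengths holds one syllable count per run."""
    return sum(run_lengths) / len(run_lengths)
```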
Results

Pitch variation quotients
The mean PVQs per speaker for the two presentations are shown in Figure 1. For three of the five speakers, there was very little difference in the PVQs when using English and when using Swedish. Only one speaker, M3, had significantly lower PVQ speaking English, but another, F1, had lower PVQ when speaking Swedish. Though there are only five speakers, the mean values reflect the same range as that found in the all-English corpus, with a low of about 0.11 and a high of about 0.24.

Figure 1. Mean pitch variation quotient for the whole presentation in both English and Swedish.

Temporal measures
The male speakers spoke for a shorter length of time when making the presentation in Swedish than when using English, as shown in Figure 2.

Figure 2. Length of time in seconds to make the presentation in English and Swedish.

Speaking rate
Part of the reason the speakers could make their presentations in a shorter period of time is that they spoke on average 20% more quickly. Figure 3 shows the speaking rate per speaker in syllables per second. The mean speaking rate in English was 2.97 sps, and for Swedish 3.58 sps. M3, the only student to use much more pitch variation in Swedish than in English, also spoke much more quickly in Swedish. Note also that the two females are more stable in their speaking rates, and that the fastest and slowest speakers in one language maintain their ranking in the other language.

Figure 3. Speaking rate in syllables per second for three males and two females in English and Swedish.

Mean length of runs
A variable found to be important in the perception of liveliness in female speech samples (Hincks forthcoming) is the number of syllables between pauses of >250 ms (MLR). Four of the five speakers had higher values for this measure when speaking Swedish (Figure 4). The exception was F1, the same speaker who used less pitch variation in Swedish.

Figure 4. Mean length of runs (number of syllables between >250 ms pauses) using English and Swedish.

Discussion
This study was performed on a small group of speakers, and any results should be interpreted with care. The students who participated were paid volunteers, and in that sense cannot be considered as representative of the population to the same extent as the speakers recorded for the larger, all-English corpus, where an attempt was made to gather data from every student in a class. It is reassuring, however, that the ranges of prosodic variables for these five speakers reflect nearly the same ranges as those of the first corpus.

Language or performance?
The result that three of five speakers showed no significant difference in PVQ depending on the language they were using would seem to indicate that PVQ measures are more speaker dependent than language dependent, at least for native speakers of Swedish. The hypothesis that the speakers would use less pitch variation when speaking English was not at all borne out by the study. It seems that the PVQ depends mostly on speaking style, and perhaps the energy one puts into 'performing' in a certain situation. The English presentation was a higher-stakes event, where students were speaking to more people and, most importantly, receiving a grade on their work. Speaker F1 performed very well at her first presentation, and with the high mean length of runs combined with higher-than-average mean PVQ, would probably have received high liveliness ratings had her speech been part of the perception test. It is interesting that she was the only student to have lower PVQ values and the only student to have lower MLR values in Swedish than in English. This could indicate that she in some way put less effort into performance in the Swedish presentation. Speaker M3, on the other hand, was either hampered by using English or relatively unprepared when making the first presentation. He could have benefited from rehearsing with a feedback mechanism beforehand.

For the purposes of a thesis grounded in computer-assisted language learning, these results throw a bit of a wrench in the works. The problems I am proposing to help with may not depend on the use of a second language, but on more basic features of speaking style. On the other hand, at advanced levels of language courses, it is difficult to separate the needs of first and second language users. Furthermore, many native speakers as well as non-natives obviously have problems achieving an engaging speaking style, and it has never been my intention to propose a device restricted to non-native use.

Further work
A small study is being planned to test the perception of liveliness in these speakers as they used the two languages. The corpus described in this paper could be augmented by a small number of speakers over the period of several terms and could provide a wealth of further opportunities for language study.
Comparison of the English and Swedish transcripts will allow examination of aspects such as how the speakers use pitch movement in utterances that are comparable content-wise. This could provide insight into the transfer of Swedish intonational patterns to English. It is possible that with more speakers, statistically significant differences in PVQ could still be found. The differences in mean speaking rate should also be further investigated; the 20% difference found in this group would be interesting to pursue. Does the average Swedish speaker of English manage to say only 80% of what a native speaker can say during the allotted time at a conference? Documenting such information about first and second language use would give valuable evidence to those in positions of developing language policy.

References
Hincks, R. (2003). Tutors, tools and assistants for the L2 user. Phonum 9: 173-176, Umeå University Department of Philosophy and Linguistics.
Hincks, R. (2004). Standard deviation of F0 in student monologue. Proceedings of Fonetik 2004, Stockholm, Department of Linguistics, Stockholm University.
Hincks, R. (forthcoming). Measures and perceptions of liveliness in student oral presentation speech: a proposal for an automatic feedback mechanism. Accepted for publication in System.
Kormos, J. and M. Dénes (2004). Exploring measures and perceptions of fluency in the speech of second language learners. System 32: 145-164.
Mennen, I. (1998). Can language learners ever acquire the intonation of a second language? Proceedings of STiLL 98, Marholmen, Sweden, KTH Department of Speech, Music and Hearing.
Pickering, L. (2004). The structure and function of intonational paragraphs in native and nonnative speaker instructional discourse. English for Specific Purposes 23: 19-43.
Sjölander, K. and J. Beskow (2000). WaveSurfer: An open source speech tool. Proceedings of ICSLP 2000, http://www.speech.kth.se/snack/.

Acknowledgements
My thanks to David House, the student speakers and especially to teacher Beyza Björkman, whose encouragement was important in getting five volunteers for this study. This work was funded by the Unit for Language and Communication.

Phonetic Aspects in Translation Studies

Dieter Huber
Department of General Linguistics and Culture Studies, Johannes Gutenberg Universität Mainz, Mainz/Germersheim, Germany

Abstract
Translation Studies cover a subfield of Applied Linguistics and are concerned with the scientific study of translation and interpreting in its various media and forms: oral vs. written, simultaneous vs. consecutive, literary vs. technical, human vs. machine, direct vs. relais, remote vs. in situ, etc. While linguistics in general has a long tradition of both theoretical and experimental research into various aspects of translation and interpreting, the phonetics and phonology of this specialized form of intercultural communication have, until very recently, attracted only little attention within the scientific community. The purpose of this paper is to summarize some recent findings of this research and to indicate potential directions for further studies into the phonetics and phonology of translation.
Translation
Translation¹, in the narrow sense, covers all forms of the transfer of meaning from a source language text into one or more target languages. Both written and oral texts are translated, as long as they are presented as a whole in a fixed, finished and thus permanent form. Clearly, the translation of written texts does not normally involve any choices at the phonetic or phonological level. However, as shown among others by Paz (1971), Lefevere (1975), Kohlmayer (1996) and Weber (1998), expressive texts including poetry, lyrics and drama, but also scripted speech, video narrations and advertisements need to be translated in view of their readability and their potential use in stage performance. The successful transfer of rhyme, rhythm, pausing patterns, alliteration, accentuation and word play, based on the segmental and/or suprasegmental qualities of lexical and phrasal units, will often be crucial to the usefulness and acceptability of the translated product. Even more importantly, translation for film dubbing and synchronization not only involves careful consideration of phonetic and phonological choices at the segmental and suprasegmental level, but also has to carefully monitor lip movements, voice quality, duration patterns, and their compatibility with the paralinguistics of the respective scene.

Interpreting
Interpreting, both in the simultaneous and in the consecutive mode, involves linguistic choices that have to be made by the interpreter at all levels of language processing at a time when the source language text, mostly oral speech, has yet to be completed. Contents and structure, topic and focus, verbal references, phrasal attachments, presuppositions and often even the actual goals and intentions of the speaker may, at the very extreme, be entirely unresolved at the time of the original utterance when the interpreter has to perform. On the other hand, empirical studies of the time constraints of simultaneous interpreting show that the décalage, i.e. the time delay between the source language input by the original speaker and the target language output by the interpreter, that is acceptable to normal listeners should not exceed two to three seconds². This double bind forces the professional interpreter at the very least to develop, in addition to his or her linguistic, mnemonic and anticipatory skills, a high degree of vocal and articulatory expertise in order to be able to continuously adjust to the speech rate properties and vocal demands of the actual situation. In addition, as shown among others by Goldmann-Eisler (1972), Černov (1978), Shlesinger (1994) and Ahrens (2004), professional
interpreters make systematic use of the prosodic properties of the source language input in order to derive complementary and/or compensatory information about the structural properties of the ongoing and yet incomplete utterance. Pauses, intonation contours, accentuation and final rise/final lengthening patterns have been studied systematically in authentic data by Ahrens (2004), who shows that prosody is not merely transferred or reconstructed but apparently used in a translation-specific way involving particular translation units and tonal rise-level patterns as situation-specific markers.

References
Ahrens B. (2004) Prosodie beim Simultandolmetschen. Frankfurt am Main: Peter Lang, Europäischer Verlag der Wissenschaften.
Černov G.V. (1978) Teorija i praktika sinchronnogo perevoda. Moskva: Meždunarodnie otnošenia.
Fodor I. (1976) Film Dubbing: Phonetic, Semiotic, Esthetic and Psychological Aspects. Hamburg: Buske.
Gile D. (1995) Regards sur la recherche en interprétation de conférence. Lille: Presses Universitaires.
Goldmann-Eisler F. (1972) Segmentation of input in simultaneous translation. Journal of Psycholinguistic Research 1/2, 127-140.
Huber D. (1990) Prosodic transfer in spoken language interpreting. Proceedings of the International Conference on Spoken Language Processing, ICSLP-90 (Kobe, Japan), 509-512.
Kohlmayer R. (1996) Oscar Wilde in Deutschland und Österreich. Untersuchungen zur Rezeption der Komödien und zur Theorie der Bühnenübersetzung. Tübingen: Niemeyer.
Lefevere A. (1975) Translating Poetry. Seven Strategies and a Blueprint. Amsterdam: Van Gorcum.
Manhart S. (1998) Synchronisation. In Snell-Hornby M. et al. (eds) Handbuch Translation, 264-266. Tübingen: Stauffenburg Verlag.
Paz O. (1971) Traducción: Literatura y Literalidad. Barcelona: Tusquets.
Shlesinger M. (1994) Intonation in the production and perception of simultaneous interpretation. In Lambert S. & Moser-Mercer B. (eds) Bridging the Gap, 225-236. Amsterdam: Benjamins.
Snell-Hornby M. (1998) Translationswissenschaftliche Grundlagen: Was heißt eigentlich "Übersetzen"? In Snell-Hornby M. et al. (eds) Handbuch Translation, 37-38. Tübingen: Stauffenburg Verlag.

Notes
1. Not to complicate matters, I neglect to include a lengthy discussion of the various uses and ambiguities of the term "translation" in this and other scientific disciplines such as physics, biogenetics, economics, theology, history (cf. translatio imperii) and others. Suffice to say that even within the restricted scope of translation studies per se, "translation" as a scientific term is used somewhat incoherently both in the narrow sense as translation proper (översättning, Übersetzung, traduction) and, in the wide sense, as the generic term to cover the whole field of translation and interpreting (tolkning, Dolmetschen, interpretation).
2. Compare Gile (1995) for an overview.

Scoring Children's Foreign Language Pronunciation

Linda Oppelstrup, Mats Blomberg, and Daniel Elenius
Speech, Music and Hearing, KTH, Stockholm

Abstract
Automatic speech recognition measures have been investigated as scores of segmental pronunciation quality. In an experiment, context-independent hidden Markov phone models were trained on native English and Swedish read child speech respectively. Among various studied scores, a likelihood ratio between the scores of forced alignment using English phoneme models and the score of English or Swedish phoneme recognition had the highest correlations to human judgments. The best measures have the power of evaluating the coarse proficiency level of a child but need to be improved for detailed diagnostics of individual utterances and phonemes.
Introduction
Automatic evaluation of foreign language pronunciation presents possibilities for computer-assisted language learning as well as for prediction of speech recognition performance in a non-native language. Although children constitute a very large portion of foreign language learners, speech technology research in this domain has previously been mainly focused on adults. The current work has been produced as part of the EU project PF-Star, in which one goal was to assess the current possibilities of speech technology for children.

Previous work has used the fact that the better you are at pronouncing the new language, the more the utterances should resemble sounds from the target language instead of the mother tongue (Eskenazi, 1996; Matsunaga, Ogawa, Yamaguchi and Imamura, 2003). However, the pronunciation quality of read speech does not depend solely on the ability to produce the phonemes correctly; it also depends on knowledge of how the words should be pronounced. Spectral quality and time-related scores have shown high correlation with human judgment (Neumeyer, Franco, Digalakis and Weintraub, 2000; Cucchiarini, Strik and Boves, 2000).

The foreign language considered in this report is English and the mother tongue is Swedish, and also Italian in some cases. The scoring procedure used context-independent phoneme models in a hidden Markov model (HMM) system. Although prosody is very important for pronunciation quality, this report is limited to segmental quality. More detailed results are given in Oppelstrup (2005).

Theory
An approximation in this work is that pronunciation quality is composed of two components, knowledge and ability. The knowledge component reflects the speaker's knowledge of the correct phonetic transcription of a written text. The ability component reflects the speaker's ability to pronounce the phonemes of the target language correctly.

The knowledge score, Sk, can be formulated as a confidence measure that the speaker has chosen the correct transcription, TrCorrect, in his spoken utterance (U) of the written text. This is modeled by:

  Sk = P(U | TrCorrect, λT) / max_i [P(U | Tr_i, λT)] = P(U | TrCorrect, λT) / P(U | TrBest, λT)   (1)

where λT is the set of target language phoneme models, trained by native reference speakers. In speech recognition terminology, the operation of the numerator is forced alignment and that of the denominator is phoneme recognition. This ratio has been used for pronunciation scoring by Cucchiarini et al. (2000).

The ability score is a measure of the acoustic quality of the speaker's realization of the phonemes in the target language. It is possible that some phonemes are pronounced as the most similar phoneme in the mother language. To score the ability of a speaker to produce the correct non-native sounds, we will try the ratio between the probabilities that the target language phonemes were used and that the mother tongue ones were used:

  Sa = P(U | TrCorrect, λT) / P(U | TrCorrect, λM)   (2)

if the correct phonetic transcription of the written text to be pronounced is known. λM is the set of mother language phoneme models. If the text or the transcription is unknown, we can use the ratio between the phoneme recognition scores using the two language models:

  Sa = P(U | TrBest, λT) / P(U | TrBest, λM)   (3)

A combined knowledge and ability score can be computed by multiplying Sk and Sa in Eqs. (1) and (3):

  Sc = P(U | TrCorrect, λT) / P(U | TrBest, λM)   (4)
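Because the recognizer's output likelihoods are logarithmic (as noted in the Experiments section below), the ratios in Eqs. (1), (3) and (4) reduce to subtractions. A minimal sketch of that arithmetic, assuming the log likelihoods from forced alignment and phoneme recognition with the target (English) and mother tongue (Swedish) model sets are already available; the function names are ours and anticipate the score labels introduced in the next section:

```python
def knowledge_score(ll_forced_eng, ll_phonrec_eng):
    """Eq. (1): log P(U|TrCorrect, lambda_T) - log P(U|TrBest, lambda_T),
    i.e. English forced alignment minus English phoneme recognition (EFA/EPR)."""
    return ll_forced_eng - ll_phonrec_eng

def ability_score(ll_phonrec_eng, ll_phonrec_swe):
    """Eq. (3): log P(U|TrBest, lambda_T) - log P(U|TrBest, lambda_M),
    i.e. English minus Swedish phoneme recognition (EPR/SPR)."""
    return ll_phonrec_eng - ll_phonrec_swe

def combined_score(ll_forced_eng, ll_phonrec_swe):
    """Eq. (4): log P(U|TrCorrect, lambda_T) - log P(U|TrBest, lambda_M),
    i.e. English forced alignment minus Swedish phoneme recognition (EFA/SPR)."""
    return ll_forced_eng - ll_phonrec_swe
```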
Implemented Scores
In this report we present results for the following pronunciation score parameters:
• Knowledge: English forced alignment / English phoneme recognition (EFA/EPR)
• Ability: English phoneme recognition / Swedish phoneme recognition (EPR/SPR)
• Combined: English forced alignment / Swedish phoneme recognition (EFA/SPR)
• Fraction language use (FRAC); see below
• Rate of speech (ROS)
• Utterance duration (DUR)

In FRAC, both language model sets are active in parallel and the score is the percentage of English language models selected by the recognizer.

Speech Data Bases
The speech material used in this report comes from five child speech databases. The utterances were separate words and short sentences chosen to give good coverage of all phonemes. All recordings were made with headset microphones. The sampling frequency and amplitude resolution were, when necessary, downsampled to 16 kHz and reduced to 16 bits, respectively. The material was split into sets for training, development and evaluation.

SVE1 consists of 50 Swedish children aged eight to fourteen years in the EU SpeeCon corpus (Iskra et al., 2002). Each child recorded 60 utterances on average.

SVE2 (Blomberg and Elenius, 2003) is part of the Swedish PF-STAR corpus of 198 Swedish children between four and eight years old. Only the children above eight were used. Each child recorded approximately 80 utterances.

The English material for PF-STAR was recorded in three different regions of England by the University of Birmingham. 158 children of the ages six to fourteen were recorded, but only those above eight were included. Each child recorded approximately 90 utterances. The database was used both for training the English phoneme models (ENG-tr) and for performance tests (ENG-te). A part of this material received an increased noise level in this experiment due to unintentional mixing of two channels from a headset and a desktop microphone.

The Italian database, ITEN, was recorded near Trento by ITC-irst. The part used here comprises 25 children, ten and eleven years old, reading 75 English prompts each.

The Swedish non-native PF-STAR material, SWENG, was recorded in Stockholm. Each of 40 children of both sexes, aged 10-11 years, read on average 64 English utterances, prompted on a computer screen. Most of the utterances were the same as in ITEN. If the child was uncertain of the pronunciation, he/she had the option to listen to a prerecorded pronunciation of that prompt. This option was used in about 15% of the recordings.

Experiments
Recognition performance tests and pronunciation evaluation experiments have been performed. Word recognition tests used the SVE and ENG development sets, both containing children of ages eight and nine only. The language model allowed any word to follow any other with equal probabilities. The word insertion penalty was experimentally set to minimize WER. In phoneme recognition tests the penalty was non-optimized and equal to zero.

The English and Swedish phoneme models were trained on ENG-tr and SVE1, respectively. The phoneme models consist of three states. The 39 elements of the acoustic feature vector are the 13 lowest mel frequency cepstral coefficients (MFCC), including number 0, and their first and second order time derivatives. The mel filterbank is computed with a 25 ms Hamming window at a frame rate of 10 ms. The output likelihood values of the recognizer are logarithmic, which turns the implementation of the ratio between scores into subtraction instead of division. The scores were measured in different ways, including or excluding non-speech intervals before and after the utterance and optional pauses between words. In this report, the presented scores include non-speech intervals, which generally performed better. Several other combinations of scores, models and normalization techniques have been studied by Oppelstrup (2005).

The pronunciation scores were correlated with human judgment of the utterances. The SWENG and ITEN speech files were scored by an English teacher with phonetic training. Segmental and prosodic qualities were judged separately. Each utterance was scored on a three-graded scale: 3 for correct pronunciation, 2 for small errors and 1 for erroneous utterances. To get a grade per child, the average of all graded sentences was calculated. At the time of the experiments, the ENG database had no human judgments, but was given the score 3, assuming that all English children had a correct pronunciation. Afterwards, judgments have been made also on the English children, and some results including these are given in this report.
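The evaluation step just described reduces to averaging the 1-3 grades per child and correlating the result with the automatic scores. A minimal sketch, assuming per-utterance grades and scores are available as plain lists; the helper names are ours, and the paper does not specify what statistics software was used:

```python
import statistics

def grade_per_child(utterance_grades):
    """Average of the 1-3 grades over all of a child's graded utterances."""
    return statistics.mean(utterance_grades)

def pearson_r(scores, grades):
    """Correlation between per-child automatic scores and human grades."""
    mx, my = statistics.mean(scores), statistics.mean(grades)
    cov = sum((x - mx) * (y - my) for x, y in zip(scores, grades))
    sx = sum((x - mx) ** 2 for x in scores) ** 0.5
    sy = sum((y - my) ** 2 for y in grades) ** 0.5
    return cov / (sx * sy)
```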
Results

Word and phoneme recognition
Results of the word and phoneme recognition experiments are shown in Tables 1 and 2, respectively. The error rates are generally quite high, which is not surprising considering the combined difficulties of child, non-native speech from different databases and a high-perplexity language model.

Table 1. Word error rates for the Swedish and English test sets.

Test     Voc. size  Training  WER %
SVE2     1051       SVE1      94
ENG-te   1097       ENG-tr    54
SWENG    617        ENG-tr    79
ITEN     629        ENG-tr    85

Table 2. Phoneme error rates (PER) in percent for different training and test set combinations.

            Test
Training    SVE2  ENG-te  SWENG  ITEN
ENG-tr      97    66      103    119
SVE1        72    92      86     93

Pronunciation scoring
Correlation with human judgment was low for single utterances but increased when averaging the scores of all utterances by a child. Table 3 lists correlations between automatic scores and human judgment of segmental quality per child. The correlation in the individual languages is generally low and is increased for the combined group of Swedish and Italian children. Still higher correlation is achieved when also including the native English children. The increase can be due to a larger range of pronunciation quality among the speakers. The knowledge score EFA/EPR, the ability score EPR/SPR, and the combined score EFA/SPR all got high correlations in this case. EFA/EPR and EPR/SPR are shown in Figure 1. The effect of an increased correlation when including native English speakers can also be seen in FRAC.

Table 3. Correlation of pronunciation scores with human judgment for various test sets. S = SWENG, I = ITEN, E = ENG-te. English children are given grade 3, except for values after '/', which are based on human judgment of a part of these.

Score     S      I      S+I    S+I+E
EFA/EPR   0.20   0.61   0.65   0.70/0.75
EPR/SPR   0.18   0.26   0.56   0.80/0.68
EFA/SPR   0.20   0.37   0.60   0.82/0.72
FRAC      0.22   0.09   0.48   0.82/0.71
ROS       0.18   -0.25  0.43   0.57/0.42
DUR       -0.11  -0.12  -0.54  -0.47/-0.47

Figure 1. Automatic vs human judgment of Swedish, Italian and English children for EFA/EPR (top) and EPR/SPR (bottom).
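The word and phoneme error rates in Tables 1 and 2 are standard measures based on the Levenshtein distance between recognized and reference symbol strings; note that they can exceed 100% when insertions are frequent, as for PER on SWENG and ITEN. A minimal sketch of the metric under that standard definition, not the evaluation code actually used in the project:

```python
def error_rate_percent(reference, hypothesis):
    """WER/PER: Levenshtein distance (substitutions + deletions +
    insertions) divided by the reference length, in percent."""
    nr, nh = len(reference), len(hypothesis)
    d = [[0] * (nh + 1) for _ in range(nr + 1)]
    for i in range(nr + 1):
        d[i][0] = i
    for j in range(nh + 1):
        d[0][j] = j
    for i in range(1, nr + 1):
        for j in range(1, nh + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[nr][nh] / nr

# Illustrative phoneme strings, not data from the study:
print(error_rate_percent(list("skata"), list("stada")))
```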
Discussion
The low accuracy of word and phoneme recognition even for native English children indicates that there is low discrimination between the phoneme models. Child speech recognition is very difficult in itself, and the small size of the training material allowed only context-independent phoneme models to be trained. Another difficulty was the varying recording conditions in the databases. These problems make the pronunciation evaluation uncertain. Another fact that may lower the correlation with human judgment is that the human listener and the scoring algorithms have different targets for correct pronunciation. Whereas the human listener is likely to compare with neutral British English, the reference models for the system are trained on children with different regional accents.

As was expected from previous studies, correlation increased with the amount of available data. Correlation for single utterances was lower than for averages of all utterances from a speaker. The correlation within the Swedish children was quite small. An explanation may be that the scoring algorithms are not sensitive to the limited pronunciation variation in this group. The correlation among the Italians is larger, and the highest overall correlation is achieved when including children from all the Italian, Swedish and English sets. It is interesting to note that the Swedish phoneme models seemed to work equally well as native models for Italian as for Swedish children.

A separate procedure will probably be necessary to reject utterances that match poorly to both numerator and denominator models in the likelihood ratios, since the likelihood ratio of these utterances will be quite random.

Conclusion
The best scoring functions correlate well enough with human judgments to allow coarse grading of a child's pronunciation quality. The context-independent models used are too insensitive, however, to allow scoring on the utterance or phoneme level. The most important improvement would be to use context-dependent phoneme models, trained on a large corpus with recordings of children with correct pronunciation and accent. Another possibility is to replace phoneme recognition in the scoring algorithms by explicit modeling of predicted erroneous pronunciations. Better knowledge of the differences between the phoneme sets of the mother tongue and target language could also help in giving more weight to the difficult phonemes in the target language.

Acknowledgement
This work was conducted as a Master of Science thesis at TMH, KTH, Stockholm, within the EU project PF-STAR, Preparing Future Multisensorial Interaction Research. The human pronunciation judgments were performed by Becky Hincks.

References
Blomberg, M. and Elenius, D. Collection and recognition of children's speech in the PF-Star project. Phonum 9, Dept. of Philosophy and Linguistics, Umeå University, 2003.
Cucchiarini, C., Strik, H. and Boves, L. Different aspects of expert pronunciation quality ratings and their relation to scores produced by recognition algorithms. Speech Communication, Vol 31, pp 109-119, 2000.
Eskenazi, M. Detection of foreign speakers' pronunciation errors for second language learning – preliminary results. Proc. of ICSLP '96, vol 3, 1996.
Iskra, D., Grosskopf, B., Marasek, K., van der Heuvel, H., Diehl, F. and Kissling, A. Speecon – speech databases for consumer devices: Database specification and validation. Proc. of ICSLP '02, 2002.
Matsunaga, S., Ogawa, A., Yamaguchi, Y. and Imamura, A. Non-native English speech recognition using bilingual English lexicon and acoustic models. Proc. of ICME '03, pp 625-628, 2003.
Neumeyer, L., Franco, H., Digalakis, V. and Weintraub, M. Automatic scoring of pronunciation quality. Speech Communication, Vol 30, pp 83-93, 2000.
Oppelstrup, L. Speech Recognition used for Scoring of Children's Pronunciation of a Foreign Language. M.Sc. Thesis, TMH/KTH, Stockholm, 2005.
On Linguistic and Interactive Aspects of Infant-Adult Communication in a Pathological Perspective

Ulla Bjursäter, Francisco Lacerda and Ulla Sundberg
Department of Linguistics, Stockholm University

Abstract
This is a preliminary report of a study of some linguistic and interactive aspects available in an adult-child dyad where the child is partially hearing impaired, during the ages 8-20 months. The investigation involves a male child, born with Hemifacial Microsomia. Audio and video recordings are used to collect data on child vocalization and parent-child interaction. Eye-tracking is used to measure eye movements when the child is presented with audio-visual stimuli. SECDI forms are applied to observe the development of the child's lexical production. Preliminary analyses indicate increased overall parental interactive behaviour. As babbling is somewhat delayed due to physical limitations, sign-supported Swedish is used to facilitate communication and language development. Further collection and analysis of data is in progress in search of valuable information on linguistic development from a pathological perspective of language acquisition.

Introduction
The typical linguistic development during infancy can be regarded as the result of the interaction between biological and environmental factors that leads to the child's language converging with the surrounding language. According to the Ecological Theory of Language Acquisition (Lacerda et al., 2004a), early language acquisition is an emergent consequence of multi-sensorial embodiment of the information available in ecological adult-infant interaction settings. In agreement with this theory, the basic linguistic referential function emerges from at least two of the sensory dimensions available in the speech interaction scene (Lacerda, 2003; Lacerda, Gustavsson & Svärd, 2003). If there are restraining biological conditions or a lack of adequate interaction with the environment, the child's linguistic development generally will deviate from the expected age-dependent competence of communication. Under typical circumstances, a one-year-old child starts to use adult-like word forms. By two years of age, the child has developed a larger vocabulary and starts to use two-word sentences or more in communication with its environment. If the auditory channel of information is disturbed, the integration of stimuli input is disturbed, which can result in a linguistic disturbance that also can affect the ability to produce comprehensible speech (Öster, 2002).

Humans seem to have a propensity to integrate the synchronous audio-visual stimuli that are accessible in a communicative situation (Bahrick, 2004). For example, when adults are speaking to infants, they tend to repeat representations of target words as denominations of whatever object the child is focusing on at the moment. Characteristic of this kind of interaction is that the adult pronounces several sentences containing the target word, often in final position, while following the infant's gaze. Target words that are pronounced in isolation in a repetitive way have a significant positive effect on the first stages of the development of vocabulary (Brent & Siskind, 2001). In a perspective of language development, adults' behaviour can be regarded as an efficient way of producing a correlation between the words and sentences and the object on which the infant is focusing. An implicit meaning for the target word may arise as a result of automatic association between the sensory representations that show the highest correlation (Lacerda, 2003). The learning mechanism that builds on associations of different sensory impressions is most relevant for the learning of the first words, at the early stages of linguistic development.

In a natural speech communication situation, competent speakers and listeners rapidly achieve an effective level of information exchange by adjusting to each other's communication needs (Lacerda et al., 2004b). Infants generally learn to use babbling in a communicative way very early in life. When the communication channels are defective in some way, the manner of communication necessarily changes. As the ambient language of infants very commonly is dominated by IDS, Infant Directed Speech (Fernald et al., 1989), this is one of the means of communication the parents of a child with a congenital perception and/or production handicap have to adjust in order to enhance the child's linguistic development.

This study aims at examining parent-child interaction when the child has some perception and production disabilities. How does the parent modulate their own and the child's behaviour to enhance interaction and the child's linguistic development? In order to investigate how representations of early words may develop in the disabled human infant, analyses will be made of the mother's linguistic structure, timing and turn-taking, her repetitions, and her strategy in adjusting to the infant's focus of attention. The infant's vocal productions will be studied in order to observe the progress of the child's verbal development.
Method
A Swedish mother is recorded monthly while spontaneously interacting with her child. On two occasions the father participated in the recording, substituting for the mother.

Subject
The subject is a Swedish male infant, followed from the age of 8 months to 20 months, together with his mother and father. The child was born with Hemifacial Microsomia, i.e. born without the left outer and middle ear and with no zygomatic or mandible bone structure on the left side of the face. He also has a slightly cleft soft palate and a split uvula. The child was fed by sub-glottal probe until seven weeks of age and by nasal probe up to 8 months of age. The boy has one older sister.

Recording sessions
Recording sessions take place in a laboratory at the Department of Linguistics, Stockholm University. The mother receives a selection of toys, with verbal instructions indicating the significance of using onomatopoetic sounds when appropriate.

Procedure
A digital video camera, Panasonic NV-DS11, focusing on the boy and his parent, was used. Both parent and child were recorded by a Fostex CR200 Compact Disc Recorder, with wireless microphones, Sennheiser Microport Transmitters, attached to their clothes, connected to a Sennheiser Microport Receiver EMI1005. Audio-visual perception is studied by Tobii (www.tobii.com), an eye-tracking system that measures the child's eye movements when presented with different auditory representations. Some stimuli are based on the Peabody Picture Vocabulary Test, PPVT (Dunn & Dunn, 1981), adapted to the Tobii system. Detailed eye-tracking is used to evaluate the child's integration of audio-visual linguistic information. A SECDI form (Eriksson & Berglund, 1999), a Swedish version of the CDI form (MacArthur Communicative Development Inventory, Fenson et al., 1993), will be administered every six months to observe the development of the child's lexical production in words and gestures. The speech material will be segmented, labelled and transcribed orthographically in WaveSurfer (www.speech.kth.se/wavesurfer). The parent-child linguistic and gestural interaction will be annotated in Anvil (www.dfki.de/~kipp/anvil) for further analysis.

Result and Discussion
The data is currently being analyzed, but there are preliminary indications of increased parental interactive behaviour. Initial analyses indicate that, as a consequence of the child's handicap, the mother seems to enhance her manner of communication in order to keep the interaction active. Mother-child eye contact is frequent and expanded, and turn-taking is strongly encouraged by the mother in their interaction.
The mother tends to repeatedly verbalize every representation of a target word that is currently under the attention of them both, using the target word in different settings combined with various physical actions. In interacting with a child, parents often make use of specific body language, with frequent and intense eye contact, exaggerated facial expressions, head nods and shakes, pointing, and concrete physical contact. All of these devices are used in her communication with her child.

The child has some difficulties in his verbal production. Before the feeding probe was removed and he could start eating proper food, he just played with his voice with high vowel-like sounds, as reported by his parents. When the boy was eight months of age, the probe was taken out. General babbling then started, with CV productions. The consonantal sounds produced were, and still are, mostly uvular and velar, like /gV/. Lack of supporting bone structure on the left side of the face affects the motor movements of the tongue, and there seems to be a general weakness in the mobility of the affected side of the face. The left side of the tongue tends to "slip down" into the cavity of the missing mandible bone structure. As a consequence of the soft palate cleft and split uvula, he has some problems controlling consonant sounds. In situations of imitative interaction with his mother, alveolar and bilabial speech sounds are produced more at random than by will. Speech production is also somewhat impaired by a short ligament of the tongue, which will be surgically corrected. This will hopefully help the child in his articulation of single words. At present, his tongue is comparatively immobile and he has difficulties forming any kind of consonantal speech sound, especially alveolars, as he has trouble raising the tip of his tongue to the roof of his mouth. The only proper word he pronounces understandably at the age of 20 months is "mamma", and articulation of the word "pappa" is not yet feasible. As the boy grows, his need to make himself understood increases. He makes use of prosodic cues in communicating with his family; by using intonation he tries to convey his intentions in his "utterances", as when he is calling for his sister or protesting about something.
As his verbal means of communication is impaired and delayed, he often gets impatient and frustrated when failing to make himself understood. The parents have recently introduced sign-supported Swedish, which diminishes some of the boy's frustration. With sign-supported communication, speech is always used in parallel with the signs, which facilitates comprehension of the message and promotes speech development. The boy can now make himself more readily understood, and is able to convey some of his basic needs to his parents.

Conclusion
As a consequence of the child's congenital physical handicap, the mother's interactive behavior seemed to increase. The child's verbal production is impaired but steadily improving. Passive verbal language seems to be present, and an active form of verbal language with well-articulated words will probably come in time as the physical impediments are attended to. Further collection and analysis of data will hopefully give valuable information on linguistic development from a pathological perspective of language acquisition.

References
Anvil: www.dfki.de/~kipp/anvil
Bahrick, L.E. (2004). The development of perception in a multimodal environment. In Bremner, G. & Slater, A. (eds.), Theories of Infant Development (1st ed., pp. 90-120). Oxford: Blackwell.
Brent, M.R. & Siskind, J.M. (2001). The role of exposure to isolated words in early vocabulary development. Cognition, 81, B33-B44.
Dunn, L.M. & Dunn, L.M. (1981). Peabody Picture Vocabulary Test – Revised. American Guidance Service. Circle Pines, Minnesota.
Eriksson, M. & Berglund, E. (1999). Swedish early communicative development inventories: words and gestures. First Language, 19, 55-90.
Fernald, A., Taeschner, T., Dunn, J., Papousek, M., de Boysson-Bardies, B. & Fukui, I. (1989). A cross-language study of prosodic modifications in mothers' and fathers' speech to preverbal infants. Journal of Child Language, 16, 477-501.
Fenson, L., Dale, P.S., Reznick, J.S., Thal, D., Bates, E., Hartung, J., Pethick, S. & Reilly, J. (1993). The MacArthur Communicative Development Inventories: User's guide and technical manual. San Diego, CA: Singular.
Lacerda, F. (2003). Phonology: An emergent consequence of memory constraints and sensory input. Reading and Writing: An Interdisciplinary Journal, 16, 41-59.
Lacerda, F., Gustavsson, L. & Svärd, N. (2003). Implicit linguistic structure in connected speech. PHONUM 9, Umeå, Sweden, 69-72.
Lacerda, F., Klintfors, E., Gustavsson, L., Lagerkvist, L., Marklund, E. & Sundberg, U. (2004a). Ecological Theory of Language Acquisition. In Berthouze, L., Kozima, H., Prince, C., Sandini, G., Stojanov, G., Metta, G. & Balkenius, C. (eds.), Proceedings of the Fourth International Workshop on Epigenetic Robotics (EPIROB 2004), Lund University Cognitive Studies, 117, 147-148.
Lacerda, F., Marklund, E., Lagerkvist, L., Gustavsson, L., Klintfors, E. & Sundberg, U. (2004b). On the linguistic implications of context-bound adult-infant interactions. In Berthouze, L., Kozima, H., Prince, C., Sandini, G., Stojanov, G., Metta, G. & Balkenius, C. (eds.), Proceedings of the Fourth International Workshop on Epigenetic Robotics (EPIROB 2004), Lund University Cognitive Studies, 117, 149-150.
Tobii: www.tobii.com
WaveSurfer: www.speech.kth.se/wavesurfer
Öster, A-M.
(2002). The relationship between residual hearing and speech intelligibility – Is there a measure that could predict a prelingually profoundly deaf child's possibility to develop intelligible speech? TMH-QPSR, KTH, Vol 43, 51-56.

Durational patterns produced by Swedish and American 18- and 24-month-olds: implications for the acquisition of the quantity contrast

Lina Bonsdroff and Olle Engstrand
Department of Linguistics, Stockholm University

Abstract
On the basis of previous, small-scale analyses, it was hypothesized that most Swedish children develop an adult-like command of quantity-related durational patterns between 18 and 24 months of age. In this study, VC structures produced in stressed position by several Swedish 18- and 24-month-olds were analyzed for durational correlates of the complementary V:C and VC: quantity patterns. Durations were typically reminiscent of the adult norm, suggesting that, at 18 months of age, Swedish children have acquired a basic command of the durational correlates of the quantity contrast. In consequence, quantity development must start well ahead of that age. It was also found that voicing had a considerable, adult-like effect on the duration of postvocalic consonants at both ages. This effect was smaller in the American controls, again indicating the presence of a language-specific phonetic pattern. The effect of voicing on preconsonantal vowel duration was relatively moderate. The effect was also present in the American 24-monthers, but less substantially than commonly observed in adults' speech. No significant voicing-induced vowel lengthening effect was found in the American 18-monthers.

Introduction
In Central Standard Swedish, the quantity contrast has a complementary durational manifestation such that the rhymes of all stressed syllables are either V:(C) or VC:(C). Swedish quantity also entails spectral (i.e., vowel quality) differences. It can be expected that heavy exposure to these rather robust patterns will lead to relatively early ambient language effects in children's speech production. In Engstrand and Bonsdroff (2004), durational data were reported on disyllabic words produced by three Swedish children aged 30 (two children) and 24 months, respectively. Those limited results suggested adult-like durational patterns irrespective of age. This encouraged the hypothesis that the quantity contrast may be in place in many 24-monthers or even earlier. In particular, it is possible that it is during the interval between 18 and 24 months of age that most children develop an adult-like command of quantity-related durational patterns. The present study was carried out to test this assumption using a larger group of subjects at 18 and 24 months of age. For control purposes, the Swedish data were compared to measurements of similar utterances produced by a comparable group of children reared in an American English speaking environment.

Methods
Subjects were drawn from a larger group of Swedish and American English children at the ages of 6, 12, 18, 24 and 30 months. Audio and video recordings were made as described in Engstrand et al. (2003). These recordings were subsequently digitized and stored on DVD disks. The present study is based on disyllabic words produced by
• 11 Swedish 18-monthers
• 11 Swedish 24-monthers
• 14 American 18-monthers
• 13 American 24-monthers.
All measurable disyllabic words produced by these children were analyzed with respect to the durational aspect of quantity. Thus, measurements were made of the duration of vowels and consonants pertaining to the rhymes of word-initial stressed syllables. Cries, screams and whispers were left out, as well as utterances contaminated by environmental noise. In some cases, video sequences or parents' responses were used to interpret word meanings. Segmentation of vowels and consonants was made from spectrograms according to conventional acoustic criteria.
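The ratio statistics reported in the next section follow directly from these segment measurements. A minimal sketch of the per-child computation, with illustrative durations rather than data from the study:

```python
import statistics

def ratio_stats(rhyme_tokens):
    """Mean, std and count of vowel-to-consonant duration ratios for
    one child; each token is (vowel_ms, consonant_ms) measured in the
    rhyme of a word-initial stressed syllable."""
    ratios = [v / c for v, c in rhyme_tokens]
    return statistics.mean(ratios), statistics.stdev(ratios), len(ratios)

# Illustrative V:C tokens (durations in ms, not data from the study):
print(ratio_stats([(266, 126), (250, 140), (280, 120)]))
```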
Results
In this section, durational tendencies are reported; a full statistical treatment will be presented elsewhere.

Mean vowel-to-consonant durational ratios pertaining to the Swedish 18- and 24-monthers are summarized in table 1 for the long vowels (the V:C pattern) and in table 2 for the short vowels (the VC: pattern). For example, table 1 shows that the V:/C average ratio was 1.86 for the 18-month-olds, and that this average was based on a total of 43 productions. Eight of the 11 18-month-olds produced measurable instances of the V:/C pattern. One child turned out to be a far outlier and was left out. The inclusion of 7 out of 11 child averages is marked as 7(11) in the right column of the table. The mean value for the 24-month-olds was similar, 1.96, but this mean was based on more (124) productions. This means that, on average, the long vowels had almost twice the duration of the following consonants. For both age groups, in other words, the durational relationships are not far from what can be observed in adult speech (e.g., Elert 1964).

Table 1. Mean vowel-to-consonant ratios for disyllabic words with long vowels produced by Swedish 18- and 24-month-olds.

Swedish, long vowels
Age (mos.)  Ratio  Std.  # wds  # children
18          1.86   0.62  43     7(11)
24          1.96   0.46  124    9(11)

Table 2. Mean vowel-to-consonant ratios for disyllabic words with short vowels produced by Swedish 18- and 24-month-olds.

Swedish, short vowels
Age (mos.)  Ratio  Std.  # wds  # children
18          0.85   0.36  147    11(11)
24          0.89   0.19  310    10(11)

The similarity between the age groups is equally striking for the VC: pattern. The V/C: ratios are 0.85 and 0.89 for the 18- and 24-monthers, respectively. On average, then, vowel durations amounted to 85-90 percent of the consonant durations. This is, again, reminiscent of what can be observed in adult productions. Also note that there were almost 3 times as many instances of short as of long vowels in this speech material.

Graphical displays of the above ratios, also indicating average consonant and vowel durations, are shown in figure 1. The graph depicts consonant against vowel durations for the 24-month-olds who produced the respective quantity patterns. Each symbol represents the average pertaining to one single child. The filled circles refer to the V:C structures, and the unfilled circles refer to the VC: structures. It can be seen that the two data sets are well separated, and that vowel and consonant durations tend to be inversely correlated.

Figure 1. Consonant by vowel durations for the Swedish 24-month-olds. Filled circles: the V:C pattern; unfilled circles: the VC: pattern.

The corresponding data for the 18-month-olds are presented in figure 2.

Figure 2. Consonant by vowel durations for the Swedish 18-month-olds. Filled circles: the V:C pattern; unfilled circles: the VC: pattern.
Again, filled and unfilled circles stand for V:C and VC: structures, respectively. Here too, the data sets are fairly well separated, with marginal overlaps only.

The effect of consonant voicing on the Swedish V/C ratios is shown in table 3. For the 18-month-olds, long vowels are, on average, about twice as long as the following voiced consonant (ratio = 2.08). When the postvocalic consonant is voiceless, the vowel and consonant durations are about equal (ratio = 1.18). For the short vowels, the corresponding ratios are 1.28 and 0.51, respectively. Thus, the effect of voicing points in the same direction in the long and short vowel conditions. The 24-month-olds evidence similar effects of voicing conditions and, again, these patterns are reminiscent of what can be found in adults' speech.

Table 3. Mean vowel-to-consonant ratios for disyllabic words containing long or short vowels followed by voiced or voiceless consonants. Subjects: Swedish 18- and 24-month-olds.

Age (mos.)  Vowel length  Cons. type  Ratio  Std.  N
18          Long          Vd          2.08   0.67  30
                          Vl          1.18   0.54  13
            Short         Vd          1.28   0.65  75
                          Vl          0.51   0.18  72
24          Long          Vd          2.29   0.62  65
                          Vl          1.53   0.59  50
            Short         Vd          1.11   0.16  103
                          Vl          0.70   0.31  105

English differs from Swedish in that vowel quality differences are more prominent than durational differences. In English, however, consonant voicing is known to have a marked effect on preceding vowel durations (e.g., House and Fairbanks 1953). The data presented in table 4 are compatible with the hypothesis that this durational relationship is acquired by American 18- and 24-monthers. Thus, table 4 shows that the average V/C ratios are 1.70 and 1.23 for the 18-monthers' voiced and voiceless consonant conditions, respectively; the corresponding ratios for the 24-monthers are 2.01 and 1.31, respectively. Thus, the voicing effect appears to be more prominent in the older children.

Table 4. Mean vowel-to-consonant ratios for disyllabic words containing voiced or voiceless postvocalic consonants. Subjects: American 18- and 24-month-olds.

Age (mos.)  Cons. type  Ratio  Std.  N
18          Voiced      1.70   0.68  83
            Voiceless   1.23   0.37  51
24          Voiced      2.01   0.91  128
            Voiceless   1.31   0.79  77

The voicing-induced effects on V/C ratios do not contrast grossly between the two language groups. However, the underlying durational patterns could be assumed to differ. In Swedish, voicing is known to have a considerable effect on consonant duration, as noted above. Thus, voiceless postvocalic consonants exhibit substantially greater durations than do voiced consonants in the same position. Durational effects on preceding vowels are, however, moderate (Elert 1964). This contrasts with the English pattern where, as mentioned, vowel durations are affected more strongly. Thus, given an ambient language influence, these contrasting patterns would also be expected in the present Swedish and American 18- and 24-month-olds.

This prediction is partly borne out by the durational data shown in tables 5 and 6. The data represent averages of all instances of voiced and voiceless stops pooled over the respective language and age groups. Stop consonants were chosen here to avoid irrelevant manner effects. The figures shown in table 5 display a tendency for consonant durations to increase as an effect of voicelessness in the Swedish children of both ages. This is particularly notable in the VC: patterns. For the 18-monthers, for example, the mean duration is 161 ms in the voiced condition and 282 ms in the voiceless condition. Also note that the consonant duration effect tends to be accompanied by an opposite but smaller vowel duration effect. In the American data, the voicing effect on stop consonant durations tends to be slighter. Thus, voiced and voiceless stop durations are 116 and 152 ms, respectively, for the 18-monthers and 115 and 143 ms, respectively, for the 24-monthers. The 24-monthers also display an expected vowel duration increase prior to voiced consonants (199 and 161 ms for the voiced and voiceless conditions, respectively). However, this effect is relatively modest compared to what is frequently found in adult speech. For the 18-month-olds, the average vowel duration runs counter to the expectation, being even greater prior to voiceless as compared to voiced consonants. This result, however, is clearly statistically non-significant.

Table 5. Mean duration values (ms) for long and short vowels and the following voiced or voiceless stop consonants as produced by Swedish 18- and 24-month-olds.

                 V:C voiced      V:C voiceless   VC: voiced      VC: voiceless
Age              V:    C    N    V:    C    N    V     C:   N    V     C:   N
18   Mn          266   126  14   253   171  9    202   161  36   119   282  32
     Std         83    68        77    61        40    74        47    112
24   Mn          245   117  16   289   192  6    145   200  51   175   266  65
     Std         73    29        83    73        55    54        70    140

Table 6. Mean duration values (ms) for vowels and the following voiced or voiceless stop consonants as produced by American 18- and 24-month-olds.

                 Voiced          Voiceless
Age (mos.)       V     C    N    V     C    N
18   Mean        153   116  22   166   152  17
     Std         35    37        83    51
24   Mean        199   115  25   161   143  28
     Std         61    34        46    49

Summary and conclusions
In general, durations produced by the present Swedish 18- and 24-month-olds tended toward values found in adults' speech. Thus, the data suggest that these children have acquired a basic command of the durational correlates of the Swedish quantity contrast, including the complementarity governing internal segment durations in V:C and VC: structures.
It was also found that consonant voicing had an adult-like effect on the duration of postvocalic consonants at both ages. Considering the smaller effect in the American data, this, again, suggested a near-completed acquisition of a language-specific phonetic pattern. The effect of voicing on preconsonantal vowel duration was relatively moderate, as expected. The same effect was present in the American 24-monthers, but was not as dramatic as commonly observed in adult English speakers. However, the American 18monthers displayed the opposite tendency, vowels being longer before voiceless than be62 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University /r/ realizations by Swedish two-year-olds: preliminary observations Petra Eklund, Olle Engstrand, Kerstin Gustafsson, Ekaterina Ivachova and Åsa Karlsson1 Department of Linguistics, Stockholm University of the first attempts to produce /r/ sounds is scarce. It is expected, though, that, at 2 years of age, most children will display an incipient influence of the ambient dialect. This seems to be the case with, e.g., comparable north German children who have been shown to produce glottal ([h]-like) and velar ([x]-like) /r/ substitutions, i.e., ‘back’ articulations in rough agreement with the regional adult norm (Fox and Dodd 1999). In particular, 2-year-olds reared in a central Swedish linguistic environment would be expected to produce predominantly coronal /r/ approximations with variable manners of articulation. On the other hand, informal observations suggest that a [j]-like palatal approximant may be a frequent /r/ substitution in many young children. Importantly, however, a large amount of inter-subject variability can be expected (cf. Vihman 1993). In this paper, some of these expectations are evaluated on the basis of observations on /r/ approximations produced by Swedish 2-year-olds. The choice of /r/ approximation may also be determined by phonetic context as has been shown to be the cases in adults’ speech (cf. Muminovic and Engstrand 1991). However, this aspect of children’s /r/ realizations will not be discussed in the present report. Abstract We report auditory observations on /r/ approximations produced by 11 Swedish 2-yearolds. About half of the 1291 expected instances of /r/ were either realized as vocoids or just dropped. Most of the contoid realizations were approximants or fricatives whereas taps, flaps, trills, laterals, nasals and stops occurred marginally. This was roughly consistent with the phonetic norms for the ambient language. The most frequently occurring places of articulation were coronals, palatals and, to some extent, glottals. Some of this place variation could be explained in terms of number of attempted word types suggesting that both vocabulary size and ambient-languge-like /r/ productions constitute different aspects of linguistic maturity in young children. Introduction Rhotics (r sounds) are well known for their unusually wide range of variation in terms of manner and place of articulation (e.g., Lindau 1985, Ladefoged and Maddieson 1996, Demolin 2001). More or less common variants are voiced or voiceless vocoids, approximants, fricatives, trills, taps and flaps produced at various places of articulation. Whereas the rhotics tend to occupy the ‘liquid slot’ adjacent to the syllable nucleus and, thus, have a common distribution in terms of the ‘sonority hierarchy’ (Jespersen 1904), they clearly lack an invariant acoustic basis (such as a lowered F3; see, e.g., Lindau 1985). 
To be sure, the rhotics may be phonetically related in terms of a Wittgensteinian ‘family resemblance’ such that /r/1 resembles /r/2 that resembles /r/3 an so on up to /r/n; however, /r/1 and /r/n may not have a single phonetic property in common (Lindau 1985). But even though the family resemblance metaphor may characterize relationships between category members in an interesting way, it does not serve to delimit the category in the first place. The apparent lack of unity among the rhotics is bound to cause trouble on children’s path to spoken language. However, our knowledge Methods The subjects used in this study were 11 normally developing Swedish 24-monthers, 6 girls and 5 boys. The children were drawn from a larger group of approximately 60 Swedish and 60 American English children at the ages of 6, 12, 18, 24 and 30 months. Subjects and recordings were described in detail in Engstrand et al. (2003). In summary, audio- and videorecordings were made in nursery-like, soundtreated rooms in Stockholm and Seattle. All children were accompanied by a parent (usually the mother). Both parents were representative of the regional standard spoken in the Stockholm and Seattle areas, respectively. The audio and video signals were subsequently digitized and stored on dvd disks. 63 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University All utterances expected to contain one or more /r/ were identified. The video films were used to aid in interpreting the children’s utterances. In total, 1291 instances of expected /r/ were found, classified auditorily and transcribed. To establish a common transcription standard, the first part of the analysis was carried out by the authors jointly. The remaining material was divided up into equal parts and transcribed individually. Results Out of the 1291 expected instances of /r/, 613 (47 percent) were realized as contoids. Whether the remaining /r/s had a vocoid realization or were just dropped was often hard to determine reliably. At a rough estimate, however, approximately 10 percent were vocoids and 43 percent were dropped. The distribution of manners and places of articulation for the contoid realizations is shown in table 1. Table 1. Distribution of manners and places of articulation across the subject group (percent of all contoid /r/ realizations, N=613). N=613 Approx Labio-dental 0.0 Fricative 0.3 Tap or flap 0.0 Lat. approx 0.0 Detal/alveolar 23.2 5.2 4.2 Retroflex Palatal Velar/uvular Glottal Total 5.7 25.8 0.2 0.0 54.8 15.3 0.8 1.3 9.3 32.3 0.0 0.0 0.0 0.0 4.2 The (non-lateral) approximants represented the most frequent manner type occurring in 54.8 percent of the /r/ instances (bottom row, leftmost column). Among the approximants, the palatals, i.e., [j]-like sounds, were the most common (25.8 percent of the cases). Most of the remaining approximants were coronal, i.e., dentals, alveolars or retroflexes. Fricative /r/ realizations were also common with 32.3 percent of all instances. Most of these (20.5 percent) were coronal, but there were also several occurrences of an [h]-like, glottal fricative. The fricatives represented the only manner of articulation that was produced at all places. Taps or flaps were infrequent (4.2 percent) and were always dental or alveolar. The incidence of lateral approximants, i.e., [l]-like sounds, was also low (3.1 percent) as was the case of stop realizations (3.8 percent, either coronal or glottal). Nasal realizations occurred marginally (1.3 percent) as did the trills (0.5 percent). 
For places of articulation, there were very few instances of labio-dentals (0.3 percent). Together, dentals, alveolars and the retroflexes formed a dominating group of coronals representing 60 percent of the entire material. Palatals also occurred rather frequently (26.6 per- Stop Nasal Trill Total 0.0 0.0 0.0 0.3 3.1 1.3 1.3 0.3 38.6 0.0 0.0 0.0 0.0 3.1 0.2 0.0 0.0 2.3 3.8 0.0 0.0 0.0 0.0 1.3 0.2 0.0 0.0 0.0 0.5 21.4 26.6 1.5 11.6 100 cent). With palatals and velars/uvulars taken together, this group of dorsals constituted 28.1 percent of the entire material. The glottals, finally, accounted for 11.6 percent of all observations. In sum, approximately 54 percent of these children’s contoid /r/ realizations were similar to /r/ allophones normally found in central Swedish adults. These are the coronal approximants, fricatives, taps/flaps or trills. Table 2 shows mean percentages, number of subjects and ranges of variation for the respective manner/place combinations (phonetically unlikely or impossible sound types are marked with asterisks). For each child, the incidence of each /r/ type was calculated as a percentage of all /r/ approximations produced by that child. The mean values shown in the table are group averages of these percentages. For example, the second row of the second column shows that 5 children produced alveolar or dental fricatives (whereas the remaining 6 children did not use this type of /r/ approximation). The proportions of fricatives in these 5 children ranged from 2.4 to 40.9 percent. The mean of the 5 positive values and the 6 zeros (i.e., non64 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University glottal stops (7). However, the extent to which these sound types were used differed widely between individual children (as seen from the ranges of variation). For example, alveolar or dental approximants were used by 8 children, as noted. However, for one of these children, the type accounted for a mere 2 percent of all /r/ approximations whereas, for another child, the corresponding figure was 44 percent. For the palatal approximants and the glottal fricatives, these ranges were even greater. production of this sound type) amounted to 5.4 percent. The table shows a great amount of variability across both children and sound types. It shows, for example, that 5 of the 42 manner/place combinations were used to some extent by more than half of the children. Those sound types were the alveolar or dental approximants (8 children), the palatal approximants (9), the glottal fricatives (10), the alveolar or dental lateral approximants (7), and the Table 2. Mean percentages, number of subjects and ranges of variation for the respective manner/place combinations. Phonetically unlikely or impossible sound types are marked with an asterisk. Approx Fricative Mean 0.0 0.4 # subj. 0 1 Range 0 0 12.6 5.4 Alveolar/ Mean # subj. 8 5 dental Range 2.3-4.2 2.4-40.9 Mean 2.1 5.5 4 2 Retroflex # subj. Range 1.7-17.6 28.3-32.1 Mean 34.8 1.0 # subj. 9 2 Palatal Range 0.5-71.1 1.7-9.8 Mean 0.2 1.5 Velar/ # subj. 1 2 uvular Range 0 7.3-9.1 Mean *0.0 16.3 # subj. 0 10 Glottal Range 0 0.8-83.3 *Unlikely or impossible sound types. Labiodental Tap or flap 0.0 0 0 2.0 3 2-4-16.7 0.0 0 0 0.0 0 0 0.0 0 0 *0.0 0 0 Lat. 
appr *0.0 0 0 3.7 7 1.6-7.9 0.0 0 0 0.0 0 0 0.0 0 0 *0.0 0 0 Stop 0.0 0 0 1.4 4 1.1-6.7 0.5 1 0 0.0 0 0 0.0 0 0 5.4 7 0.8-33.3 Nasal Trill 0.0 *0.0 0 0 0 0 6.8 0.1 3 2 3.3-66.7 0.5-0.8 0.0 0.1 0 1 0 0 0.0 *0.0 0 0 0 0 0.0 0.0 0 0 0 0 *0.0 *0.0 0 0 0 0 lars and retroflexes), glottals and dorsals (palatals, velars and uvulars), respectively. The straight lines, which are linear approximations to the data points for the respective types, suggest an increase in the number of coronal /r/ realizations as a function of the number of produced word types (r = 0.70). In contrast, the glottal realizations exhibit the opposite, negative trend (r = -0.63). For the dorsals, the effect is negligible (r = -0.18). In this sense, then, children who displayed a larger /r/ vocabulary seemed to conform phonetically to the ambient language to a higher degree than did children with a smaller /r/ vocabulary. It should also be noted that the percentages given in table 2 are based of very different numbers of total /r/ approximations. For example, one of the children produced a total of 6 /r/ approximations only. Two of these were classified as glottal stops and, thus, accounted for approximately 33 percent of the 6 cases. In contrast, another child produced 187 /r/ approximations. Sixty-seven of these, i.e., approximately 36 percent, were alveolar or dental approximants. Thus, similar percentages may differ widely in statistical stability. The variable amounts of /r/ approximations is illustrated in figure 1 Figure 2 shows percent occurrence of coronal, dorsal and glottal /r/ realizations by number of word types. Circles, squares and diamonds represent coronals (dentals, alveo65 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University size and ambient-languge-like /r/ productions provide two independent indices of linguistic maturity in young children. It is also possible that vocabulary growth forces an increasing attention to phonetic detail in production. It should be noted, however, that these conclusions are wholly based on auditory judgments which will, to the extent possible, require instrumental verification. Acklowledgments This work was supported by grant 20038460-14311-29 from the Swedish Research Council (VR) to O. Engstrand. Figure 1. Frequency distribution showing the variable amounts of /r/ approximations. Notes 1. Names in alphabetical order. References Demolin D. (2001) Some phonetic and phonological observations concerning /R/ in Belgian French. In H. Van de Velde and R. van Hout (eds.), ‘r-atics. Sociolinguistic, phonetic and phonological characteristics of /r/. Etudes & Travaux, Institut des langues vivantes et de phonétique, Université Libre de Bruxelles, 63-73. Engstrand O., Williams K. and Lacerda F. (2003) Does babbling sound native? Listener responses to vocalizations produced by Swedish and American 12- and 18month-olds. Phonetica 60, 17-44. Fox A.V. and Dodd B.J. (1999) Der Erwerb des phonologischen Systems in der deutschen Sprache. Sprache-StimmeGehör 23, 183-191. Jespersen, Otto. 1904. Phonetische Grundfragen. Leipzig and Berlin: Teubner. Ladefoged P. and Maddieson I. (1996) The Sounds of the World’s Languages. Oxford: Blackwell. Lindau M. (1985) The story of /r/. In Fromkin V.A. (ed) Phonetic Linguistics: Essays in Honor of Peter Ladefoged, 157-168. Orlando: Academic Press. Muminovic D. and Engstrand O. (2001) /r/ in some Swedish dialects: preliminary observations. Working Papers (Dept. Linguistics, Lund University) 49, 120123. Vihman M.M. 
(1993) Variable paths to early word production. Journal of Phonetics 21, 61-82. Figure 2. Percent occurrence of coronal, dorsal and glottal /r/ by number of word types. Circles, squares and diamonds represent coronal, glottal and dorsal realizations, respectively. Conclusions Auditory observations on 11 Swedish 2-yearolds have shown a high degree of variation in the phonetic realization of /r/. On the whole, however, approximants and fricatives constituted the dominating manners of articulation, whereas taps, flaps, trills, laterals, nasals and stops occurred marginally. This is roughly in accordance with the phonetic norms for the ambient language (cf. Muminovic and Engstrand 1991 for similar patterns in a number of Swedish dialects). The most frequent places were coronal, palatal and, to some extent, glottal. The glottal articulations were counter to expectations since they are foreign to central Swedish. So are [j]-like /r/ realizations, but these were nevertheless expected from informal observations. Some of the place variation could be explained in terms of ‘vocabulary size’ in the sense of number of attempted /r/ words. It may be that vocabulary 66 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University Tonal word accents produced by Swedish 18- and 24month-olds Germund Kadin and Olle Engstrand Department of Linguistics, Stockholm University accent contrast and that, in consequence, Swedish word accent acquisition typically takes place during the 18-24 months age interval. In the present study, this assumption was tested using a group of 18-month-olds. To verify the language-specificity of observed accent patterns, comparable groups of American English 18- and 24-month-olds were used as controls. Abstract F0 measurements were made of disyllabic words produced by several Swedish and American English 18- and 24-month-olds. The Swedish 24- and 18-monthers produced accent contours that were similar in shape and timing to those found in adult speech. The Swedish 18monthers, however, produced very few words with the acute accent. It is concluded that most Swedish children have acquired a productive command of the word accent contrast by 24 years of age and that, at 18 months, most children display clear tonal ambient-language effects. The influence of the ambient language is evident in view of the F0 contours produced by the American English children whose timing of F0 events tended to be intermediate between the Swedish grave and acute contours. The relative consistency with which grave accent contours were produced by the Swedish 18-monthers suggest that some children are influenced by the ambient language well before that age. Methods Subjects were drawn from a larger group of approximately 60 Swedish and 60 American English children at the ages of 6, 12, 18, 24 and 30 months. Audio and video recordings were made as described in Engstrand et al. (2003). The present study is based on recordings of • 11 Swedish 24-month-olds (6 girls, 5 boys) • 13 American 24-month-olds (6 girls, 7 boys) • 11 Swedish 18-month-olds (6 girls, 5 boys) • 16 American 18-month-olds (9 girls, 7 boys), i.e., a total of 51 children, including the 24monthers used in Engstrand and Kadin (2004). All disyllabic words with stress on the first syllable were analyzed according to criteria described in Engstrand and Kadin (2004). 
F0 was measured at five points in time: at 1) the acoustic onset of V1, 2) the F0 turning-point in V1 (if the F0 contour was monotonic throughout the vowel, the turning-point was assigned the value of the onset), 3) the acoustic offset of V1, 4) the acoustic onset of V2, and 5) maximum F0 in V2 (if F0 declined throughout the vowel, maximum F0 was assigned the value of the onset). A Fall parameter was defined as the F0 difference between V1 turning-point and offset, and a Rise parameter was defined as the F0 difference between V2 maximum and V1 offset. All measurements were made using the Wavesurfer program package. Introduction Swedish has a contrast between a ‘grave’ and an ‘acute’ tonal word accent. The acute accent is associated with a simple, one-peaked F0 contour. The grave accent typically has a twopeaked F0 contour with a fall on the primary stress syllable and a subsequent rise towards a later syllable in the word (Bruce 1977, Engstrand 1995, 1997). A preliminary report on Swedish children’s acquisition of the word accents was presented in Engstrand and Kadin (2004). The results, which were based on 6 Swedish and 6 American English 24-month-olds, suggested that, at that age, Swedish children are well on the way to establishing a productive command of the accent contrast. The present study was carried out to test this preliminary conclusion using additional subjects. In addition, previous studies (Engstrand et al. 1991, Engstrand et al. 2003) have suggested that most 17-18-montholds display a much less consistent use of the Results Auditory judgments first suggested that a majority of the words produced by the Swedish children (both 18- and 24-month-olds) had a grave-like tonal contour and that, in general, 67 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University grave and acute accents were assigned to words in accordance with the adult norm. In contrast, none of the American English word productions sounded convincingly grave. this fall was, on the whole, less marked than for the older group (the grand means were 76 and 50 Hz for the 24- and 18-monthers, respectively). The Rise parameter, too, was positively evaluated for most 18-monthers but, again, less consistently than for the 24monthers (grand means 111 and 55 Hz, respectively). Two of the 18-monthers (SW18F3 and SW18M5) did not evidence any rise into the second vowel. Measurement results are summarized in table 1-5 (a full statistical treatment will be reported elsewhere). The tables present means and standard deviations for the Fall and Rise parameters. The bottom line of each table shows grand means and standard deviations. In the left column, SW and AM represent Swedish and American English, respectively, 18 and 24 indicate the respective ages in months, and F and M stand for sex (female or male). The last figure is a reference number that identifies the individual child. Thus, for example, SW24F1 stands for a Swedish 24 months old girl with the reference number 1. Table 2. Means for the Fall and Rise parameters in grave accent words produced by the Swedish 18monthers. Child SW18F1 SW18F2 SW18F3 SW18F4 SW18F5 SW18F6 SW18M1 SW18M2 SW18M3 SW18M4 SW18M5 Grand mean Table 1. Means for the Fall and Rise parameters in grave accent words produced by the Swedish 24monthers. 
Child SW24F1 SW24F2 SW24F3 SW24F4 SW24F5 SW24F6 SW24M1 SW24M2 SW24M3 SW24M4 SW24M5 Grand mean GRAVE ACCENT Fall (Hz) Rise (Hz) Mean SD Mean SD 69 62 102 121 60 29 69 40 62 45 70 66 113 88 93 129 72 41 99 124 77 38 127 125 102 141 392 286 59 42 70 84 101 122 106 136 70 59 42 78 60 35 50 44 76 64 111 112 N 21 22 19 26 23 12 2 23 23 21 23 GRAVE ACCENT Fall (Hz) Rise (Hz) Mean SD Mean SD 41 44 139 180 59 41 94 67 30 28 1,0 65 64 17 33 24 114 107 52 219 27 18 70 98 43 46 58 85 23 38 45 44 39 33 38 134 35 29 83 34 80 93 -3,0 27 50 45 55 89 N 2 14 4 4 7 2 17 4 6 13 2 75 The Swedish acute productions were relatively few – 37 in all produced by 6 of the 11 24month-olds. Thus, 5 of the Swedish 24monthers lacked acute productions altogether (table 3). Again, the Fall parameter had positive values which were, however, smaller than in the grave productions (and, as we shall see, with a different timing of the F0 peaks). The Rise parameter values differ markedly from those pertaining to the grave words in that they were consistently negative. This means that the acute words displayed a continuous F0 decline from an early peak in the first syllable. In other words, the acute words had a clearly ‘onepeaked’ F0 contour. 215 Swedish grave words produced by the 24month-olds consistently displayed positive values for both the Fall and the Rise parameters (table 1). This means that 1) F0 declined from a turning-point in the primary stress vowel reaching a relatively low value at the end of that vowel, and 2) rose to resume a relatively high position in the second vowel resulting in a ‘two-peaked’ F0 contour. Acute productions by the Swedish 18monthers were too few to provide a basis for reliable generalizations. However, parameter values tended to differ from those for the grave productions and to resemble those pertaining to the 24-monthers. The Swedish 18-month-olds show a similar, but somewhat less consistently two-peaked contour (table 2). The positive values of the Fall parameter indicate that F0 declined from a turning-point in the primary stress vowel, but 68 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University Table 5. Means for the Fall and Rise parameters in words produced by the American 18-monthers. Table 3. Means for the Fall and Rise parameters in acute accent words produced by the Swedish 24monthers. Child SW24F1 SW24F2 SW24F3 SW24F4 SW24F5 SW24F6 SW24M1 SW24M2 SW24M3 SW24M4 SW24M5 Grand mean ACUTE ACCENT Fall (Hz) Rise (Hz) Mean SD Mean SD 29 11 46 78 96 49 11 2,0 42 9,5 -3,7 25 -79 106 -29 22 -52 1,7 16 2,7 35 31 -73 57 43 15 -37 27 AMERICAN ENGLISH Fall (Hz) Rise (Hz) Mean SD Mean SD N AM18F1 41 68 -47 67 5 AM18F2 19 27 84 171 6 AM18F3 0 -11 1 AM18F4 39 49 -33 190 10 AM18F5 30 52 1 AM18F6 27 24 -48 102 12 AM18F7 28 22 -27 18 4 AM18M1 11 159 1 AM18M2 36 36 24 73 8 AM18M3 36 25 25 83 10 AM18M4 44 78 4,8 79 5 Grand mean 28 41 17 98 63 Child N 0 6 7 3 9 0 0 7 0 5 0 37 The above tables have shown differences between accent types and ambient languages in terms of the Fall and Rise parameters. Thus, timing has so far been disregarded. However, timing of F0 events in relation to segmental structure is crucial as illustrated in figure 1. The figure shows mean data for the Swedish and American English children in both age groups (symbols are explained in the figure legend). Grave and acute productions are shown for the Swedish 24-monthers. 
F0 values are timealigned to the first measurement point, and the data points are connected by smoothed lines that bear a certain resemblance to authentic F0 contours. The measurement points correspond to acoustic events as described above. F0 values pertaining to the American children tended to resemble those for the Swedish acute productions (see tables 4 and 5; five of the 16 American 18-monthers did not produce any usable utterances and have been left out). The Fall parameter values were moderately positive and the Rise parameters tended to be on the negative side. Thus, the American data suggest a moderate F0 decline in the first vowel which, on average, continued into the second vowel. Table 4. Means for the Fall and Rise parameters in words produced by the American 24-monthers. AMERICAN ENGLISH Child Fall (Hz) Rise (Hz) Mean SD Mean SD N AM24F1 69 82 -53 175 12 AM24F2 58 41 -4,5 45 16 AM24F3 33 23 -2,8 57 13 AM24F4 55 71 -80 81 15 AM24F5 32 31 -28 57 13 AM24F6 51 24 -2,9 76 7 AM24M1 20 27 -6,4 58 16 AM24M2 50 75 5,6 27 16 AM24M3 29 36 -40 82 16 AM24M4 93 149 -0,71 90 7 AM24M5 26 21 31 45 10 AM24M6 27 26 2,6 47 9 AM24M7 34 17 1 Grand mean 45 51 -10 70 151 The main timing effect is found in the first F0 maximum which, for the Swedish children, appears early in the stressed vowel of the grave words and near the following vowel-consonant boundary in the acute words. The productions pertaining to American children are intermediate with an F0 maximum approximately at or after the middle of the stressed syllabe. Also note the differences in the second vowel (mesurement points 4 and 5). The contour describes a steady slope in the Swedish acute productions, whereas the grave contour reaches as new peak in the second vowel. Thus, the acute contour differs from the grave contour both in having a late turning-point in the first vowel and in lacking a secondary peak in the following vowel. Even though the American English contours reflect occasional rises toward the 69 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University second vowel, these rises are smaller and less systematic than those of the Swedish grave contours. months, many Swedish children begin to produce grave-like F0 contours and to mark the appropriate words with these contours. Engstrand et al. (2003) reached a similar conclusion on the basis of listening tests. Based on those studies as well as preliminary analyses of the present material, Engstrand and Kadin (2004) hypothesized that acquisition of the Swedish tonal word accents typically takes place in the 18-24 months age interval. However, the relative consistency with which grave accent contours were produced by the present 18-monthers would suggest that some children are influenced by the ambient language well before this age. This is in agreement with results of listening tests suggesting occasional grave-like tone contours as early as at 12 months of age (Engstrand et al. 2003). Time (ms) Figure 1. Average F0 contours derived from mean parameter values shown in tables 1 - 5. Symbols: filled diamonds=SW24 grave, unfilled diamonds=SW18 grave, gray diamonds=SW24 acute, filled squares=AM24, unfilled squares=AM18. Acknowledgment Summary and conclusions This work was supported by grant 2003-846014311-29 from the Swedish Research Council (VR) to O. Engstrand. Auditory judgments have shown that disyllabic words produced by Swedish 24- and 18-montholds mainly carry the grave tonal word accent. 
This is an expected influence of the ambient language, since a majority of disyllabic Swedish words are characterized by that accent. Whereas the Swedish 24-monthers also produced a significant number of acute words, the acute accent occurred very rarely in the younger group. F0 contours were usually shaped and timed according to the adult norm. This was the case in all 24-month-olds and in all but two of the 18-month-olds. This suggests that most Swedish children at 24 months of age have established a productive command of the word accent contrast, and that many 18-montholds are in a fair way to acquiring the grave accent. However, the virtual absence of acute words in the 18-month-olds makes it hard to determine whether a systematic accent contrast has been established at that age. References Bruce G. (1977) Swedish Word Accents in Sentence Perspective. Lund: Gleerup. Engstrand O. (1995) Phonetic interpretation of the word accent contrast in Swedish. Phonetica 52, 171-179. Engstrand O. (1997) Phonetic interpretation of the word accent contrast in Swedish: Evidence from spontaneous speech. Phonetica 54, 61-75. Engstrand O., Williams K. and Strömqvist S. (1991) Acquisition of the tonal word accent contrast, Actes du XIIème Congres International des Science Phonétiques, Aix-enProvence, vol. 1, pp. 324-327. Engstrand O., Williams K. and Lacerda F. (2003) Does babbling sound native? Listener responses to vocalizations produced by Swedish and American 12- and 18-montholds. Phonetica 60, 17-44. Engstrand O. and Kadin G. (2004) F0 contours produced by Swedish and American 24month-olds: implications for the acquisition of tonal word accents. Proceedings of the Swedish Phonetics Conference held in Stockholm 26-28 May 2004, pp. 68-71. The influence of the ambient language is even more evident in view of the F0 contours produced by the American English controls. The timing of those contours tended to be intermediate between the Swedish grave and acute contours. Out of the five Swedish 17-18-month-olds observed in a similar study (Engstrand et al. 1991), three showed a two-peaked, grave-like F0 contour in grave words (even though a rise was consistently present on the second syllable). It was tentatively concluded that, at 17 70 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University Development of adult-like place and manner of articulation in initial sC clusters Fredrik Karlsson Department of Philosophy and Linguistics, Umeå University to acoustic manipulation of stimuli in the plosive- nasal and labial- dental ranges , Miller and Eimas (1976) presented a simi lar pattern of categorical perception for both investigated feature manipulations. Furthermore, the results from Miller and Eimas (1976) provided evidence for a right ear predominance for both place and manner processing. Miller and Eimas in terpreted their findings as indications of similar processing of the phonetic features place and manner. In addition , the per ceptual processing of the features place and manner of articulation have been shown to be stable, compared to other features, even with a reduced set of acous tic cues. Singh (1971) investigated the per ceived closeness of consonants in six con ditions of reduced acoustic information. The results provided evidence of nasality being the most resistant feature across conditions . Singh and Black (1966) pro posed that place of articulation was sec ond only to nasality in feature strength. 
Place of articulation is therefore hypothe sized to be mastered first in the progres sion towards an adult- like production of an initial consonant cluster. Regarding the structure of the output form, sC clusters may, in the early stages of acquisition, be reduced to a single con sonant. This change in output form of the child's production has been described as an application of a phonological process of cluster reduction, where the cluster is reduced to one of it's elements, and clus ter simplifying processes, where the con sonant cluster is substituted by a different consonant combination or by a segment that is not an element of the target conso nant cluster (coalescence). McLeod et al. (2001) noted that a trading relation had been observed in the literature between the frequency of consonant cluster reduc tion and simplification: cluster reduction has been reported more frequently in ear ly intermediate forms and replaced by simplification in forms produced in the Abstract Previous investigations have proposed that nasality in consonants are more perceptu ally stable than place of articulation in constrained conditions. This paper investi gates the progression of initial consonant clusters from a reduced to an adult- like form in terms of manner and place of articulation in the speech of children between the age of 1;6 and 3;5. The results show an earlier onset of stable production of manner compared to place for in both full clusters and in the reduced form. The results are interpreted as evidence of the importance of perceptual salience of segmental properties in the acquisition initial consonant clusters. Introduction Initial sC clusters occur frequently in spontaneous Swedish (Bannert & Czigler 1999) and are the refore a predominant feature of the ambient language of children learning Swedish. Previous reports concerning children's productions of word- initial word- initial sC clusters have shown that early output forms of the sec ond consonant of the cluster may have a deviating phonetic quality com pared to the adult model form (see McLeod, van Doorn and Reed (2001) for a summary of discovered trends in consonant cluster acquisition). In clusters with a plosive as the second element, the reduced form may involve an change in place of articulation caused by an application of a hypothesized fronting rule. Non- plosive consonants may, in ad dition, be substituted by a consonant with a different manner of articulation com pared to the target consonant, e.g. through application of a stoping process . For adult speakers, the articulatory features of place and manner of articulation have been shown in the literature to be correlated regarding their perceptual sta bility. In a study the perceptual response 71 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University stage before production of a complex syllable onset. The aim of this study was to investigate the acquisition of an adult- like production in terms of place and manner in complex syllable onset in word- initial position. From the milestone of complex onset pro duction, the change in the articulatory features of place and manner was investi gated throughout the progression of words with an initial [sn], [st] or [sk] clus ter. Acquisition of a stable production of manner before place of articulation would be interpreted as further support for the perceptual hierarchy found in the previous research. 
In addition, it would be interpreted as support for extending the effect of perceptual hierarchies found for features singleton segments to the acqui sition of consonant clusters. material. Based on the tabulated progressions of each target onset, the productions investi gated in this study were extracted accord ing to two criteria: 1) that the initial out put forms produced by the child was not produced in an adult- like manner in terms of the feature set of the consonant, and 2) that a progression should be ob served in the data in terms of place or manner of articulation. As a results of these criteria, productions made by seven subjects, three female and four male, were selected for further analysis. The age ranges of the investigated subjects at the time of recording are presented in table 1. Age of acquisition of stable and adultlike production was determined separate ly for the articulatory features place and manners as well as for the complexity of the syllable onset. Furthermore, the age of adult- like production of the second element of the target cluster (stop or nasal consonant) was established for place and manner of articulation. For all investigat ed features, onset of a stable production was determined to be the session at which four out of the following five productions was produced with the same value in the investigated featur es. Method Speech material The data set investigated in the present paper was extracted from a corpus con sisting of 5311 productions collected in order to investigated the development in output forms of word- initial consonant clusters in Swedish children between the age of 1;6 and 3;6. In this corpus, record ings were conducted on a monthly basis in a sound- treated recording studio. The target words were elicited by the accom panying adult using black- and- white pic ture prompts. Table 1 The age of each subject (in weeks) at the first and last recording session. Procedure A narrow transcription of the produc tions were subsequently produced by the author. The transcribed segment labels were then substituted by a feature vector describing the segment in terms of articu latory features, including place and man ner of articulation. Consonant segments in the onset of the first syllable of the pro duction were marked by the position in the onset and the progression of the target words tå ([t h o:]), snår ([sno:r]) and skal ([skA:l]) were tabulated according to target word and subject's age. 159 productions of skal , 132 productions of snor and 198 productions of stå were included in the Speaker First session Last session F1 105 151 F2 109 158 F3 77 128 M1 79 130 M2 124 178 M3 90 142 M4 84 129 Results The ages when a steady production of place and manner in the C consonant as well as in the full sC cluster are presented in figure 1. For subjects F1, F2 and M3, place and manner of articulation was sta ble in singleton consonants from the on set of sampling. The bottom circles of F1 72 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University and F2 and the bottom row of squares for the subject M3 therefore indicate that cor rect production had been acquired before or at this point in development. There are no data available on the delay of this point compared to the real onset of adult- like productions; these data points are there fore not be included in calculations below. 
Table 2 presents the differences in weeks between the time of stable adultlike production of manner or place of ar ticulation in simple and complex syllable onsets. The mean delay in manner was 24 weeks (27 weeks for [sk], 20 weeks for [st] and 23 weeks for [sn]). For place of articu lation, the mean delay in onset of stable adult- like production was 29 weeks (8 weeks for [sk], 34 weeks for [st] and 30 weeks for [sn]. Table 3 Difference in weeks between the the onset of stable productions of complex onsets and stable production of adult- like manner or place. Empty cells indicate that stable produc tion was not acquired in the investigated time window Speake r F1 F2 F3 M1 M2 M3 M4 SkalCC man 4 0 0 13 0 10 Sto CC man 6 10 13 0 Sno CC man 0 0 5 10 SkalCC place 0 0 StoCCplace 0 35 29 0 0 SnoCC place 0 38 0 6 0 4 15 Table 2 Difference in weeks between the time of stable adult- like production of adult- like manner or place in simple and complex syllable onsets. An asterisk indicates an excluded value due to an adult- like production in the singleton at the onset of recording. Empty cells indicate that stable production was not acquired in the investigated time window. Speaker F1 F2 F3 M1 M2 M3 M4 Skal ma nner * * 8 41 24 * Sto: manner * * 15 23 21 Snormanner * * 19 30 21 Skal place * * Sto: place * * 51 28 23 Snors place * * 24 30 36 34 8 The differences in weeks between the the onset of stable productions of complex onsets and stable production of adult- like manner or place are presented in table 3. The mean delay for manner was 5 weeks (4 weeks for [sk], 7 weeks for [st] and 4 weeks for [sn]). The mean delay for place was 10 weeks (0 weeks for [sk], 13 weeks for [st] and 14 weeks for [sn]). A Pearson correlation matrix of the data presented in figure 1 showed a strong correlation between the age of stable pro duction of manner (r=0.98) across target words. Figure 1The age of stable production of place and manner of articulation are presented above for each investigated speaker. Each cell shows the progression of output forms of the target words skal , snår and stå. For subjects F1, F2 and M3, place and manner of articula tion was stable in singleton consonants in the initial recording session. 73 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University A high correlation was obtained for place of articulation between the clusters [sn] and [st] (r=0.94). The production of complex syllable onset in target words were also highly correlated (r>0.95 for all correlations). showed that manner does meet a criterion of 75% in that session. Therefore, the ap parent reversal of the age acquisition of place and manner found in M1 may be an artifact of the chosen criterion. Therefore, it is concluded that the per ceptually established ordering of manner before place of articulation in terms of feature strength is in agreement with the age of stable acquisition of these features in complex syllable onsets. The develop ment of complex syllable onsets is there fore viewed as strongly influenced by con straints in perception. Discussion and conclusion The previous research done in on the strength and stability of the feature place and manner of articulation have estab lished a similarity in the perceptual pro cessing of these features (Miller and Eimas 1977). However, the work done by Singh and Blank (1966) and Singh (1971) sug gests that nasality, as an instance of man ner of articulation, is more perceptually stable than place of articulation in a noisy environment. 
The obtained age of acquisition of a stable production of sk, st and sn clusters presented in figure 1 is in close agreement with the perceptual hierarchy proposed by the Singh (1971). Manner was generally acquired before, or at the same time as, place of articulation in productions that was reduced to a singleton consonant. Syllable onsets with more than one member were generally produced in a sta ble way after the adult- like production of place had been achieved. Following the acquisition of complex syllable onset pro ductions, manner of articulation was sta bilized first. In the seven instances when place of articulation was acquired, the on set of a stable and adult- like production occurred after the acquisition of manner. The same pattern is observed in com plex syllable onsets. Mean delay in acqui sition from the onset of complex structure in the syllable onset (table 3) is greater for place than for manner. In fact, successful acquisition of a stable production of adult- like place is never achieved before manner for the full cluster. One exception to the above described pattern of progression exists in the data: the onset of a stable production of place in the reduced form of the target word skal produced by M1. In this stage or pro duction, stable production of place were acquired before manner. Inspection of the production made in manner age the age of acquisition of place plotted in figure 1 Acknowledgements The author would like to thank the children who participated in this study and the children's parents for bringing the children to the recording studio and for participating in the elicitation of the target words. References. Bannert, R. and Czigler, P. E. (1999) Variations in consonant clusters in Standard Swedish. PHONUM 7. McLeod, S., van Doorn, J. and Reed, V. A. (2001) Consonant Cluster Develop ment in Two- Yeas-Olds: General Trends and Individual Difference. Journal of Speech, Language and Hear ing Research 44. 1144- 1171. Miller, J. L. and Eimas, P. D. (1977) Studies on the perception of place and manner of articulation: A comparison of the labial- alveolar and nasal- stop distinc tions. Journal of the Acoustical Society of America 61(3), 835-845. Singh, S. (1971) Perceptual similarities and minimal phonemic differences. Journal of Speech and Hearing Research 14, 113-124. Singh, S. and Black, J. W. (1966) Study of twenty- six intervocalic consonants as spoken and recognized by four lan guage groups. Journal of the Acoustical Society of America 39, 372-387. 74 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University Phonological Interferences in the Third Language Learning of Swedish and German (FIST) Robert Bannert Department of Philosophy and Linguistics, Umeå University Abstract Aim In general, the teaching of pronunciation has no high priority. This is also true of the beginner courses of German and Swedish at the academic level. A tailored pronunciation programme must be constructed upon empirically founded knowledge about the difficulties the learners encounter. A large material of both target languages compiled during three terms from approximately 30 students per language will be collected in the data base FIST and all deviations from the pronunciation norm will be marked systematically. Thus quantitative statements about the learning problems will be possible. Difficulty profiles are obtained for each individual, each group and each text type. The aim of the project is to study the pronunciation problems of the students. 
The following questions, among others, will be answered: (1) What role does the first language (L1) play for the learning of the target language’s pronunciation? Special concern will be given to each learner’s dialect or regional variety of the Standard languages. (2) What role does the pronunciation of the second or third language play? (3) Which phonological and phonetic targets are easy and which are difficult related to the structural similarities in both languages? (4) Which interplay between the various difficulties is to be discovered? Which implications can be observed? The answers to these questions will constitute the scientific basis on which new and well adopted learning material can be developed later by didactic and pedagogical experts for both language groups. Introduction In second language learning research of adults it is agreed that in the area of phonology and phonetics a clear negative transfer, interference, can be observed. This is due to the influence of the first language (L1). However, this is not the only reason for foreign accent; interlanguage, too, plays an important rôle. Only recently research has paid attention to the special case of learning in a multiple language setting. A few years ago, beginner courses in German were introduced at the academic level in Sweden. In the German speaking countries, due to a long tradition, beginner courses in Swedish attract many students. For both groups, the teaching of pronunciation is allotted only a small amount of time. As a consequence of this, the learners' target language is characterized by a strong accent which in most cases becomes fossilised. In order to prevent this, a pronunciation programme should be constructed that is tailored just for the special preconditions of the learners, namely their first language (L1) and the multiple contexts: all learners have already learned at least one foreign language. An extraordinary challenge lies in the fact that German and Swedish are linguistically very close to each other. Therefore it should be rather easy to help the learners to a good pronunciation right from the start. Research situation Research on second language learning has centred around the question whether or not the first language affects the target language. Experience tells us that transfer and interference do occur when pronunciation, i.e. the learning of phonology and phonetics, is concerned. While the literature is abundant with studies of syntax and morphology, the learning of pronunciation has not been studied to a greater extent. Hammarberg has studied certain aspects of Swedish as a second language (1985, 1988, 1997). He (1993) made a study of third language acquisition investigating his co-author. Since the middle of the seventies, Bannert has done research on several aspects of learning Swedish pronunciation (1979a, b; 1980, 1984, 1990) and on the German sound systems and prosody (1983, 1985, 1999). A large and extensive survey "Optimizing Swedish Pronunciation" (Bannert 2004) was carried out in the late seventies in Lund. Swedish being the target language, the pronunciation difficulties of 70 adult learners representing 25 75 Consonants Swedish: voiceless fricatives [Ç, Í] tjugo (twenty), sju (seven), retroflexes [Ê, ˇ, ß, =, ] fort (fast), bord (table), mars (March), barn (child), Karl (Charles). German: voiceless fricatives [S, ç] Schuh (shoe), mich (me), glottal stop [/] Theater (theatre). first languages were studied. 
The survey included also German represented by three learners from different regions: Northern Germany, Bavaria and Switzerland. Due to a grant from Vetenskapsrådet, it was possible to conduct several initiating pilot studies for the project. Recordings of students in Umeå and Freiburg were made and analysed. Students were interviewed about their introspection of their pronunciation difficulties and “Think-aloud” protocols were written. A demo version of the database (www.ling.umu.se/FIST) was programmed showing the labels to be used. Socio- and psycholinguistic background variables were collected. Thus the project is based on a secure and safe ground. Prosody Swedish: two word accents [acute, grave] 'buren - 'bùren (the cage - borne), focus accent manifested separately, complementary length of stressed vowel and consonant, stress pattern (speech rhythm) German: short consonants, word accentuation, stress pattern (speech rhythm). Theoretical approach Phonological processes Swedish: retroflexation of [r] + [Ê, ˇ, ß, =, ] across morpheme and word boundaries: mer tid (more time), har du (do you have), när som (whenever), har nu (have now). German: final devoicing of [b, d, g] to [p, d, k]: Sieb (strainer), Rad (wheel), Steig (path); initial [s] to [z] See (lake); [s] to [S] in consonant clusters initially: Stein (stone), springen (jump); voicing of medial [s] to [z]: lesen (read); deletion of unstressed [e] in endings -el, -en: Himmel (heaven), Zeiten (times); vocalisation of post vocalic [r] to [å]: Wasser (water); assimilation of place of articulation of [n] to [m], [N] after deletion of unstressed [e]: Lippen (lips), Banken (banks). From long experience we know that phonological transfer is typical of the language learning processes, especially of adult learners. This characteristic phenomenon of foreign accent is caused by the phonological system, including orthography, of L1. However, with our student groups, interferences from L2 and L3 must also be responsible for the deviating pronunciation. Furthermore, contributions of the learners’ interlanguage (Selinker 1972) are to be expected. Therefore each deviating feature in the performance of the students will be coded. Each deviating sample in the speech signal will be labelled according to these codes. Thus it is easy to cross search the whole material and do a thorough inspection and statistics of the observations. This will allow us to make quantified statements about the learning processes. Grapheme-Phoneme-Relationships Swedish: <o> signifies [u] and [o] skola (school), sova (sleep); <å> signifies always [o] måla (paint), åtta (eight); <g, k, sk> becomes palatalised to [J, Ç, Í] gift (poison), källa (well), skinka (ham); initial <h> is not pronounced in <dj-, gj-, hj-, lj-> djup (deep), gjuta (pour), hjul (wheel), ljuga (lie). German: <o> signifies always [o] Sohn (son), Sonne (sun); <z> signifies [t•s] Zahn (tooth); final <b, d, g> become [p, t, k] (final devoicing): Sieb (strainer), Rad (wheel), Steig (path). Contrastive aspects The phoneme systems of vowels and consonants, phonological processes are rather similar in both languages; however, prosody and the grapheme-phoneme relationships show some differences. Only the salient differences will be pointed out. Vowels Swedish: long [A:] gata (street) and [ú:] duk (cloth) , short [Ú] hundra (hundred). German: lax and short [I, Y, U] Mitte (middle), Hütte (hut), Mutter (mother), long [a:] Vater (father), diphthong [a•O] Haus (house). 
Pronunciation norms The impression of foreign accent, to the greatest extent, is caused by segmental and prosodic deviations from the pronunciation norm of the target language. This is spoken with parts of the 76 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University phonology of the first language. The notion of norm, however, stands for a very complex phenomenon. In the pronunciation dictionaries for both languages, the correct pronunciation is given for isolated words spoken very distinctly. Phrase and sentence level perspectives (assimilation and elisions) are not dealt with nor are different speaking styles, speech rhythm and intonation. Swedish: Standard (Stockholm) Swedish, described by Elert in Hedelin (1997). Problems: While Standard Swedish has an apical trill or fricative, Southern Swedish has an uvular trill or fricative and therefore no retroflex consonants. As three main Standard pronunciations are generally accepted in Sweden (Standard, Stockholm or central Swedish, Finland Swedish), both /r/-types are included in the norm. The /r/-type for each speaker is noted. German: Standard German, described in the DUDEN (Umgangslautung, 2001). The r-sound is an uvular trill or fricative [R, Â]. Problems: The unstressed /e/ in the endings <-el, -en> is always dropped in conversational speech and sometimes in formal style too. This gives rise to certain assimilations of the nasal [n]. Postvocalic [R] is always vocalised. Material, recordings and analyses All the material recorded, except the short story, was well known by the students, it was covered during their lessons. The material consisted of three texts, two descriptions of pictures and one short story. A DAT-recorder was used. The recordings were fed into a computer and the analysis of the material was done auditively, supported by the speech wave from the WaveSurfer program. The portion of analysed speech amounts to approximately 45 minutes for the Swedish learners and 35 minutes for the Germans. Preliminary results A representative choice of deviations for each group is shown in the following tables. Group results are presented according to their frequencies of appearance. Together with the code number and the frequency of appearance of each deviation, the target symbol and its replacement (deviation) is given. Swedish target replacement A a å Er P u Vå Vr stressed wrong syllable ys yt Vr 0 o ç ¨ y -d# t ¨ u u <o> ç Os Ot ∆ <g> g -v# f e E -b# p V: V V V: ´ 0 Ó S ç <o> u s z sk S -g∆ å a Coding system Each deviant pronunciation from the norms defined above, due to different causes, is labelled by a special mark, a code number, separately for each language. Code numbers are listed for different areas of interest: vowels, single consonants, consonant clusters, phonological processes, prosody, grapheme-to-phoneme relationships and use of first language forms. Although a number of deviations is identical in both languages, language users show a great variety of different labels. Most of the observed code numbers and their labels are presented below (results). The coding system allows different statistical treatments of the data, especially the quantifying of deviations. Thus it is possible to calculate each learner's profile of pronunciation difficulties, those for each kind of material: read aloud texts, descriptions of pictures and a narrative, for the male and female learners as sub groups and finally for all the learners together. 
This will be an important aspect and the basis for the construction of a self-going program for the learning of pronunciation. 77 code frequency S114 S308 S110 S309 S501 85 83 47 39 36 S104 S231 S112 S107 S302 S108 S111 S105 S409 S304 S102 S301 S116 S117 S305 S213 S113 S212 S416 S415 S404 33 27 27 25 23 18 15 14 11 11 11 10 10 8 7 6 6 5 4 4 3 Ê pÓ A a P s/Vpal Ç/<k>Vpal -t-t= ˛ b eÜ d -gß Ó <sk-> ∆ rt p ç A y S k t•s d rn S p e•i t 0 rs sk dZ S221 S201 S118 S115 S109 S418 S408 S236 S234 S223 S211 S204 S119 S235 S232 S224 S414 S216 3 3 3 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 g/Vpal -er# S eÜ Obstr +Voice S ç s aÜ -ieht ie ie -zts -er -ert d -er9 x E•å Obstr -Voice s x z AÜ Ç a•E i•e S z -r -et t T402 T317 T222 T111 T218 3 3 3 3 2 T215 T214 T212 T112 T418 T415 T414 T223 T220 T219 T216 T209 2 2 2 2 1 1 1 1 1 1 1 1 Acknowledgement German target replacement "zs å <-er> -er -Vå -Vr ç Ç -z-st•s s SC sC V: V yt ys -t# d -(e)l -el C C: V V: eÜ E / 0 Pt Ps E/-r œ -p# -b SCC sCC oÜ OÜ y u -k# g ç x ü u -ig/ç g u y stressed wrong syllable rs ß h h code frequency T206 T308 T314 T203 T207 T202 T304 T104 T101 T302 T306 T505 T105 T110 T504 T113 T102 T301 T305 T103 T411 T303 T208 T412 T409 T405 T501 170 147 114 113 58 43 35 35 32 27 24 15 15 13 12 12 12 11 9 9 8 8 8 6 5 5 4 T320 T413 4 3 The expert and skillful technical support by Thierry Deschamps is gratefully acknowledged. References Bannert Robert. 1979a. Ordstruktur och prosodi. I: Svenska i invandrarperspektiv, pp. 132-173. Hyltenstam Kenneth (ed.). Lund. Bannert Robert. 1980. Phonological strategies in the second language learning of Swedish prosody. PHONOLOGICA 1980, pp. 29-33. Dressler W.U., Pfeiffer O.E. and Rennison J.R. (ed.). Innsbruck. Bannert Robert. 1984. Prosody and intelligibility of Swedish spoken with a foreign accent. Acta Universitatis Umensis 59, pp. 8-18. Elert Claes-Christian (ed.). Bannert Robert. 2004. På väg mot svenskt uttal (including CD-rom). Lund: Studentlitteratur. Bannert Robert. 1999. (with Johannes Schwitalla). Äusserungssegmentierung in der deutschen und schwedischen gesprochenen Sprache. Deutsche Sprache. Zeitschrift für Theorie und Praxis 4, pp. 314-335. Duden. 2001. Aussprachewörterbuch. Mannheim: Dudenverlag. Hedelin Per. 1997. Norstedts svenska uttalslexikon. Stockholm: Norstedts. Selinker Larry. 1972. “Interlanguage”. International Review of Applied Linguistics 10, 20978 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University Word accents over time: comparing present-day data with Meyer´s accent contours Linnéa Fransson and Eva Strangert Department of Philosophy and Linguistics, Umeå University, SE-901 87 Umeå variations between neighboring dialects and, complemented with recent data from the SWEDIA 2000 project (http://www.swedia.nu) to shed light on more recent phonetic developments of the word accents”. Abstract A comparative analysis of new dialect data on word accents in Dalarna and accent contours published by Meyer (1937, 1954) revealed differences indicating a change, primarily in the realization of the grave accent. The change, a delayed grave accent peak, is tentatively seen as a result of a spread towards north-west of word accent patterns formerly characterizing dialects of south-eastern Dalarna. Figure 1. Accent contours for acute (left) and grave (right) representing typical Dalarna dialects. 
Background

Much of the inspiration for research on Swedish word accents can be traced back to the work by Ernst A. Meyer, published in two volumes in 1937 and 1954¹, respectively. Meyer collected his material from one or several speakers for each selected dialect, and in these volumes, original contours from each speaker can be found beside time-normalized and averaged stylized contours for each speaker and dialect.

Meyer's data underlie work on accent-based dialect typologies by Gårding and Lindblad (1973), Gårding (1977) and Bruce and Gårding (1978). The typologies differentiate a number of dialect areas by the number and timing of tonal peaks. The typical Dalarna dialects, the focus of the present study, are characterized as one-peak accents, both the acute and the grave accent having one peak, but with a later timing of the grave accent peak. The typical pattern is illustrated for a two-syllable word in Figure 1, where the accent peak occurs before the VC (syllable) boundary in the acute and at, or very close to, the boundary in the grave accent, the VC boundary indicated by the vertical line in the curve.

Figure 1. Accent contours for acute (left) and grave (right) representing typical Dalarna dialects.

Engstrand and Nyström (2002) set out to study variation within the Dalarna dialects, basing themselves on the stylized contours in Meyer (1954). Their specific purpose was to look for "continuous variation within the broad categories", as this "would provide a possibility to observe small but systematic accentual variations between neighboring dialects and, complemented with recent data from the SWEDIA 2000 project (http://www.swedia.nu), to shed light on more recent phonetic developments of the word accents". To that end they performed measurements on digitized versions of the stylized Meyer curves and reported data on "the location in time of acute and grave tonal peaks relative to the VC boundary as indicated in Meyer's contours". (Engstrand and Nyström used arbitrary time units, as Meyer's contours had no time scale.) Their analysis suggested a specific pattern in the grave accent: the timing of the tonal peaks tended to vary systematically from south-east (relatively late peaks) to north-west (relatively early peaks), see Table 1 and the map in Figure 2. They further hypothesized that this pattern reflected a historical spread from south-east to north-west.

Table 1. Timing of tonal peaks in Dalarna dialects. Negative values represent peaks before and positive values peaks after the VC boundary. (From Engstrand and Nyström, 2002.)

Figure 2. Map of the province of Dalarna.

A pilot study

Data collected in the SWEDIA 2000 project were used to study possible age and gender differences in accent patterns in the Dalarna dialect of Leksand (see Fransson, 2004). The material consisted of two two-syllable words, one acute, dollar /'dol:ar/ 'dollar', and one grave, kronor /'krò:nor/ 'Swedish crowns', produced in final position in a sentence context. The analysis revealed minor age and gender differences; the younger generation tended to have a greater contrast (a greater separation in time) between the two accent peaks. What was surprising, however, was that the majority of the speakers (15 out of 16) had a different timing of the grave accent peak as compared with Meyer's (1937) contours for the Leksand dialect. While in the Meyer curves the acute and grave accent peaks – though separated in time – were both located before the VC boundary, the acute accent peak occurred before, and the grave accent after, the VC boundary in the SWEDIA material. That is, the grave accent peak appeared to be delayed compared to what had previously been reported.

This result raised several questions. Were the results (based on a rather restricted material) reliable, and if so, was the change isolated to the Leksand dialect, or part of a more general shift of accent patterns in Dalarna? What we saw in the data of the more northern dialect of Leksand was a pattern similar to those found in more southern dialects like Ål and Djura by Meyer, see Table 1. This, and previous dialect influences progressing from south-east to north-west in Dalarna (see discussion in Engstrand and Nyström, 2002), gave the inspiration to test the idea of an ongoing change in accent patterns. Thus, to shed light on the possibility of such a change, and also to confirm the previous (pilot study) results, an extended study (see also Fransson, 2004) was undertaken including Leksand speakers and, in addition, speakers from Rättvik, some 20 kilometers north of Leksand, see the map in Figure 2. In addition, a more controlled material was chosen, with target words differing only as to their word accent pattern.

Extending the material – Accent patterns in Leksand and Rättvik

Method

New recordings were made of speakers between 20 and 50 years of age, all having lived in Leksand or Rättvik for most of their lives. Data were collected from both female and male speakers, in total 11 from Leksand and 13 from Rättvik. The speakers were recorded in their homes (or at work or school). The material consisted of two words produced in isolation, Polen /'po:len/ 'Poland' (acute) and pålen /'pò:len/ 'the pole' (grave). They were elicited in random order (together with other words, not reported on here) through cards with the respective words written on them. Each speaker produced at least five repetitions of each word, but some produced as many as eight or even more.

Digitized versions of the material were analyzed, and the location of the f0 peak was measured (in msec) relative to the VC boundary. A position of the peak before or after the boundary resulted in negative and positive values, respectively. In addition to absolute durations, percentages were calculated (the distance (in msec) of the peak from the VC boundary relative to the duration of the segment before or after the boundary) to neutralize speaking rate variation. Peak positions were sometimes difficult to identify; many contours had plateaus rather than peaks, and laryngealizations and other voice quality features added to the problems. Dubious cases were therefore eliminated, and the reported data were reduced to include five speakers from Leksand, two of them female, and six speakers from Rättvik, one of them female.
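As a concrete illustration of the two timing measures just described, the small Python sketch below computes the peak position in milliseconds and as a percentage of the relevant segment duration. The numbers and the helper function are invented for illustration and are not the study's data.

# Peak timing relative to the VC boundary: negative values mean the peak
# precedes the boundary. The percentage relates the distance to the duration
# of the segment on the relevant side of the boundary, which neutralizes
# speaking-rate variation.
def peak_timing(peak_ms, boundary_ms, seg_before_ms, seg_after_ms):
    dist = peak_ms - boundary_ms                     # msec; sign encodes before/after
    seg = seg_after_ms if dist > 0 else seg_before_ms
    return dist, 100.0 * dist / seg                  # (msec, % of segment duration)

# A hypothetical grave token: boundary at 400 ms, a 180 ms vowel before it,
# a 220 ms postvocalic consonant after it, and the f0 peak at 460 ms.
print(peak_timing(peak_ms=460, boundary_ms=400, seg_before_ms=180, seg_after_ms=220))
# -> (60, 27.27...): the peak falls 60 ms after the boundary.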
Results

Table 2 shows the individual results (absolute mean durations and standard deviations) for the two target words produced by the Leksand and Rättvik speakers. (The number of tokens of each word analyzed for each speaker varied between 5 and 14.) Apart from individual durational differences, the pattern is the same for all but one of the speakers: the acute word has its peak located before, and the grave word after, the VC boundary. (Although measurements were made of peak positions both in terms of absolute durations and percentages relative to the VC boundary, only absolute durations are reported here, as very similar patterns resulted from the two types of measurement.)

Table 2. Timing of acute and grave accent peaks (means and standard deviations, in msec from the VC boundary) for 5 Leksand (L1-L5) and 6 Rättvik (R1-R6) speakers. Negative values represent peaks before and positive values peaks after the VC boundary.

Speaker   Acute mean   Acute sd   Grave mean   Grave sd
L1           -60          16           30          29
L2           -52          19           56          27
L3           -66          25           64          19
L4           -54          13           68          13
L5           -80          30           76          36
R1           -93          31           -4          31
R2           -56          31           43          37
R3           -84          34           65          23
R4           -66          17           79          21
R5           -42          22           94          11
R6           -21           6          120          19

Although the Rättvik speakers show a greater spread of means, the grand means for the grave accent word are very similar for the Leksand and Rättvik speakers, 59 msec and 67 msec, respectively. The corresponding grand means for the acute word are -63 msec and -60 msec.

Discussion

Thus, the new material confirms the results of the Leksand pilot study. In addition, the Leksand and Rättvik speakers behave very much in the same way: the acute tonal peak occurs well before the VC boundary and the grave peak after it. The data reported here thus indicate that a change in the realization of the grave accent has taken place since Meyer (1937, 1954) collected his material, a shift towards the pattern of more southern dialects such as Djura and Grangärde. In Figure 3, containing the contours presented by Meyer (1954), the arrows indicate the average grave-accent peak locations of our speakers. These locations were calculated as percentages relative to the entire duration of the second syllable and were 17% and 20% for Leksand and Rättvik, respectively.

Figure 3. Accent contours (from Meyer, 1954). Arrows indicate the positions of the grave-accent peak in the present data.

Such a change among the more northern dialects is supported by an analysis of SWEDIA material (dollar and kronor) from speakers of Malung, a location west of Leksand and Rättvik. Just as for Leksand and Rättvik, the grave-accent peak had "moved" to a position after the VC boundary, while in Meyer's data it was located before it.

General discussion – Are accent realizations changing?

Although there are no time scales in the contours supplied by Meyer (1937 and 1954), it is possible to conclude from our data that the grave tonal peaks today take another position within the word. The vertical line in Meyer's contours gives the position of the VC boundary, and we can clearly see that the acute accent peak remains within the syllable before the boundary, while the grave peak has crossed it. Moreover, the change does not appear to be restricted to one single dialect, but occurs in all three studied here. This speaks for a more general trend among the dialects of Dalarna, or at least those in the region between the northern and the southern part. We might conjecture a spread of the more southern type of realization
of the grave accent to the north and north-west. Thus, in the light of the varying grave accent peak locations demonstrated by Engstrand and Nyström (2002), the southern type of accent realization (represented by Djura, Ål and Grangärde) would have progressed further to the north and north-west.

That a change has taken place seems obvious when comparing the present-day data with the stylized contours in the second volume of Meyer's work (1954). However, the contours appearing in the 1937 volume are not exactly the same as the later ones. Though stylized, too, they are somewhat more detailed and show alternative peak locations for acute as well as grave accents. The grave-accent peaks tend either to co-occur with the VC boundary or to be located to the right of the boundary. That is, some of the peaks have a similar position as in the present-day data. This occurs for both the Leksand and the Rättvik dialect. The variation in the 1937 stylized contour drawings corresponds to variation in the original tone curves as registered and measured by Meyer. Figure 4 shows examples of this variation in grave accent curves by one speaker from Leksand. The stylized drawings in the 1954 volume thus hide some of the variation appearing in the raw data (and in the less simplified stylizations in the 1937 volume).

Figure 4. Grave accent contours (Leksand speaker 12, Per Jonsson; Meyer, 1937).

Peaks located after the VC boundary thus occur also in Meyer's data. However, while Meyer's speakers varied in their location of the peak (just before the VC boundary or following it at a short distance), the speakers in the present study are very consistent in locating the peak after the boundary, about 60 msec on average. (The single exception, speaker R1, has a mean grave accent peak 4 msec before the boundary.) A change in accent realizations therefore seems to have taken place among the dialects studied here. A reasonable assumption then is that accent patterns that in the past characterized more southern dialects have spread to the north and north-west. Such a spread would not be unreasonable in the light of other examples of innovations having spread from south-east to north-west in Dalarna, see the discussion in Engstrand and Nyström (2002).

Conclusions

A comparative analysis of present-day dialect data on word accents in Dalarna and accent contours published by Meyer (1937, 1954) has revealed differences indicating a change in the realization of the grave accent. This change, a delayed grave-accent peak, is tentatively seen as a result of a spread towards the north-west of accent patterns formerly characterizing dialects of south-eastern Dalarna. Clearly, however, this assumption has to be confirmed by extending the material for analysis.

Acknowledgements

We are grateful to Olle Engstrand and Gunnar Nyström for allowing us to include Figure 2 in this study. This work has been supported by a grant from the Bank of Sweden Tercentenary Foundation, 1997-5066.

Notes

1. This volume was published posthumously.

References

Bruce G. and Gårding E. (1978) A prosodic typology for Swedish dialects. In Gårding E., Bruce G. and Bannert R. (eds) Nordic Prosody, 219-228. Department of Linguistics, Lund University.
Engstrand O. and Nyström G. (2002) Meyer's accent contours revisited. TMH-QPSR 44, 17-20.
Fransson L. (2004) Fyra daladialekters ordaccenter i tidsperspektiv: Leksand, Rättvik, Malung och Grangärde. Thesis work in phonetics, Umeå University.
Gårding E. (1977) The Scandinavian word accents. Malmö: CWK Gleerup.
Gårding E. and Lindblad P. (1973) Constancy and variation in Swedish word accent patterns. Working Papers 7. Phonetics Laboratory, Lund University.
Meyer E. A. (1937) Die Intonation im Schwedischen I. Stockholm: Fritzes förlag.
Meyer E. A. (1954) Die Intonation im Schwedischen II. Uppsala: Almqvist & Wiksell.

Multi-sensory information as an improvement for communication systems efficiency
Lacerda F., Klintfors E. and Gustavsson L.
Department of Linguistics, Stockholm University

Abstract

The paper addresses the issue of extraction of implicit information conveyed by systematic audio-visual contingencies. A group of adult subjects was tested on a simple inference task provided by short film sequences. The video materials were encoded and submitted to processing by two neural networks (NN) that simulated the results of the adult subjects. The results indicated that the adult subjects were extremely efficient at picking up the underlying information structure and that the NNs could also perform acceptably on both classification and generalization tasks.

Introduction

Language acquisition can be described as a process through which infants derive the underlying linguistic structure of their ambient language. In spite of the complexity and variability of the language input, it is an undeniable fact that within about two years of life, typical infants are able to pick up the linguistic regularities of the ambient language. Making sense of linguistic information that is implicitly conveyed in a diversity of speech communication situations appears to be such an insurmountable task that researchers are prone to consider that some sort of initial guidance is necessary to home in on the ambient language's underlying principles (Chomsky, 1968; Pinker, 1994). The present paper attempts to challenge this established view by sketching a scenario in which linguistic information may be derived in the absence of pre-knowledge or dedicated linguistic biases. Indeed, language can be seen as an emergent consequence of the interplay between the infant and its environment, where the richness and structure of the sensory flow may contain enough information to trigger language development (Jusczyk, 1985; Elman, Bates, Karmiloff-Smith, Parisi and Plunkett, 1997). More explicitly, the language acquisition hypothesis to be tested in this paper relies on the assumption that linguistic structure is implicit in the multi-sensory information available to the young infant (Lacerda et al., 2004a; Lacerda, 2003; Lacerda and Lindblom, 1997).

In the early stages of language acquisition the infant and the adult tend to communicate about objects or occurrences in the infant's immediate neighbourhood. Although the speech signal that the infant is exposed to may indeed refer to absent objects or abstract concepts, the gist of infant-directed speech tends to be focused on very tangible objects that the adult assumes to be in the infant's focus of attention. Under such ecologically relevant scenarios of adult-infant interaction, there is an almost inevitable correlation between what the infant hears and its visual, tactile or gustatory sensations. In other words, because spoken language is used to refer to objects or events, the sound structure of the speech signal representing those referents must be highly correlated with at least some of the sensory representations of the objects or events it refers to. As a consequence, the very co-occurrence of certain sequences of speech sounds and sensory representations of the objects they are associated with conveys implicit information on the speech signal's linguistic referential function.

To be sure, the relationship between the auditory representation of the speech signal and the sensory representations of its referents is far from deterministic. There is no guarantee that a given instance of the speech signal will be referring to the object that happens to be in the young infant's field of attention. On the other hand, because the assumption of poverty of the stimulus implies that the probability of recurrent matches between the auditory representation of the speech signal and the sensory representation of its referent is vanishingly small, given language's potentially unlimited combinatorial possibilities, even a barely resembling repetition of the co-occurrence of the speech signal with its referent is extremely significant. Indeed, repetition of shorter or longer utterances is the hallmark of speech directed to very young infants (Lacerda et al., 2004b).

Combining Auditory and Visual Information in a neural network (NN) model

The NN model in this study combines visual and auditory information. The model is based on data used to test adults' spontaneous propensity to extract latent information from a short video sequence. Tests of young infants' ability to extract referential information from audio-visual contingencies are also being conducted and will be reported later.

The videos shown to the adults (Figure 1) were about 24 seconds long and consisted of four sequences, corresponding to the four combinations in Figure 1: a red cube shown moving smoothly across the screen, a red circle performing the same motion, a yellow cube, and a yellow circle, all following the standard path. Two-word nonsense sentences, created by concatenating nonsense words that had been arbitrarily assigned to represent the colour and the shape of the objects, were played along with these figures.

Figure 1. Illustration of the film sequences shown to the adults (Nela Dule 'red cube', Nela Bima 'red circle', Lame Dule 'yellow cube', Lame Bima 'yellow circle'). The four pictures each represent a 6-second film sequence. The objects moved smoothly across the screen while the sound track played two repetitions of the two-word sentences formed by the non-words that had been arbitrarily assigned to represent the colours and the shapes of the objects.

After exposure to the materials, the adult subjects were presented with an answer sheet where questions concerning the meaning of the nonsense words used in the videos were embedded in a number of foils containing spurious words and situations.

Reference data from the adult subjects

A group of 21 adult subjects participated in the simulated "language learning" experiment described above. The subjects were asked to sit in front of a blank computer screen and, without further instructions, the video sequences were started. After this exposure the subjects were asked to describe what they had just seen and heard. Most of the subjects referred spontaneously to the embedded meanings of the non-words. After this first phase, the video sequence that the subject had just seen was played in a loop while the subject attempted to answer multiple-choice questions concerning the nonsense names of the colours and the shapes of the objects. Whatever responses the subjects produced were used as indications of the learned sound-object attribute pairing. The task was obviously very simple for the adult subjects, and it turned out that the vast majority of the responses simply reflected the built-in contingencies embedded in the stimuli. The results of the 21 subjects showed 97.6% correct generalization (see Figure 2). The errors made by two subjects were incorrectly named shapes of the objects. The colour attribute generalization was 100% correct.

Figure 2. Reference data provided by the adult subjects: percentage correct discoveries of the meaning of the non-words representing the colours and the shapes of the objects. Only in two cases were errors made by the subjects.
The auto-association model

To carry out the NN simulation, two types of feedforward architectures were constructed. The first – an auto-association model (Figure 3) – was intended to simulate a priori knowledge, the task of the NN simply being to associate the colours and the shapes of the visual objects with the non-words corresponding to these two attributes, i.e. to reproduce its input at the output level. Exactly the same set of data formed the input and the output patterns of the NN. The 96-bit input vectors encoded the visually shown colour (bits 1 to 24), the visually shown shape (bits 25 to 48), the auditorily presented non-word standing for the colour (bits 49 to 72), and the auditorily presented non-word standing for the shape (bits 73 to 96). The hidden layer consisted of two hidden units.

Figure 3. Schematic architecture of the auto-association NN aimed at simulating a priori knowledge (96 input units, 2 hidden units, 96 output units). The task of the NN was to reproduce its input at the output level.

In the first run of the NN, the information from the input layer was separated into two different receptive fields – one of them corresponding to vision and the other to audition. In the second run of the NN, all the information from the input layer was passed to each of the two hidden nodes. Training was done sequentially, by presenting the model with a visual colour and shape parameter set and its associated non-word auditory set. The NN was able to find redundancies in the distributed data, and accordingly all the input patterns were correctly categorized. The performance of the NN was stable regardless of the way of passing input information to the hidden units.

The generalization model

The second NN architecture (Figure 4) attempted to simulate the fact that the adult subjects, in addition to being able to discriminate the colours and the shapes of the objects, also learned the concepts conveyed by the non-words and were readily able to apply them to new contexts. This NN also had 96-bit vectors as input. The hidden layer consisted of four hidden nodes, the first and second of them receiving information from the visual half of the input nodes, and the third and fourth receiving information from the auditory half of the input nodes. The NN was in this way structured into two input channels so as to simulate two kinds of correlated sensory input.

Figure 4. Schematic architecture of the NN with two sensory input channels (96 input units, 2 + 2 hidden units, 96 output units) aimed at simulating learning. The task of the NN was to generalize its knowledge of the colours and shapes of figures by recognizing familiar colours despite novel figures, as well as recognizing novel-coloured familiar figures.

The performance of the NN was tested with the help of data not previously shown to the NN. The novel data consisted of a blue cube and a green circle (new colours for familiar shapes) as well as a red cone and a yellow cylinder (new shapes for familiar colours), as illustrated in Figure 5.

Figure 5. Illustration of the novel data used to test the NN's ability to generalize its knowledge in a new context (Nela Duma 'red cone', Lame Guma 'yellow cylinder', Neme Dule 'blue cube', Gale Bima 'green circle'). Before the experiment these non-words were arbitrarily assigned to represent the novel colours and the novel shapes of the objects.

The output activations showed that the NN was able not only to associate the non-words corresponding to visually presented colours and shapes of the objects, but also to generalize its knowledge to a new context. In a second run of the NN, all the input information was passed to each of the four units at the hidden layer. Just as in the auto-association simulation above, the results of this run were not affected by the change in the way of passing information to the hidden units.

Figure 6. Results of the NN performance: percentage correct generalizations of the meaning of the non-words representing the colours and the shapes of the objects. The results indicate that generalization of shapes was slightly more robust than generalization of colours.
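The auto-association setup described above can be sketched in a few lines of NumPy. The bit layout follows the paper (24 bits each for seen colour, seen shape, heard colour word and heard shape word; two hidden units; 96 outputs), but the random attribute codes, the learning rate and the training loop are assumptions of this sketch, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def code():
    # A random 24-bit pattern standing in for one attribute or non-word.
    return rng.integers(0, 2, 24).astype(float)

colours = {"nela": code(), "lame": code()}   # red, yellow
shapes = {"dule": code(), "bima": code()}    # cube, circle

# Four stimuli: visual colour + visual shape + heard colour + heard shape = 96 bits.
X = np.array([np.concatenate([colours[c], shapes[s], colours[c], shapes[s]])
              for c in colours for s in shapes])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(0.0, 0.1, (96, 2)); b1 = np.zeros(2)
W2 = rng.normal(0.0, 0.1, (2, 96)); b2 = np.zeros(96)
for _ in range(20000):                      # plain gradient descent on squared error
    H = sigmoid(X @ W1 + b1)                # two hidden units
    Y = sigmoid(H @ W2 + b2)                # 96-bit reconstruction
    dY = (Y - X) * Y * (1.0 - Y)
    dH = (dY @ W2.T) * H * (1.0 - H)
    W2 -= 0.5 * H.T @ dY; b2 -= 0.5 * dY.sum(axis=0)
    W1 -= 0.5 * X.T @ dH; b1 -= 0.5 * dH.sum(axis=0)

# Mean absolute reconstruction error; it typically falls well below the 0.5
# chance level, since each output bit depends on only one latent attribute.
print(np.abs(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - X).mean())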
Discussion

The ultimate goal of this study is to investigate how human infants might be able to extract implicit information from their experience with audio-visual stimuli. At this stage we simply ran a pilot study using adult subjects who were asked to watch short video sequences and subsequently requested to answer questions related to the implicit information potentially conveyed by the audio-visual stimuli. Although the adult subjects did not receive any instructions (in an attempt to make the situation more comparable to that of the infants), the subjects had no difficulty in inferring the underlying structure right at the first inquiry. The situation created by these stimuli was obviously too simple for the adult subjects, who could not avoid thinking of the goals and the structure of the stimuli as soon as they were put in the experimental situation.

Our next question concerns the extent to which infant subjects may also be able to detect and generalize the implicit audio-visual consistencies. Although we do not yet have data from the infants, and we expect the infants' performance to be age-dependent, it is reassuring that the NNs' performance mimics the adults' results so well. This leads us to envisage the future infant speech perception experiments as a means to evaluate the potential significance of NN models in accounting for the grounds of linguistic development departing from general-purpose association mechanisms. In general the outcome of a NN is dependent on its architecture, but our results do not suggest critical dependence on either of the two architectures tested. To be sure, the stimuli used in this first experiment are likely to be too simple to fully demonstrate relevant language development relying on general-purpose associative mechanisms. Therefore our current experiments with infants are being conducted using audio-visual contingencies that attempt to replicate ecologically relevant communication settings.

Acknowledgements

This work was supported by grants from the Swedish Research Council, the Bank of Sweden Tercentenary Foundation and Birgit & Gad Rausing's Foundation.

References

Chomsky N. (1968) Language and mind. New York: Harcourt Brace Jovanovich.
Elman J., Bates E., Karmiloff-Smith A., Parisi D. and Plunkett K. (1997) Rethinking innateness. Cambridge, Massachusetts: MIT Press.
Jusczyk P. (1985) On characterizing the development of speech perception. In Mehler J. and Fox R. (eds) Neonate cognition: Beyond the blooming, buzzing confusion, 199-299. Hillsdale, New Jersey: Lawrence Erlbaum.
Lacerda F. (2003) Phonology: An emergent consequence of memory constraints and sensory input. Reading and Writing: An Interdisciplinary Journal 16, 41-59.
Lacerda F., Klintfors E., Gustavsson L., Lagerkvist L., Marklund E. and Sundberg U. (2004a) Ecological theory of language acquisition. In Genova: Epirob 2004, 147-148.
Lacerda F. and Lindblom B. (1997) Modelling the early stages of language acquisition. In Olofsson Å. and Strömqvist S. (eds) Cross-linguistic studies of dyslexia and early language development, 14-33. Office for Official Publications of the European Communities.
Lacerda F., Marklund E., Lagerkvist L., Gustavsson L., Klintfors E. and Sundberg U. (2004b) On the linguistic implications of context-bound adult-infant interactions. In Genova: Epirob 2004, 149-150.
Pinker S. (1994) The Language Instinct: How the Mind Creates Language. New York: William Morrow and Company.

Effects of Stimulus Duration and Type on Perception of Female and Male Speaker Age
Susanne Schötz
Department of Linguistics and Phonetics, Lund University
Abstract

Our ability to estimate speaker age was investigated with respect to stimulus duration and type as well as speaker gender in four listening tests with the same 24 speakers but four different types of stimuli (ten and three seconds of spontaneous speech, one isolated word, and six concatenated isolated words). Results showed that the listeners' judgements were about twice as accurate as chance, and that stimulus duration and type affected the judgements. Moreover, stimulus duration affected the listeners' judgements of female speakers somewhat more, while stimulus type affected the judgements of male speakers more, indicating that listeners may use different strategies when judging female and male speaker age.

Introduction

Most of us are able to make fairly accurate estimates of an unknown speaker's chronological age from hearing a speech sample (Shipp and Hollien, 1969; Linville, 2001). This paper addresses the question of how much and what kind of speech information we need to make as good estimates of speaker age as possible.

Background and previous studies

In age estimation, the accuracy depends, among other things, on the precision required and on the duration and type of the speech sample (prolonged vowel, read speech etc.). The less acoustic information present in a speech sample, the more difficult the task, but even with very little information, listeners are still not reduced to random guessing. Speaker and listener characteristics, including gender, age group, the speaker's physiological and psychological state, and the listener's experience or familiarity with similar speakers (dialect etc.), may also influence the accuracy (Ramig and Ringel, 1983; Linville, 2001). Consequently, some speakers may be more difficult to judge than others.

A considerable amount of research has been devoted to the issue of age recognition from speech (Ptacek and Sander, 1966; Huntley et al., 1987; Braun and Cerrato, 1999; Linville, 2001; Brückl and Sendlmeier, 2003). Unfortunately, these studies are often difficult to compare due to differences in the stimuli as well as in the method. Differences concern (1) language, (2) stimulus duration, (3) type of speech (prolonged vowels, whispered vowels, single words, read, backward or spontaneous speech etc.), (4) sound quality (HiFi, telephone-transmitted etc.), (5) speaker age and gender, (6) listener age and gender, (7) recognition task (classification into 2, 3 or 7 age groups, direct magnitude estimation etc.), and (8) result measure (correlation, absolute mean error, % correct etc.).

Another question concerns whether listeners use different strategies when estimating female and male speaker age, since women and men age differently (Higgins and Saxman, 1991). In a study of automatic age estimation of elderly speakers, Müller et al. (2003) successfully built gender-specific age classifiers. The author (Schötz, 2005) found differences between female and male speakers in both human and machine age recognition from single-word stimuli. While F0 was a better cue for estimation of female age, the formants seemed to constitute better cues when judging male age. One possible explanation is that the characteristics of female voices appear to be perceived as more complex than those of male speech (Murry and Singh, 1980), suggesting that listeners would need either a partly different set or a larger number of phonetic cues when judging female age.

Purpose and questions

The purpose of this study was to determine how stimulus duration and two different stimulus types (isolated words and spontaneous speech) influence estimation of female and male speaker age by answering the following questions:
1. In what way do stimulus duration and type affect the accuracy of listeners' perception of speaker age?
2. Is there a difference between perception of female and male speaker age with respect to stimulus duration and type?

Material and method

Six speakers each from four different groups – older women (aged 63-82), older men (aged 60-75), younger women (aged 24-32) and younger men (aged 21-30) from the southern part of Sweden (Götaland) – were selected randomly from the SweDia 2000 database (Bruce et al., 1999), which contains native speakers of Swedish. For each of the 24 speakers, four different speech samples were extracted, normalized for intensity, and used in the four perception tests:
• Test 1: about 10 seconds of spontaneous speech
• Test 2: about 3 seconds of spontaneous speech
• Test 3: a concatenation of 6 isolated words: käke (jaw), saker (things), själen (the soul), sot (soot), typ (type) and tack (thanks) (dur. ≈ 4 sec.)
• Test 4: one isolated word: rasa (collapse) (dur. ≈ 0.65 sec.)

Four separate listening tests (one for each of the four sets of stimuli) were carried out. Two listener groups participated in one test each, while a third group took part in two of the tests.
The gender and age distribution for the three groups is shown in Table 1, along with information on which test and set of stimuli each group was presented with. All subjects were students of phonetics at Lund University. The task was to make direct age estimations based on first impressions of the 24 stimuli, which were played only once, in the same random order in all four tests, using an Apple PowerBook G4 with Harman Kardon SoundSticks loudspeakers. The listeners were also asked to name cues which they believed had affected their judgements.

Table 1. Test number, stimuli set, number of listeners (N), and gender (F/M) and age distribution of the listener groups in the four tests.

Test (stimuli)   N    F    M   Age range (mean/median)
1 (10 sec.)     31   18   11   19-65 (27/23)
2 (3 sec.)      33   22   11   19-57 (25/23)
3 (6 words)     37   33    4   19-55 (26/24)
4 (1 word)      37   33    4   19-55 (26/24)

Results

Accuracy

Figure 1 displays the mean absolute error, i.e. the average of the absolute difference between perceived age (PA) and chronological age (CA) in years, for female, male and all speakers in the four tests. The listeners' judgements were about twice as accurate as a baseline estimator, which judged all speakers to be 47.5 years old (the mean CA of all speakers), in the first three tests. In Test 4, the shortest (1 word) stimuli yielded results at levels approximately half-way between the baseline and the other tests.

Figure 1. Mean absolute error for the 4 sets of stimuli for female, male and all speakers.

The sum, mean and median values of the errors for all speakers in the four tests, as well as for the baseline, are shown in Table 2. In all tests, the listeners' judgements of women were more accurate than those of men. The highest accuracy was obtained for the female 10-second stimuli (6.5), while the male 6-word stimuli received the lowest accuracy (15.3). Moreover, the listeners tended to overestimate the younger speakers and to underestimate the older speakers.

Table 2. Sum, mean and median error values (in years) for all speakers in the four tests and for the baseline (BL).

Test       sum     mean   median
1 (10s)   196.5     8.2     7.2
2 (3s)    256.1    10.7    10.0
3 (6w)    277.6    11.6    10.0
4 (1w)    348.7    14.5    16.7
BL        497.0    20.7    19.5
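The accuracy measure and the baseline used above can be stated compactly in Python; the ages below are invented illustration values, not the study's data.

import numpy as np

ca = np.array([24, 29, 31, 63, 70, 78])   # chronological ages (years)
pa = np.array([30, 34, 36, 57, 62, 69])   # one listener's perceived ages

mae = np.abs(pa - ca).mean()              # mean absolute error
baseline = np.abs(ca.mean() - ca).mean()  # always guessing the mean CA

print(f"listener MAE: {mae:.1f} years; baseline MAE: {baseline:.1f} years")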
Stimulus and speaker gender effects

The listeners' mean absolute errors were subjected to two separate analyses of variance. In the first analysis, speaker gender and speaker age (old or young) were within-subject factors, and stimulus duration (short (1 word), medium (6 words and 3 sec.), long (10 sec.)) was the between-subjects factor. In the second analysis, the between-subjects factor was stimulus type (spontaneous or word stimuli) instead of stimulus duration.

Stimulus duration

Longer stimulus durations led to significantly higher accuracy (F(2,100)=71.059, p<.05). A difference between the female and male speaker judgements was also observed: accuracy for longer stimuli improved more for the female than for the male speakers. For the female speakers, a lower error was observed for the 10 sec. stimuli (6.5) than for the 3 sec. stimuli (9.7), and the error for the 6 word stimuli (7.9) was lower than for the 1 word stimuli (13.9). The difference between longer and shorter stimulus durations was much smaller for the male speakers, with an error of 9.9 for the longest (10 sec.) stimuli, higher errors for the medium-long 3 sec. and 6 word stimuli (11.6 and 15.3), and a similar error for the 1 word stimuli (15.1). This interaction of speaker gender and stimulus duration was, however, not significant (F(2,100)=2.171, NS).

Stimulus type

Stimulus type also influenced the age estimations significantly (F(1,68)=61.143, p<.05). The listeners' judgements of the male speakers were more accurate for the spontaneous stimuli than for the word stimuli: lower mean absolute errors were obtained for the two sets of spontaneous stimuli (9.9 and 11.6) compared to the two sets of word stimuli (15.3 and 15.1). This effect was not observed for the female speakers. Here, the mean absolute error for the 6 word stimuli (7.9) was lower than for the 3 second spontaneous stimuli (9.7), but higher than for the longer spontaneous stimuli (6.5). The interaction of speaker gender and stimulus type was significant (F(1,68)=39.296, p<.05).

Listener cues

Most of the listeners named several cues which they believed had influenced their age judgements. Dialect, pitch and voice quality affected the listeners' estimates in all four tests, while semantic content influenced the judgements in the tests with spontaneous stimuli. A common listener remark in the tests with spontaneous stimuli concerned speakers talking about the past: they were often judged as being old, regardless of other cues. Additional listener cues included speech rate, choice of words or phrases, and experience or familiarity with similar speakers (age group, dialect etc.).

Discussion

Despite the limited number of stimulus durations and types investigated, a few interesting results were found. These are discussed below, along with a few suggestions for future work.

Accuracy

The listeners performed significantly better than the baseline estimator (about twice as good) in three of the tests, which is in line with previous work. However, it remains unclear what accuracy levels can be expected from listeners' judgements of age. Differences in speakers' CA have to be taken into account as well. A mean absolute error of 10 years could be considered less accurate for a 30-year-old speaker (a PA of 20 could be regarded as 20/30 = 66.7% correct) than for an 80-year-old speaker (a PA of 70 could be regarded as 70/80 = 87.5% correct). There is a need for a better measure of accuracy for age estimation tasks.
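One possible formalization of this CA-relative view of accuracy is sketched below; the ratio measure is an assumption of this sketch, offered only to make the worked percentages above reproducible, not an established measure.

def relative_accuracy(perceived, chronological):
    """Ratio-based accuracy: 1.0 is a perfect estimate."""
    return min(perceived, chronological) / max(perceived, chronological)

print(relative_accuracy(20, 30))   # 0.667 -> 66.7% for the 30-year-old example
print(relative_accuracy(70, 80))   # 0.875 -> 87.5% for the 80-year-old example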
The fact that three different listener groups participated in the tests may also have influenced the accuracy. In all four tests, the listeners' estimations of women were more accurate than those of men, perhaps because the listeners were mainly women. However, the influence of listener gender on performance in age estimation tasks is still unclear. Although most researchers have not reported any difference in performance between male and female listeners, some studies have found females to perform better than males, while others still have found male listeners to perform somewhat better (Braun and Cerrato, 1999). Another explanation could be that the male speaker group contained a larger number of atypical speakers, who consequently would be more difficult to judge, than the female speaker group. Shipp and Hollien (1969) found that speakers who were difficult to age-estimate had standard deviations of nine years and over. Perhaps such a measure can be used to decide whether speakers are typical representatives of their CAs or not.

Stimulus effects

In this study, longer durations for the most part yielded higher accuracy for the listeners' age estimates. This raises the question of optimal durations for age estimation tasks: when does a further increase in duration for a specific speech or stimulus type no longer result in higher accuracy? Further studies with a larger and more systematic variation of stimulus duration for each stimulus type are needed to answer this question.

Significant effects for both accuracy and speaker gender differences were found for the two stimulus types in this study. However, isolated words and spontaneous speech can be difficult to compare in a study of speaker age. Several listeners mentioned that the semantic content of the spontaneous stimuli influenced their age judgements, which may explain why the male speaker spontaneous stimuli yielded higher accuracy compared to the word stimuli. Besides providing more information about the speaker (dialect, choice of words etc.), spontaneous speech is also likely to contain more prosodic and spectral variation than isolated words. However, for the female speakers, the lower accuracy obtained for the 3 second spontaneous stimuli compared to the only slightly longer 6 word stimuli cannot be explained by stimulus type effects alone. It would be interesting to compare a larger number of speech types in search of the types best suited for both female and male speaker age estimation tasks. Future work should include studies where several different speech types are compared and varied more systematically with respect to phonetic content and quality as well as variation and dynamics.

Speaker gender effects

As already mentioned, there were differences between female and male speakers with respect to which stimulus types and durations yielded higher age estimation accuracy. One explanation for these differences may be that listeners use different strategies when judging female and male speaker age.
As suggested in Schötz (20005), it is possible that listeners use more prosodic cues (mainly F0) when judging female speaker age, but that spectral cues (i.e. formants, spectral balance etc.) are preferred when judging male speaker age. Consequently, the results from this study may indicate that for male speakers, spontaneous stimuli provide the listeners with more spectral information, while longer stimuli contain more prosodic information needed to estimate female speaker age more accurately. The differences in perception of female and male speaker age has to be studied further, and speaker gender has to be taken into considera90 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University Effects of age of learning on VOT in voiceless stops produced by near-native L2 speakers1 Katrin Stölten Centre for Research on Bilingualism, Stockholm University of verbal L2 exposure during a limited time frame for phonetic sensitivity, or if nativelike perception and an accent-free pronunciation is biologically possible for any adult learner, given the right social, psychological, and educational circumstances. This study is part of an ongoing project on age of onset (AO) and ultimate attainment in L2 acquisition. The project focuses on early and late learners of Swedish with Spanish as their L1, who have been selected on the criterion that they are perceived by native listeners as mother-tongue speakers of Swedish in everyday oral communication. These nativelike candidates’ L2 proficiency has thereafter been tested for various linguistic levels and skills. The present study reports on analyses of Voice Onset Time (VOT) in the production of Swedish voiceless stops. Swedish and Spanish both distinguish voiced from voiceless stops in terms of VOT but the languages differ as to where on the VOT continuum the two stop categories separate. In languages like Swedish and English short-lag stops are treated as voiced, whereas long-lag stops are classified as voiceless. In contrast, Spanish treats short-lag stops as voiceless, while stops with voicing lead are categorized as voiced (e.g. Zampini & Green 2001). Due to the fact that L2 learners in general show difficulties in correctly perceiving and producing these language specific categories (see, e.g. Flege, 1991), the analysis of VOT production seems to be a good tool for investigating the nativelike subjects’ L2 proficiency. For the present study the following research questions were formulated: Abstract This study is concerned with effects of age of onset (AO) of acquisition on the production of Voice Onset Time (VOT) among near-native L2 speakers. 41 L1 Spanish early and late learners of L2 Swedish, who had carefully been screened for their “nativelike” L2-proficiency, participated in the study. 8 native speakers of Swedish served as control group. Spectral analyses of VOT were carried out on the subjects’ production of the Swedish voiceless stops /p t k/. The preliminary results show an overall age effect on VOT in the nativelike L2 speakers’ production of all three stops (answer to Research Question 1). Among the late learners only a small minority exhibits actual nativelike L2 behavior (answer to Research Question 2). Finally, far from all early L2 speakers do pass as native speakers of their L2 regarding the production of voiceless stops (answer to Research Question 3). Introduction Several studies on infant perception have shown that first language (L1) phonetic categories are already established during the first year of life (e.g. 
Eimas et al. 1971, Werker et al. 1984). Further evidence that very early exposure is of importance in L1 development comes from children who were deprived of verbal input due to inflammation of the middle ear during their first year of life. Ruben (1997) reports that these children showed significantly less capacity for phonetic discrimination, compared to children with normal hearing during infancy, when they were tested at the age of nine years. From these findings it has been concluded that there may exist a critical period for phonetic/phonological acquisition and that this critical period may already be over at the age of one year (Ruben 1997).

One classical issue in the field of language acquisition concerns whether this theory of the existence of a critical period can be applied to second language (L2) acquisition. More precisely, the question is whether L2 learners typically fail to acquire phonetic detail because of lack of verbal L2 exposure during a limited time frame for phonetic sensitivity, or whether nativelike perception and an accent-free pronunciation is biologically possible for any adult learner, given the right social, psychological, and educational circumstances.

This study is part of an ongoing project on age of onset (AO) and ultimate attainment in L2 acquisition. The project focuses on early and late learners of Swedish with Spanish as their L1, who have been selected on the criterion that they are perceived by native listeners as mother-tongue speakers of Swedish in everyday oral communication. These nativelike candidates' L2 proficiency has thereafter been tested for various linguistic levels and skills. The present study reports on analyses of Voice Onset Time (VOT) in the production of Swedish voiceless stops. Swedish and Spanish both distinguish voiced from voiceless stops in terms of VOT, but the languages differ as to where on the VOT continuum the two stop categories separate. In languages like Swedish and English, short-lag stops are treated as voiced, whereas long-lag stops are classified as voiceless. In contrast, Spanish treats short-lag stops as voiceless, while stops with voicing lead are categorized as voiced (e.g. Zampini & Green 2001). Since L2 learners in general show difficulties in correctly perceiving and producing these language-specific categories (see, e.g., Flege, 1991), the analysis of VOT production seems to be a good tool for investigating the nativelike subjects' L2 proficiency.

For the present study the following research questions were formulated:
(1) Is there a general age effect on VOT production among L2 learners who are perceived by native listeners as native speakers of Swedish?
(2) Are there late L2 learners who produce voiceless stops with an average VOT within the range of native-speaker VOT?
(3) Do all (or most) early L2 learners produce voiceless stops with an average VOT within the range of native-speaker VOT?

Method

Subjects

A total of 41 native speakers of Spanish (age 21-52 years), who had been selected on the criterion that native listeners perceive them as mother-tongue speakers of Swedish, participated in this study. The nativelike subjects' mean length of residence (LOR) in Sweden was 24 years (range 12-44 years), and their age of onset (AO) of L2 acquisition varied between 1 and 19 years. Furthermore, the subjects showed an educational level of no less than senior high school, and they had all acquired the variety of Swedish spoken in the greater Stockholm area. A control group was added consisting of 8 native speakers of Swedish² who had been matched with the experimental group regarding present age, educational level and variety of Swedish.

Material and procedure

Both the nativelike and the native speakers of Swedish were tested individually in a sound-treated room. The subject was instructed to read aloud the following three Swedish words with /p t k/ in initial position: par ([pɑ:r], 'pair, couple'), tal ([tɑ:l], 'number, speech') and kal ([kɑ:l], 'naked, bald'). Each word was pronounced ten times, with two-second pauses in between. The experiment leader, a male native speaker of Stockholm Swedish, indicated the reading rate to the subject by visually keeping count with his fingers. All readings were recorded through a KOSS R/50B headset microphone at 22,050 Hz with 16-bit resolution. Using the Soundswell signal analysis package, measurements of VOT were conducted on the basis of broadband spectrograms and the oscillographic display.
VOT was measured as the time interval between the release burst of the stop consonant and the onset of voicing in the following vowel. As has been reported in several studies (e.g. Kessinger & Blumstein 1997, Miller et al. 1986), speaking rate has an effect on both VOT and vowel duration. Findings like these suggest that VOT should be treated in relation to syllable or word duration rather than independently. Therefore, measurements of word length were also carried out. Furthermore, two standard deviations from the average word length were calculated in order to exclude extremely short and long word durations from further analysis: one subject (AO=3) was excluded entirely, as well as par for a subject with AO=9, tal for a subject with AO=3, and kal for a subject with AO=15.

Results

Since it is a well-known fact that VOT varies with place of articulation (see, e.g., Lisker & Abramson, 1964), results for the three voiceless stops are presented separately. Figures 1-3 show the subjects' average VOT values (in ms) plotted against their age of onset (AO).

Figure 1. VOT (in ms) for /p/ in relation to age of onset (AO). Unfilled diamonds: native speakers of Swedish; filled diamonds: near-native L2 speakers.

Figure 2. VOT (in ms) for /t/ in relation to age of onset (AO). Symbols as in Figure 1.

Figure 3. VOT (in ms) for /k/ in relation to age of onset (AO). Symbols as in Figure 1.

As can be seen in Figures 1-2, no significant age effects on VOT are found for either /p/ (r=0.20, df=38) or /t/ (r=-0.02, df=38). Only for /k/ (see Figure 3) can a weak but statistically significant correlation between AO and VOT be observed (r=-0.35, df=38, p<0.05). However, this analysis did not control for possible effects of speaking rate on VOT. In order to achieve a rate-neutralized measure of VOT, milliseconds were transformed into percentages of total word duration. Figures 4-6 again present VOT in relation to the subjects' AO, but this time with VOT measurements expressed in terms of percentages of word length.

Figures 4-6 reveal that the correlations between AO and VOT have now become significant for all three stops: for /p/ r=-0.38 (df=38, p<0.02), for /t/ r=-0.36 (df=38, p<0.02), and for /k/ r=-0.45 (df=38, p<0.01). Furthermore, the figures show that most of the nativelike candidates produce average VOT values within the range of native-speaker VOT. However, this observation is more obvious for /p/ (29 subjects) and /t/ (31 subjects) than for /k/ (22 subjects).

In order to compare early and late learners in a more systematic way, the nativelike candidates were divided into a pre-puberty (AO 1-11 years) and a post-puberty group (AO 12-19 years). Analyzing these two groups separately makes clear that a majority (17 out of 30) of the pre-puberty group produce average VOT values within the range of native-speaker VOT for all three stops. Among the 10 late learners, this is the case for only two of the subjects (AO 14 years for both). Within the group of early L2 learners, ten subjects show average VOT values within the range of native-speaker VOT for either one or two of the stops. In contrast, three pre-puberty learners (AO=7, 8 and 9 years) can be found who do not produce nativelike VOTs for any of the stops. In the group of late L2 learners, five subjects are found who produce average VOT values within the range of native-speaker VOT for /t/ or for both /p/ and /t/, but never for /k/. Finally, three post-puberty learners (AO=15, 16 and 19 years) can be observed who do not produce average VOTs within the range of native-speaker VOT for any of the stops.
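The rate normalization and the exclusion criterion described above can be illustrated with a short Python sketch; the token values are invented, not the study's measurements.

import numpy as np

vot_ms = np.array([62., 75., 58., 70., 66., 64., 72., 61., 68., 110.])
word_ms = np.array([480., 510., 450., 495., 470., 505., 460., 490., 485., 1200.])

# Exclude tokens whose word duration lies more than two standard deviations
# from the mean word length.
mean, sd = word_ms.mean(), word_ms.std(ddof=1)
keep = np.abs(word_ms - mean) <= 2 * sd

# Express the remaining VOTs as percentages of total word duration.
vot_pct = 100.0 * vot_ms[keep] / word_ms[keep]
print(f"kept {keep.sum()} of {keep.size} tokens; mean VOT = {vot_pct.mean():.1f}%")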
Figure 4. VOT (in %) for /p/ in relation to age of onset (AO). Symbols as in Figure 1.

Figure 5. VOT (in %) for /t/ in relation to age of onset (AO). Symbols as in Figure 1.

Figure 6. VOT (in %) for /k/ in relation to age of onset (AO). Symbols as in Figure 1.

Summary and conclusions

The present study has revealed that among subjects who have been selected on the criterion that native listeners perceive them as mother-tongue speakers of Swedish, there exists a weak but statistically significant correlation between AO and VOT production. In other words, these findings confirm that there is a general age effect on voiceless stop production even among apparently nativelike L2 speakers (Research Question 1).

The analysis of the post-puberty group has shown that only two (AO=14 years) out of 10 late L2 learners show average VOTs within the range of native-speaker VOT for all three stops. Furthermore, eight late learners do not produce VOTs within the range of native-speaker VOT for all three stops, and three of these subjects (AO=15, 16 and 19 years) do not exhibit nativelike VOTs for any of the Swedish stops. According to these data, only a small minority among late, apparently nativelike L2 learners exhibit actual nativelike L2 behavior regarding the production of Swedish voiceless stops (Research Question 2).

An interesting finding is that both early and late L2 learners produce more nativelike VOTs for /p/ and /t/ than for the velar stop /k/. Since Spanish VOTs of voiceless stops are overall shorter (short-lag) than Swedish VOTs (long-lag), and since velars, compared to bilabials and dentals, show the longest VOTs (e.g. Lisker & Abramson 1964), /k/ represents the most extreme, or the most "non-Spanish", articulation concerning VOT. Therefore, the Swedish velar stop probably presents the greatest difficulty for both early and late L2 learners.

Regarding the pre-puberty group, a majority of the L2 learners do produce average VOTs within the range of native-speaker VOT for all three stops. However, 13 individuals are found who do not produce nativelike VOT for all three stops, and three of these subjects (AO=7, 8 and 9 years) do not exhibit nativelike VOTs for any of the stop consonants. From these results it can be concluded that far from all early learners pass as native speakers of their L2 when analyzed in detail (Research Question 3). Thus, even though early L2 learners generally sound nativelike in everyday oral communication, an early age of onset does not automatically result in an entirely nativelike VOT production.

Finally, when measuring VOT in milliseconds, no correlations between AO and VOT could be observed, except for the velar stop /k/. However, after word duration was taken into consideration, a significant overall age effect emerged for all three stops. These findings suggest that effects of speaking rate have to be taken into account in the analysis of VOT.

Acknowledgements

This study was in part supported by The Bank of Sweden Tercentenary Foundation, grant no. 1999-0383:01.

Notes

1. A more detailed description and discussion of this study is given in Abrahamsson, Stölten & Hyltenstam (in press).
2. For further analysis the control group will be expanded by another 7 native speakers of Swedish.

References

Abrahamsson N., Stölten K. and Hyltenstam K. (in press) Effects of age on voice onset time: The production of voiceless stops by near-native L2 speakers. In Haberzettl S. (ed) Processes and Outcomes: Explaining Achievement in Language Learning. Berlin: Mouton de Gruyter.
Eimas P. D., Siqueland E. R., Jusczyk P. and Vigorito J. (1971) Speech perception in infants. Science 171, 303-306.
(1971) Speech perception in infants. Science 171, 303-306.
Flege J. E. (1991) Age of learning affects the authenticity of voice-onset time (VOT) in stop consonants produced in a second language. Journal of the Acoustical Society of America 89:1, 395-411.
Lisker L. and Abramson A. (1964) A cross-language study of voicing in initial stops: Acoustical measurements. Word 20, 384-422.
Kessinger R. H. and Blumstein S. E. (1997) Effects of speaking rate on voice-onset time in Thai, French, and English. Journal of Phonetics 25, 143-168.
Miller J. L., Green K. P., and Reeves A. (1986) Speaking rate and segments: A look at the relation between speech production and speech perception for the voicing contrast. Phonetica 43, 106-115.
Ruben R. J. (1997) A time frame of critical/sensitive periods of language development. Acta Otolaryngologica 117, 202-205.
Werker J. F. and Tees R. C. (1984) Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behaviour and Development 7, 49-63.
Zampini M. L. and Green K. P. (2001) The voicing contrast in English and Spanish: The relationship between perception and production. In: Nicol J. L. (ed) One Mind, Two Languages. Oxford: Blackwell.

Acknowledgements

This study was in part supported by The Bank of Sweden Tercentenary Foundation, grant no. 1999-0383:01.

Notes

1. A more detailed description and discussion of this study will be given in Abrahamsson, Stölten & Hyltenstam (in press).
2. For further analysis the control group will be expanded by another 7 native speakers of Swedish.

Prosodic phrasing and focus productions in Greek
Antonis Botinis, Stella Ganetsou and Magda Griva
School of Humanities and Informatics, University of Skövde and Department of Linguistics, University of Athens

Abstract

This is an experimental study of tonal correlates of prosodic phrasing and focus production in Greek. The results indicate: (1) the tonal correlates of phrasing are a rising tonal command at phrase boundaries and a deaccentuation of the preboundary lexical stress; (2) the tonal correlates of focus are a local tonal range expansion aligned with the stressed syllable of the last lexical unit in focus and a global tonal range compression, which is most evident for the speech material after focus; (3) phrasing and focus have significant interactions, according to which the phrasing tonal command is suppressed as a function of focus production in the same linguistic domain.

Introduction

This study is within a multifactor research context in linguistic structuring. We examine the relation between sound and meaning as a function of linguistic distinctions and linguistic structures in an integrated experimental framework, which is in the spirit of the ISCA Workshop "Experimental Linguistics" (see Botinis, Charalabakis, Fourakis and Gawronska, 2005). Phrasing and focus are abstract linguistic categories with distinctive functions in linguistic structuring. The basic functions of phrasing and focus are, respectively, the segmentation of continuous speech into a variety of meaningful linguistic units and the marking of variable linguistic units as more important than others. We do have basic knowledge with reference to both phrasing and focus from earlier research (e.g. Botinis, 1989; Fourakis, Botinis and Katsaiti, 1999; Botinis, Bannert and Tatham, 2000; Botinis, Ganetsou and Griva, 2004), but we do not have any knowledge with reference to phrasing and focus interactions in the same linguistic domains. In this study, we present production data, whereas perception research with reference to phrasing and focus interactions is being carried out. In the remainder of the paper, the experimental methodology is presented first, followed by the results and a concluding discussion.

Experimental methodology

One experiment was designed in order to investigate distinctive phrasing and focus structures. The speech material consists of two compound test sentences with a phrasing distinction as well as four focus distinctions. The phrasing distinction involves the attachment of a surface subject to either the subordinate or the main clause. The focus distinctions involve one neutral production as well as three productions with focus on different constituents of the test sentences. The neutral production of the test sentences had no contextual information, whereas the focus productions of the test sentences were preceded by a question which elicited focus on different constituents of the test sentences. The two test sentences were {ótan épeze bála, i maría ðjávaze arxéa} (When (he) was playing football, Maria was studying Ancient (Greek)) and {ótan épeze bála i maría, ðjávaze arxéa} (When Maria was playing football, (he) was studying Ancient (Greek)). Thus, the noun "Maria" is the subject of the subordinate and main clause in pre-comma and post-comma position respectively.
With different elicitation questions, focus was assigned in the test sentences in three different ways, i.e. on the subordinate clause, on the main clause and on the subject "Maria". Two female students of the Linguistics Department at Athens University produced the speech material in five repetitions at normal speech tempo. The speech material was recorded directly to a computer disc and analysed with the WaveSurfer software package. Three tonal measurements were taken at each syllable, i.e. at the beginning, middle and end, regardless of the segmental structure of the syllable. This methodology normalizes the tonal measurements with reference to the temporal and tonal alignments of the produced utterances.
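As an illustration of the three-point sampling just described, here is a sketch under assumed names: f0_at() stands in for any pitch tracker queried at a time point, and the syllable boundaries are invented, not taken from the study.

    # Sketch of the three-point tonal measurement: F0 is sampled at the
    # beginning, middle and end of each syllable, regardless of its
    # segmental make-up. All names and values are illustrative.
    syllables = [("o", 0.00, 0.12), ("tan", 0.12, 0.31), ("e", 0.31, 0.40)]

    def sample_syllable(f0_at, start, end):
        """Return F0 at syllable onset, midpoint and offset."""
        return (f0_at(start), f0_at((start + end) / 2.0), f0_at(end))

    def fake_f0(t):
        # Stand-in contour for demonstration only
        return 180.0 + 40.0 * t

    for label, s, e in syllables:
        print(label, [round(v, 1) for v in sample_syllable(fake_f0, s, e)])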
Results

The results of this study, in accordance with the experimental methodology described in the previous section, are presented as average values of the tonal measurements in Figure 1.

Figure 1. Average values of tonal measurements as a function of prosodic phrasing (1-2), indicated by solidus (/), and focus productions (a-d), indicated by capital letters (see text).

Figures 1a and 2a show tonal structures of the test sentences as a function of neutral productions. There is a prosodic phrasing aligned with the respective clause boundaries of the test sentences, which is a tonal command aligned with the edge of the subordinate clause. This phrasing tonal command is a tonal rise with no lexical stress alignment. On the other hand, the last lexical stress in relation to the prosodic boundary is not correlated with any distinct tonal command. Figures 1b and 2b show tonal structures of the test sentences as a function of focus production on the respective subordinate clause. No prosodic phrasing is correlated with the clause boundaries. Instead, a bidirectional tonal command is correlated with the right edge of the subordinate clause, which is a tonal rise aligned with the stressed syllable of the last word in focus followed by a tonal fall aligned with the poststressed syllable. The end of the tonal fall spreads to the right to the end of the sentence. Figures 1c and 2c show tonal structures of the test sentences as a function of focus productions on the respective main clause. The tonal structure of these productions is fairly similar to the tonal structure of the neutral productions shown in Figures 1a and 2a, i.e. a prosodic phrasing with a tonal rise aligned with the edge of the subordinate clause. Figures 1d and 2d show tonal structures of the test sentences as a function of focus production on the subject of either the subordinate or the main clause. A distinct tonal command is correlated with the clause boundaries in 1d, i.e. when the noun "Maria" is the subject of the main clause, whereas no tonal command is correlated with the clause boundaries in 2d, i.e. when the noun "Maria" is the subject of the subordinate clause. On the other hand, the focus productions in 1d and 2d have fairly similar tonal correlates, which involve a bidirectional tonal command aligned with the subject "Maria" in focus and a substantial compression of the postfocus global tonal structure.

Discussion and conclusions

In accordance with the results of the present study, some old knowledge has been corroborated and some new knowledge has been produced. The old knowledge refers to the tonal correlates of phrasing and focus, whereas the new knowledge refers to the interactions between these two prosodic categories. Phrasing and focus may each have distinct tonal correlates in speech production. Phrasing has thus a relatively local tonal effect, which defines syntactic boundaries as a function of coherence distinctions (see Botinis, Ganetsou and Griva, 2004), whereas focus has a global effect, which defines semantic weighting as a function of information structure distinctions (see Botinis, 1989). Phrasing and focus may each be applied to different linguistic domains with distinct tonal structures. However, in the same linguistic domains, phrasing tonal structures are suppressed as a function of focus applications. This is an indication that focus is a higher prosodic category with global rather than local prosodic effects in relation to phrasing. On the other hand, phrasing is itself a higher prosodic category in relation to lexical stress, which it suppresses in the domain of its immediate application. The results of the present study may have several theoretical implications. With reference to prosodic theory, prosody is organized in a hierarchical structure, according to which different linguistic levels are associated with different prosodic categories (see Botinis, 1989). Higher prosodic categories are thus associated with higher linguistic levels, in the domain of which prosodic rules operate to produce related prosodic structures. Accordingly, the prosodic correlates of lower and higher prosodic categories are relatively local and global ones respectively, which results in variable suppression of lower prosodic category correlates as a function of higher prosodic category applications.

References

Botinis A. (1989) Stress and Prosodic Structure in Greek. Lund University Press.
Botinis A., Bannert R., and Tatham M. (2000) Contrastive tonal analysis of focus perception in Greek and Swedish.
In Botinis A. (ed.), Intonation: Analysis, Modelling and Technology, 97-116. Dordrecht: Kluwer Academic Publishers.
Botinis A., Ganetsou S., Griva M., and Bizani H. (2004) Prosodic phrasing and syntactic structure in Greek. The XVIIth Swedish Phonetics Conference, 96-99, Stockholm, Sweden.
Fourakis M., Botinis A., and Katsaiti M. (1999) Acoustic characteristics of Greek vowels. Phonetica 56, 28-43.

Syntactic and tonal correlates of focus in Greek and Russian
Antonis Botinis1,2, Yannis Kostopoulos1, Olga Nikolaenkova1,3 and Charalabos Themistocleous1
1 Department of Linguistics, University of Athens, Greece
2 School of Humanities and Informatics, University of Skövde, Sweden
3 Department of Linguistics, University of Saint Petersburg, Russia

Abstract

This is an experimental study of syntactic and tonal correlates of focus in Greek and Russian. Three experiments were carried out, the results of which indicate: First, the dominant word order is SVO in both Greek and Russian.
Second, focus distinctions have inverse word order effects, according to which syntactic elements of focus elicitations are dislocated to the sentence beginning and the sentence end for Greek and Russian respectively. Third, focus has a local tonal range expansion and a global tonal range compression in both Greek and Russian.

Introduction

This study is in the spirit of the forthcoming ISCA Workshop "Experimental Linguistics" to be held in Athens, Greece, in 2006 (see Botinis et al. 2005, this volume). Three experiments were carried out, the main questions of which are: (1) which is the unmarked word order? (2) which are the word order correlates of focus production? (3) which are the tonal correlates of focus production? These questions are also related to contrastive linguistics and language typology with reference to sentence structure production in Greek and Russian.

Experimental methodology

The basic language material of the three experiments in this study consists of controlled speech situations, in which experimenters from Athens and Saint Petersburg, for Greek and Russian respectively, were asked to produce utterances with reference to pictures on the computer screen in apparent agent-action-goal semantic relations. The language material was recorded directly to a computer disc and the tonal analysis was carried out with WaveSurfer. The main objective of the first experiment was to investigate the unmarked word order of written sentence production. Lexical words corresponding to the syntactic categories subject (S), verb (V) and object (O) were copied from the basic language material and were written in a random disposition on a piece of paper. Experimenters were asked to compose and write nine full sentences with the most natural word order. The language material of this experiment consists of 1206 (9 sentences x 134 experimenters) and 657 (9 sentences x 73 experimenters) individual sentence productions for Greek and Russian respectively. The second experiment was to investigate word order of the basic language material used in the first experiment as a function of five different questions, which were designed to elicit five focus distinctions, i.e. one neutral and the remaining four with focus on subject, verb, verb phrase and object respectively. The ten sentences were organised in ten respective series and each series was in turn organised in four sets with different word order. Each set was led by a statement followed by five different questions, i.e. one question for the elicitation of each of the five focus distinctions. Experimenters were asked to fill in a form with two main options, i.e. whether the statements were accepted or non-accepted and, if accepted, for which of the five alternative questions the statement was most appropriate as an answer. The language material of this experiment consists of 3400 (10 sentences x 5 focus distinctions x 85 experimenters) and 1850 (10 sentences x 5 focus distinctions x 37 experimenters) individual word order sentence options as a function of focus distinctions for Greek and Russian respectively. The third experiment was to investigate the unmarked word order of spoken language production as a function of contextual information on the computer screen, as well as the syntactic and tonal correlates of focus distinctions elicited by different questions. Experimenters were asked to produce neutral as well as variable focus distinctions. The language material of this experiment consists of 480 (12 sentences x 4 focus distinctions x 10 experimenters) and 720 (12 sentences x 4 focus distinctions x 15 experimenters) sentence productions in Greek and Russian respectively.

Results

The results of the three experiments described in the previous section are shown in Figures 1 and 2 with reference to syntactic and tonal correlates of focus distinctions respectively. In Figures 1a and 1b, SVO is the dominant word order structure in unmarked written production in both Greek (1a) and Russian (1b), with marginal word order variability across speaker's age and gender. In Figures 1c and 1d, the neutral elicitation of spoken productions has a dominant SVO structure in Russian (1d) but not in Greek (1c). Focus elicitations have dislocation effects, according to which syntactic categories are dislocated to the sentence beginning and the sentence end for Greek and Russian respectively, and this dislocation is most evident in Russian. In Figures 1e and 1f, the neutral elicitation of written production has VSO and SVO dominant structures in Greek and Russian respectively. Focus involves inverse syntactic dislocations to the beginning of the sentence and the end of the sentence for Greek and Russian respectively. These dislocations are more evident for Russian than for Greek, and also more evident for Greek females than for Greek males.

Figure 1. Greek (left) and Russian (right) word order of basic syntactic categories as a function of speaker's age and gender: written production (a-b), focus elicitations of spoken production (c-d) and focus elicitations of written production (e-f).
Figure 2. Tonal structures of variable word order and distinctive focus productions of the sentences /o erɣátis ftiáxni ti lába/ (The worker repairs the lamp) and /máljtʃik njisjót glóbus/ (The boy carries the globe) in Greek (left) and Russian (right) respectively (capital letters indicate focus).

In Figure 2, some typical examples of tonal structures as a function of focus distinctions in Greek and Russian are presented. In both languages, the neutral productions (a and b) have a regular tonal structure, according to which stressed syllables of lexical words are as a rule associated with local tonal commands which are aligned with the respective stress group boundaries. Focus productions, on the other hand, modify the tonal structure in both Greek and Russian in three main ways. First, speech material in focus has a local tonal range expansion in relation to the corresponding local tonal range of the neutral productions. Second, speech material out of focus undergoes deaccentuation. Third, speech material out of focus undergoes major tonal compression. These three ways may operate simultaneously or in combinations in variable linguistic domains. Our results indicate that focus productions have constant tonal correlates which operate independently from syntactic correlates, although tonal and syntactic structures may function complementarily, with reciprocal reinforcement of focus structures and focus distinctions.

Discussion and conclusions

Although much research has been conducted on word order and on tonal structures separately in a variety of languages, including Greek and Russian (Botinis, 1989; Svetozarova, 1998; Yoo, 2003), little attention has been paid to interactions between word order and prosody, especially with reference to semantic impacts and focus assignments in linguistic structures. Furthermore, although several languages, such as Greek and Russian, have traditionally been described as free word order languages, in the sense that the main syntactic categories may have variable word order, the conditions and factors that trigger alternative word order structures are underexamined. The results of this study, based on the experimental methodology and the investigated language material described previously, indicate that both Greek and Russian have a dominant word order syntactic structure as well as a regular tonal structure. On the other hand, focus has a major effect on both tonal and syntactic structures in the two languages. The dominant unmarked word order structure is SVO, whereas the regular tonal structure consists of local tonal commands aligned with stressed syllables, which may have variable tonal range as a function of focus distinctions. Dislocation of the syntactic elements which bear the required information to the beginning of the sentence and the end of the sentence are syntactic correlates of focus in Greek and Russian respectively, whereas local expansions in relation to global compressions of the tonal range are tonal correlates of focus in both Greek and Russian. Focus is a complex linguistic category with a heavy functional load, according to which some linguistic units are marked as more important than other ones in communication situations.
The basic linguistic function of focus is thus the semantic weighting of variable linguistic units in relation to the information structure and contextual specifications of actual utterances. Despite prosodic variability in different languages, tonal correlates are most usually reported as prosodic correlates of focus distinctions in the majority of analysed languages (see e.g. Hirst and Di Cristo, 1998). However, although focus has both local and global tonal correlates, as has been evidenced in several studies of Greek and Russian, it is the global tonal structure that determines focus perception rather than any local tonal variability of the linguistic units in focus (Botinis, 2000).

References

Botinis A. (1989) Stress and Prosodic Structure in Greek. Lund University Press.
Botinis A., Bannert R., and Tatham M. (2000) Contrastive tonal analysis of focus perception in Greek and Swedish. In Botinis A. (ed) Intonation: Analysis, Modelling and Technology, 97-116. Dordrecht: Kluwer Academic Publishers.
Botinis A., Charalabakis Ch., Fourakis M., and Gawronska B. (2005) Athens 2006 ISCA Workshop on Experimental Linguistics (this volume).
Hirst D., and Di Cristo A. (eds) (1998) Intonation Systems. Cambridge University Press.
Svetozarova N. (1998) Intonation in Russian. In Hirst D. and Di Cristo A. (eds) Intonation Systems, 261–274. Cambridge University Press.
Yoo H-Y. (2003) Ordre des Mots et Prosodie. PhD dissertation, University of Paris 7.

Prosodic correlates of attitudinally-varied back channels in Japanese
Yasuko Nagano-Madsen1 and Takako Ayusawa2
1 Department of Oriental and African Languages, Göteborg University, Göteborg
2 Department of Japan Studies, Akita International University, Akita

Abstract

Attitudinally-varied back channel utterances were simulated by six professional voice actors for Japanese.
Contrary to the general assumption that a pitch accent language like Japanese cannot vary the tonal configuration for attitudinal variation as a stress/intonation language can, all the speakers differentiated two kinds of tonal configurations. Further variation was achieved by phrasing utterances differently in the pitch and timing dimensions, and by adding a rising or non-rising terminal contour.

Introduction

In many stress-accent languages that have traditionally been classified as intonational languages, attitudinal meaning is expressed by means of finely defined tonal contours. In contrast, pitch-accent languages such as Japanese are assumed not to be able to choose contour types for attitudinal meaning or emotion, because these languages use lexically fixed accent shapes (Mozziconacci 2000). Thus, apart from the variation in terminal contour, the dimensions where intonation can vary are pitch range and phrasing (Beckman and Pierrehumbert 1986). Contrary to traditional belief and assumptions, one of the findings of the present paper is the systematic variation in utterance-internal tonal configuration in order to express attitudinal meaning in back channels.

Back channels in Japanese

Japanese is known to be a language that uses back channels extensively, and this has been studied from various perspectives. For the phonological forms used for back channels in Japanese, Nagano-Madsen & Sugito (1999) present an extensive analysis and classification for both Tokyo and Osaka Japanese. The phonetic properties of back channels have only been partially studied, for restricted types of back channels (Sugito, Nagano-Madsen and Kitamura 2000, Katagiri, Sugito, and Nagano-Madsen 1999, 2001). This is because these studies deal with recordings of real-life communication, and the samples were therefore not systematically varied or distributed, nor were they easily analysable, due to overlap of utterances. In order to overcome the difficulties mentioned above and to balance the phonetic data on back channels in Japanese, we present a study of another kind: well controlled simulated utterances recorded in a good acoustic environment. The kinds of back channels presented in this study are of the 'unrepeatable back channel' type, following Nagano-Madsen and Sugito's classification based on phonological form. Unrepeatable back channels look more like a proper utterance, whereas repeatable back channels are of the /so:so:/, /haihai/ 'yes, yes' type. The first back channel dealt with in the present study is /a-soo-desu-ka/ 'Is that so? I see..', which was the second most common back channel after /N:/ 'yes' (Nagano-Madsen and Sugito 1999). Phonologically, it contains the H*L accent in /soo/. The second type, /yamada-san-desu-ka/ 'Is it Mr Yamada? / I see, it is Mr Yamada…', is classified as an echo back channel, where a keyword in the previous utterance is repeated as a back channel (in this case Mr Yamada). This type of back channel shows a deeper concern from the listener and is frequently used where a stream of conversation becomes lively, with quick turn taking (Sugito, Nagano-Madsen and Kitamura 1999). Phonologically, it contains the unaccented word /yamada/. In addition, /are-desu-ka/ 'It is that? It is that…', which is similar to /yamadasandesuka/ but shorter, is also included.

Material

Our speech material consists of high quality recordings of 6 professional voice actors (3 males and 3 females) who were in their 30s or 40s at the time of recording. Each of them produced 3 back channel utterances with neutral (NEU), joyful (JOY), disappointed (DIS), and suspicious (SUS) attitudes. In addition, the same utterance was produced as a question (Q) rather than a back channel. The second author judged the appropriateness of each attitude type at the time of recording, and the best sample for each attitude for each utterance was used for analysis. The recorded material was part of a self-learning CD on Japanese accent and intonation for learners of Japanese (Ayusawa 2001). The 'SUGI Speech Analyser' software installed on a PC was used for the acoustic analyses. The three back channel utterances are as follows: (1) /a-soo-desu-ka/ 'Is that so? I see …' (2) /yamada-san-desu-ka/ 'It is Mr Yamada? / It is Mr Yamada...' (3) /are-desu-ka/ 'It is that? It is that..'

Results

Auditory and acoustic analyses revealed that the speakers modified several parameters in order to produce attitudinally-varied back channels in Japanese. These included variations in tempo, tonal configuration, pitch range, vowel quality, voice quality, and clarity of articulation. Of these, the most notable systematicity was observed for tonal configuration and phrasing in the pitch and time dimensions. The rest of the paper will focus on these aspects.

Tonal configuration

Contrary to the general assumption that tonal configuration cannot vary in a pitch accent language like Japanese, all six speakers were found to use two tonal configurations. These contours are further differentiated in phrasing in the pitch and time dimensions when expressing various attitudes. For the utterance /asoodesuka/ 'is that so? / I see', where /soo/ is associated with the lexical H*L accent, it is interesting to note that the expected F0 fall was largely missing. Only on one occasion (speaker Z for NEU) was there a very slight F0 fall; all other cases were produced with either a level H or a rising LH contour. Maekawa (2004), whose data included basically the same utterance /soodesuka/ (without the initial interjection /a/) in his study of paralinguistic information in Japanese, noted this change from the H*L to the LH pattern in some of the utterances. Table 1 shows the distribution of the two contour types, H or LH, for the six speakers for the varying attitudes as well as for a question Q.

Table 1. The choice of tone for different attitudes by the six speakers for /soo/ H*L in /asoodesuka/. U, V, W are female speakers.

          U     V     W     X     Y     Z
    NEU   H     H     H     H     H     H(L)
    Q     H     H     H     H     H     H
    JOY   LH    LH    LH    H     H     LH
    SUS   LH    LH    LH    LH    LH    LH
    DIS   LH    LH    H     H     LH    H

It can be seen that NEU, including Q, is expressed predominantly by an H tone, and SUS by an LH tone. The choice of contour was speaker dependent for JOY and DIS. SUS and Q always have a rising terminal contour, while the others have a falling or a level contour. There was further variation in the tonal configuration with regard to the particular mora on which the F0 peak was reached.
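To make the majority patterns in Table 1 explicit, here is a small tally; the data are copied from the table, but the script itself is only an illustration, not part of the study.

    # Tone choices per attitude for speakers U-Z (from Table 1), with the
    # majority tone per attitude; 'H(L)' is counted as its own category.
    tones = {
        "NEU": ["H", "H", "H", "H", "H", "H(L)"],
        "Q":   ["H", "H", "H", "H", "H", "H"],
        "JOY": ["LH", "LH", "LH", "H", "H", "LH"],
        "SUS": ["LH", "LH", "LH", "LH", "LH", "LH"],
        "DIS": ["LH", "LH", "H", "H", "LH", "H"],
    }
    for attitude, choices in tones.items():
        majority = max(set(choices), key=choices.count)
        print(attitude, majority, f"{choices.count(majority)}/6")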
Figures 1a and 1b show the two kinds of tonal configuration produced by speaker U. She used a level H tone for NEU and DIS, and a rising (LH) tone for JOY and SUS. Note that for each pair, basically the same shape of tone is placed on a different time scale (NEU and DIS) or at a different pitch range (JOY and SUS). DIS and SUS are further differentiated by a falling vs. rising terminal contour.

Figures 1a,b. The H contour for NEU and DIS and the LH contour for JOY and SUS (speaker U).

The utterance /yamadasandesuka/ contains the unaccented word /yamada/ and starts with a phrasal L% followed by the H*L accent on /de/ of the copula /desu/. As in the previous case, this utterance also had two kinds of tonal configuration that were further differentiated in phrasing and by terminal contour to express attitudinal meaning. Figures 2a and 2b show F0 contours for /yamadasandesuka/ produced by speaker U. The tonal configuration used for NEU (including Q) and JOY has a steeper initial F0 rise than that used for SUS and DIS. The former type of contour has a plateau phase.
In contrast, the latter type of contour has a slow and gradual F0 rise that reaches its peak only just before the accentual fall. It does not contain any plateau phase. In Figure 2 (above), the F0 contours are placed at three equidistant pitch range intervals, Q being differentiated by a rising terminal contour. Likewise, the contours for DIS and SUS differ both in pitch range and in terminal contour.

Figures 2a,b. F0 contours for JOY, Q, and NEU (above) and SUS and DIS (below) for speaker U.

Note that the difference in the way F0 rises initially, as described above, does not affect the phonological structure of the utterance, since /yamadasan/ is an unaccented phrase. It is the phrasal feature that varies, i.e. the H- phrasal tone, which is expected to be on the second mora, is greatly delayed. Interestingly enough, not one out of the 30 tokens produced by the six speakers had the accentual F0 peak on /ma/. Instead, it was either on the third mora /da/ (10 tokens), the fourth mora /sa/ (7 tokens) or on /de/ (13 tokens). None of the tokens for DIS and SUS had the F0 peak on /da/. All the foregoing observations indicate that the attitudinal meaning of /yamadasandesuka/ is clearly conveyed by systematic variation in tonal configuration through different phonetic implementations of the accentual-phrasal tones %L and H- (cf. X-JToBI, Maekawa et al. 2004).

Phrasing

There was fairly good agreement among the speakers as to how the two tonal configurations were phrased in the pitch and time dimensions. With one exception, the attitude JOY always had the highest F0 peak, while the lowest peak was typically found for DIS. SUS and DIS were spoken more slowly than the other types of attitudes and therefore had a longer duration. Figure 3 shows a typical example of the phrasing of the four attitudes by speaker Y.

Figure 3. A typical phrasing of F0 contours for NEU, JOY, DIS, and SUS (speaker Y).

Pitch range and duration

Figures 4 and 5 show the peak F0 values and total durations for /asoodesuka/ spoken by the six speakers with varying attitudes.

Figure 4. Peak F0 value for /asoodesuka/ produced by six speakers.

Figure 5. Total utterance duration for /asoodesuka/ produced by six speakers.
Discussion

Some of the findings reported here agree with those reported in Maekawa's (2004) study on paralinguistic information in Japanese. The present study revealed more systematic details in the way tonal configuration was varied in conveying attitudinally varied utterances. Although the choice of tonal configuration is limited to two basic types, Japanese speakers were found to vary phrase-internal F0 contours systematically in order to express attitudinally-varied back channels. The test material contained both accented (H*L) and unaccented words. In the case of the accented word /soo/ in /asoodesuka/, the lexical accent was altered to either H or LH, the former being typically used for NEU. In unaccented words, variation in tonal configuration was achieved by modifying the rate of the initial F0 rise. What is consistent in both cases is that in the contours used for the unmarked attitude NEU, F0 reaches its peak earlier than in marked attitudes such as SUS. According to X-JToBI (Maekawa et al. 2004), the ToBI labelling scheme for Japanese, 'H-' is introduced to indicate the onset of an F0 plateau. It can thus be analysed that both accented and unaccented words are expressed with a delayed H- for marked attitudes. Exactly on which mora the initial and final F0 maxima are placed varies between speakers. It should also be noted that the general assumption that the phrasal H- is on the second mora was not attested in the present data, even for utterances of the NEU and Q type. This phenomenon needs further investigation. Apart from the phrase-internal tonal variations described above, there are differences in phrasing and terminal contours. These prosodic characteristics are further modified by vowel quality, voice quality, and clarity of articulation. Identification of the more important parameters for conveying attitudinally-differentiated back channels in Japanese will require perceptual experiments.

References

Ayusawa T. (editorial supervision) (2001) Accent and intonation in Tokyo Japanese. CALL sub-teaching material series, Japanese prosody Vol. 1. National Institute of Multimedia Education.
Beckman M. E. and Pierrehumbert J. B. (1986) Japanese prosodic phrasing and intonation synthesis. Proceedings of the 24th Meeting of the Association for Computational Linguistics.
Katagiri Y., Sugito M., and Nagano-Madsen Y. (1999) The forms and prosody of back channels in Tokyo and Osaka Japanese. Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco (CD).
Katagiri Y., Sugito M., and Nagano-Madsen Y. (2001) An analysis of forms and prosodic characteristics of Japanese 'Aiduti' in dialogue (in Japanese). In Bunpoo to Onsei (=Speech and Grammar) III, 263-274. Tokyo: Kuroshio Publication.
Maekawa K. (2004) Production and perception of 'paralinguistic' information. Proceedings of Speech Prosody 2004, Nara, 367-374.
Maekawa K., Igarashi Y., Kikuchi E., and Yoneyama S. (2004) Intonation labelling for the 'Spoken Japanese Corpus', version 1.0 (in Japanese). Electronic document for The Corpus of Spontaneous Japanese.
Mozziconacci S. (2000) Prosody and emotions. In Proceedings of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research.
Nagano-Madsen Y. and Sugito M. (1999) Analysis of back channel items in Tokyo and Osaka Japanese (in Japanese). In Japanese Linguistics 5, 26-45. Tokyo: National Language Research Institute.
Sugito M., Nagano-Madsen Y., and Kitamura M. (1999) The pattern and timing of the repeat-back channels in natural dialogue in Japanese (in Japanese). In Bunpoo to Onsei (=Speech and Grammar) II, 3-18. Tokyo: Kuroshio Publication.

Prosodic Features in the Perception of Clarification Ellipses
Jens Edlund, David House, and Gabriel Skantze
Centre for Speech Technology, KTH, Sweden

Abstract

We present an experiment where subjects were asked to listen to Swedish human-computer dialogue fragments where a synthetic voice makes an elliptical clarification after a user turn. The prosodic features of the synthetic voice were systematically varied, and subjects were asked to judge the computer's actual intention. The results show that an early low F0 peak signals acceptance, that a late high peak is perceived as a request for clarification of what was said, and that a mid high peak is perceived as a request for clarification of the meaning of what was said. The study can be seen as the beginnings of a tentative model for the intonation of clarification ellipses in Swedish, which can be implemented and tested in spoken dialogue systems.

Introduction

Detection of and recovery from errors is important for spoken dialogue systems. To this effect, system hypotheses are often verified explicitly or implicitly: the system makes a clarification request or repeats what it has heard. These error handling techniques are often perceived as tedious, and one of the reasons for this is that they are often constructed as full propositions, verifying the complete user utterance. In contrast, humans often use short elliptical constructions for clarification – Purver et al. (2001) show that 45% of the clarification requests in the British National Corpus (BNC) are elliptical. A dialogue system using word-level confidence scores could use elliptical clarifications to focus on problematic fragments, making the dialogue more efficient (Gabsdil, 2003).
However, the interpretation of ellipses is often dependent both on context and on prosody, and the prosody of clarification requests has not been greatly studied. We present an experiment in which subjects were asked to listen to Swedish dialogue fragments where the computer makes elliptical clarifications after user turns, and to judge what was actually intended by the computer. The study is connected to the HIGGINS spoken dialogue system (Edlund et al., 2004). The primary domain of HIGGINS is pedestrian navigation, as seen in Table 1.

Table 1. Example scenario in the HIGGINS domain (translated from Swedish)
User:   I want to go to an ATM.
System: OK, where are you?
User:   I'm standing between an orange building and a brick building.

Clarification ellipsis could be very useful in this domain. Table 2 shows the scenario that is used in the experiment presented in this paper.

Table 2. Example use of clarification ellipsis (translated from Swedish)
User:   […] on the right I see a red building.
System: Red (?)

Clarification

Clarification is part of a process called grounding (Clark, 1996) or interactive communication management (Allwood et al., 1992). In this process, speakers give positive and negative evidence or feedback of their understanding of what the interlocutor says. A clarification may often give both positive and negative evidence – showing what has been understood as well as what is needed for complete understanding. Clarification requests may have both different forms and different readings (i.e. functions). In a study of the BNC, Purver et al. (2001) studied the form and function of clarification requests. According to their scheme, the form of clarification ellipses studied in this paper, as exemplified in Table 2, is called reprise fragments. We will use a distinction made by both Clark (1996) and Allwood et al. (1992) in order to classify the possible readings of reprise fragments. They suggest four levels of action that take place when speaker S is trying to say something to hearer H:
• Acceptance: H accepts what S says.
• Understanding: H understands what S means.
• Perception: H hears what S says.
• Contact: H hears that S speaks.
For successful communication to take place, communication must succeed on all of these levels. The order of the levels is important; to succeed on one level, all the levels below it must be completed. Also, if positive evidence is given on one level, all the levels below it are presumed to have succeeded. When making a clarification request, the speaker is signaling failure or uncertainty on one level and success on the levels below it. Other classifications of clarification readings have been made. In Schlangen (2004) a more fine-grained analysis of the understanding level is given. In Ginzburg & Cooper (2001), a distinction is made between what is called the "clausal reading" and the "constituent reading" of clarification ellipsis. Using the scheme above, the clausal reading could be described as a signal of positive contact and negative perception, and the constituent reading as a signal of positive perception and negative understanding. According to the scheme given above, the reprise fragment in Table 2 may have three different readings:
• Ok, red. (No clarification request; positive on all levels)
• Do you really mean red? What do you mean by red? (positive perception, negative/uncertain understanding)
• Did you say red? (positive contact, uncertain perception)
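As an illustrative way of encoding the level scheme and the three readings just listed (the encoding is a sketch for exposition, not anything proposed in the paper):

    # The four grounding levels, ordered from lowest to highest, and the
    # rule that positive evidence on one level presumes the levels below it.
    LEVELS = ["contact", "perception", "understanding", "acceptance"]

    # Each reading signals success up to one level and doubt at the next.
    READINGS = {
        "ACCEPT": ("acceptance", None),   # positive on all levels
        "CLARIFY-UNDERSTANDING": ("perception", "understanding"),
        "CLARIFY-PERCEPTION": ("contact", "perception"),
    }

    def presumed_ok(positive_level):
        """Levels presumed successful given positive evidence on one level."""
        return LEVELS[: LEVELS.index(positive_level) + 1]

    print(presumed_ok("perception"))  # ['contact', 'perception']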
The reading "positive understanding, negative acceptance" has not been included here. The reason for this is that it is hard to find examples, applicable to spoken dialogue systems, where reprise fragments may have such a reading.

Prosody

In spite of the fact that considerable research has been devoted to the study of question intonation, the use of different types of interrogative intonation patterns has not been routinely represented in spoken dialogue systems. Not only does question intonation vary in different languages, but different types of questions (e.g. wh and yes/no) can also result in different kinds of question intonation (Ladd, 1996). In very general terms, the most commonly described tonal characteristic for questions is high final pitch and overall higher pitch. In many languages, yes/no questions are reported to have a final rise, while wh-questions typically are associated with a final low. Wh-questions can, moreover, often be associated with a large number of various contours. Bolinger (1989), for example, presents various contours and combinations of contours which he relates to different meanings in wh-questions in English. One of the meanings most relevant to the present study is what he terms the "reclamatory" question. This is often a wh-question in which the listener has not quite understood the utterance and asks for a repetition or an elaboration. This corresponds to the paraphrase "What did you mean by red?" In Swedish, question intonation has primarily been described as marked by a raised topline and a widened F0 range on the focal accent (Gårding, 1998). In recent perception studies, however, House (2003) demonstrated that a raised fundamental frequency (F0) combined with a rightwards focal peak displacement is an effective means of signaling question intonation in Swedish echo questions (declarative word order) when the focal accent is in final position. In a study of a corpus of German task-oriented human-human dialogue, Rodriguez & Schlangen (2004) found that the use of intonation seemed to disambiguate clarification types, with rising boundary tones used more often to clarify acoustic problems than to clarify reference resolution.

Method

Three test words comprising the three colors blue, red and yellow (blå, röd, gul) were synthesized using an experimental version of the LUKAS diphone Swedish male MBROLA voice (Filipsson & Bruce, 1997), implemented as a plug-in to the WaveSurfer speech tool (Sjölander & Beskow, 2000). For each of the three test words the following prosodic parameters were manipulated: 1) peak POSITION, 2) peak HEIGHT, and 3) vowel DURATION. Three peak positions were obtained by time-shifting the focal accent peaks in intervals of 100 ms, comprising early, mid and late peaks. A low-peak and a high-peak set of stimuli were obtained by setting the accent peak at 130 Hz and 160 Hz respectively. Two sets of stimulus durations (normal and long) were obtained by lengthening the default vowel length by 100 ms. All combinations of the three test words and the three parameters gave a total of 36 different stimuli. Six additional stimuli, making a total of 42, were created by using both the early and late peaks in the long-duration stimuli, which created double-peaked stimuli.
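The stimulus inventory follows directly from this design; here is a small sketch that enumerates it. The parameter names and values follow the text, but the script itself is only an illustration, not the authors' synthesis procedure.

    # All combinations of test word, peak POSITION, peak HEIGHT and vowel
    # DURATION (36 stimuli), plus the six double-peak items built from the
    # early and late peaks in the long-duration set.
    from itertools import product

    words = ["blå", "röd", "gul"]
    positions = ["early", "mid", "late"]
    heights = [130, 160]                  # accent peak in Hz
    durations = ["normal", "long"]        # long = default vowel + 100 ms

    single = list(product(words, positions, heights, durations))
    double = list(product(words, [("early", "late")], heights, ["long"]))
    print(len(single), len(double), len(single) + len(double))  # 36 6 42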
The stimuli are presented schematically for the word “yellow” in Figure 1.The first turn of the dialogue fragment in Table 2 was recorded for each color word and concatenated with the synthesized test words, resulting in 42 different dialogue fragments similar to the one in Table 2. Table 3: Interpretations that were significantly overrepresented, given the values of the parameters POSITION and HEIGHT, and their interactions. The standardized residuals from the χ2-test are also shown. POSITION Early Mid Late HEIGHT High Low POSITION* HEIGHT Early*Low Mid*Low Mid*High Late*High Interpretation ACCEPT CLARIFYUNDERSTANDING CLARIFYPERCEPTION Interpretation CLARIFYUNDERSTANDING ACCEPT Interpretation Std. resid. 3.1 4.6 3.6 Std. resid. 3.2 4.0 Std. resid. ACCEPT ACCEPT CLARIFYUNDERSTANDING CLARIFYPERCEPTION 3.4 3.4 5.6 4.4 ACCEPT CLARIFYUNDERSTANDING Number of votes CLARIFYPERCEPTION 40 30 20 10 0 early mid HIGH late early mid late LOW Figure 2: The distribution of votes for the three interpretations as a function of position: where HEIGHT is “high” on the left, and “low” on the right. The circles mark distributions that are significantly overrepresented. Figure 1. Stylized representations of the stimuli “gul” (“yellow”), showing the F0 peak position. The top panel shows normal duration, the bottom lengthened duration. The subjects were 8 Swedish speakers, all of which have some experience of speech technology, although none of them are involved in this research. The subjects were told that they would listen to 42 similar dialogue fragments containing a user utterance and a system utterance each, and that their task was to judge the meaning of the system utterance by choosing one of three alternatives. They were informed that they could only listen to each stimulus once. During the experiment, the subjects were played each of the 42 stimuli once, in random order. After each stimulus, they chose a paraphrase for the system utterance. The different paraphrases were (X stands for a color): • ACCEPT: Ok, X • CLARIFYUNDERSTANDING: Do you really mean X? • CLARIFYPERCEPTION: Did you say X? Results There were no significant differences in the distribution of votes between the different colors (“red”, “blue”, and “yellow”) (χ2=3.65, dF=4, 109 p>0.05), nor were there any significant differences for any of the eight subjects (χ2=19.00, dF=14, p>0.05). Neither had the DURATION parameter any significant effect on the distribution of votes (χ2=5.72, dF=2, p>0.05). Both POSITION and HEIGHT had significant effects on the distribution of votes, which is shown in Table 3 (χ2=70.22, dF=4, p<0.001 resp. χ2=59.40, dF=2, p<0.001). The interaction of the parameters POSITION and HEIGHT also gave rise to significant effects (χ2=121.12, dF=10, p<0.001), as shown in the bottom of Table 3. Figure 2 shows the distribution of votes for the three interpretations as a function of position for both high and low HEIGHT. Results from the doublepeak stimuli were generally more complex and are not presented here. Discussion The most interesting result in this experiment from both a spoken dialogue system perspective and a prosody modeling framework concerns the strong relationship between intonational form and meaning. 
Discussion

The most interesting result in this experiment, from both a spoken dialogue system perspective and a prosody modeling framework, concerns the strong relationship between intonational form and meaning. For these single-word utterances used as clarification ellipses, the general division between statement (early, low peak) and question (late, high peak) is consistent with the results obtained for Swedish echo questions in (House, 2003) and for German clarification requests in (Rodriguez & Schlangen, 2004). However, the further clear division between the interrogative categories CLARIFYUNDERSTANDING and CLARIFYPERCEPTION is especially noteworthy. This division is related to the timing of the high peak. The high peak is a prerequisite for perceived interrogative intonation in this study, and when the peak is late, resulting in a final rise in the vowel, the pattern signals CLARIFYPERCEPTION. This can also be seen as a yes/no question and is consistent with the observation that yes/no questions generally more often have final rising intonation than other types of questions. The high peak in mid position is also perceived as interrogative, but in this case it is the category CLARIFYUNDERSTANDING which dominates, as is clearly seen in the left panel of Figure 2. This category can also be seen as a type of wh-question similar to the "reclamatory" question discussed in (Bolinger, 1989).

Conclusions and future work

The results of this preliminary study can be seen in terms of a tentative model for the intonation of clarification ellipses in Swedish. A low early peak would function as an ACCEPT statement, a mid high peak as a CLARIFYUNDERSTANDING question, and a late high peak as a CLARIFYPERCEPTION question. This would hold for single-syllable accent I words. Accent II words may be more complex. We intend to test this model and extend this research in two ways. By implementing these prototypical patterns in the HIGGINS dialogue system, we will study the responses of actual users to the different prototypes. We also plan to study these types of clarification ellipses in a database of Swedish human-human dialogue.

Acknowledgements

This research was carried out at the Centre for Speech Technology, a competence centre at KTH, supported by VINNOVA (The Swedish Agency for Innovation Systems), KTH and participating Swedish companies and organizations, and was also supported by the EU project CHIL (IP506909).

References

Allwood, J., Nivre, J., & Ahlsén, E. (1992). On the semantics and pragmatics of linguistic feedback. Journal of Semantics, 9, 1-26.
Bolinger, D. (1989). Intonation and its uses: Melody in grammar and discourse. London: Edward Arnold.
Clark, H. H. (1996). Using language. Cambridge: Cambridge University Press.
Edlund, J., Skantze, G., & Carlson, R. (2004). Higgins – a spoken dialogue system for investigating error handling techniques. In Proceedings of ICSLP, 229-231.
Filipsson, M. & Bruce, G. (1997). LUKAS – a preliminary report on a new Swedish speech synthesis. Working Papers 46, Department of Linguistics and Phonetics, Lund University.
Gabsdil, M. (2003). Clarification in spoken dialogue systems. In Proceedings of the AAAI spring symposium on natural language generation in spoken and written dialogue.
Gårding, E. (1998). Intonation in Swedish. In D. Hirst and A. Di Cristo (eds.) Intonation Systems. Cambridge: Cambridge University Press, 112-130.
Ginzburg, J. & Cooper, R. (2001). Resolving ellipsis in clarification. In Proceedings of the 39th meeting of the ACL.
House, D. (2003). Perceiving question intonation: the role of pre-focal pause and delayed focal peak. In Proc 15th ICPhS, Barcelona, 755-758.
Ladd, D. R.
(1996). Intonational phonology. Cambridge: Cambridge University Press.
Purver, M., Ginzburg, J., & Healey, P. (2001). On the means for clarification in dialogue. In Proceedings of SIGDial.
Rodriguez, K. J. & Schlangen, D. (2004). Form, intonation and function of clarification requests in German task-oriented spoken dialogues. In Proceedings of Catalog '04 (The 8th Workshop on the Semantics and Pragmatics of Dialogue, SemDial04), Barcelona, Spain.
Schlangen, D. (2004). Causes and strategies for requesting clarification in dialogue. In Proceedings of SIGDial.
Sjölander, K. & Beskow, J. (2000). WaveSurfer – a public domain speech tool. In Proceedings of ICSLP 2000, 4, 464-467, Beijing, China.

Perceived prominence and scale types
Christian Jensen1 and John Tøndering2
1 Department of English, Copenhagen Business School, Denmark
2 Institute of Nordic Studies and Linguistics, Dept. of Linguistics, University of Copenhagen, Denmark

Abstract

Three different scales which have been used to measure perceived prominence are evaluated in a perceptual experiment. Average scores of raters using a multi-level (31-point) scale, a simple binary (2-point) scale and an intermediate 4-point scale are almost identical. The potentially finer gradation possible with the multi-level scale(s) is compensated for by having multiple listeners, which is also a requirement for obtaining reliable data. In other words, a high number of levels is neither a sufficient nor a necessary requirement. Overall, the best results were obtained using the 4-point scale, and there seems to be little justification for using a 31-point scale.

Introduction

The purpose of this paper is to evaluate the use of different scales for measuring the perceived prominence of syllables and words. In this investigation only word-level prominence is considered. Prominence, as perceived by raters, has been measured on different types of scale: some use a 31-point scale from 0 to 30, first described in Fant & Kruckenberg (1989). The strength of this scale is that it allows for very fine gradation of the perceived prominence, even for a single rater, but this also makes the task quite difficult. Others, e.g. Wightman (1993), have proposed to use instead a simple binary (2-point) scale (0 or 1) and use the cumulative (or average) score of each word as an expression of its level of prominence, which results in a much simpler task for the raters. The disadvantage of this simple scale is that it may force raters to conflate items which they perceive as "different, but within the same category", which could lead to a reduced or lost ability to distinguish variations in perceived prominence at either end of the prominence continuum, for example accented words with or without special emphasis. In addition, the level of gradation you achieve with this scale is directly proportional to the number of raters: to get the same gradation as is (potentially) possible with the scale from 0 to 30 you need 30 raters. As a possible compromise between these two scales one could use a 4-point scale (e.g. from 0 to 3). While this scale is much simpler than the 31-point scale, it still allows raters to make some gradation in their prominence evaluations. We investigated the three prominence scales outlined above with the purpose of answering two overall questions: does the choice of scale influence the results with regard to 1) the perceived prominence relations of words in utterances, and 2) the ability to make observations about statistically significant differences between words. These questions were addressed from the point of view of three relevant linguistic parameters which are known to be associated with perceived prominence: part of speech membership, information structure and correlation with F0.

Method

The speech material chosen to evaluate the scales was two short monologues from the Danish DanPASS project (http://www.cphling.dk/pers/ng/danpass.htm), both recordings of a map task activity. The two monologues, by two different male speakers, included a total of 123 words. The monologues were divided into shorter phrases which were presented via a web page (one phrase per page). The raters could hear the phrase as many times as they wanted by pressing a "play" button, and indicated their judgment by clicking the appropriate scale point. Time consumption and a count of sound file playbacks were recorded for each phrase. A large group of raters participated in the experiment and were randomly assigned to a specific scale. Equally sized groups of 19 raters (the size of the smallest group) were selected for the analyses. The instructions to the raters were presented from the web page and were identical for all three groups, except for the details about the specific scale. The concept of prominence was explained and exemplified, and raters were advised that prominence might be a question of "more or less". 0 represented no prominence, but no other scale points were defined. Prominent words could be assigned values up to the scale maximum. Raters using the 2-point scale were informed that they could not grade their ratings but were given a forced choice.
Time consumption and have proposed to use instead a simple binary (2a count of sound file playbacks were recorded point) scale (0 or 1) and use the cumulative (or for each phrase. average) score of each word as an expression of its level of prominence, which results in much A large group of raters participated in the exsimpler task for the raters. The disadvantage of periment and were randomly assigned to a spethis simple scale is that it may force raters to concific scale. Equally sized groups of 19 raters flate items which they perceive as “different, but (the size of the smallest group) were selected for within the same category”, which could lead to a the analyses. The instructions to the raters were reduced or lost ability to distinguish variations in presented from the web page and were identical perceved prominence at either end of the promifor all three groups, except for the details about nence continuum. For example accented words the specific scale. The concept of prominence with or without special emphasis. In addition, was explained and exemplified, and raters were the level of gradation you achieve with this scale advised that prominence might be a question of is directly proportional to the number of raters: “more or less”. 0 represented no prominence, but to get the same gradation as is (potentially) posno other scale points were defined. Prominent sible with the scale from 0 to 31 you need 30 words could be assigned values up to the scale raters. As a possible compromise between these maximum. Raters using the 2-point scale were two scales one could use a 4-point scale (e.g. informed that they could not grade their ratings from 0 to 3). While this scale is much simpler but were given a forced choice. 111 Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University Results Reliability Note: the phrase “the 2/4/31-point scale” is used in the following as shorthand expressions of “the prominence ratings obtained from the group of listeners using the 2/4/31-point scale”. The reliability of the data was tested by calculating Cronbach’s α coefficient, which expresses the extent to which the scores of the individual raters covary. The coefficients for all three groups are high (from 0.94 to 0.96) and the difference between them is nonsignificant (M = 1.02, p > 0.05). Comparison of prominence ratings The first question to be addressed is whether the prominence ratings on the three scales express the same relations between words. In order to be able to make direct comparisons all scores were normalised by dividing each value with the scale maximum (1, 3 or 30, respectively), which fits all data to a normalised scale of 0 to 1 without affecting the relations between scores. These values were then plotted on a line chart for simple visual inspection. An example diagram of one phrase is shown in Fig. 1. Perceived prominence 1 • 2-pt. = • 4-pt. = 31-pt. = 0.75 • 0.5 • 0.25 0 • til till • du kommer you come • • til to det the næste kryds next intersection Figure 1: Prominence of selected phrase – all scales The diagrams showed a high level of agreement across the three scales, which was further tested in a correlation analysis (Spearman’s ρ). The result can be seen in Table 1. Table 1: Correlation coefficients (Spearman’s ρ) across all three scales Correlation 2-pt 4-pt 4-pt 31-pt 0.933 0.926 — 0.964 The correlation coefficients were high for each scale pair and quite similar, with the best correlation apparently between the 4-point scale 112 and the 31-point scale. 
Comparison of prominence ratings
The first question to be addressed is whether the prominence ratings on the three scales express the same relations between words. In order to be able to make direct comparisons, all scores were normalised by dividing each value by the scale maximum (1, 3 or 30, respectively), which fits all data to a normalised scale of 0 to 1 without affecting the relations between scores. These values were then plotted on a line chart for simple visual inspection. An example diagram of one phrase is shown in Fig. 1.

Figure 1. Prominence of a selected phrase ("til du kommer til det næste kryds" – 'till you come to the next intersection') on all three scales. (Line chart not reproduced.)

The diagrams showed a high level of agreement across the three scales, which was further tested in a correlation analysis (Spearman's ρ). The result can be seen in Table 1.

Table 1. Correlation coefficients (Spearman's ρ) across all three scales.

  Correlation   2-pt     4-pt
  4-pt          0.933    –
  31-pt         0.926    0.964

The correlation coefficients were high for each scale pair and quite similar, with the best correlation apparently between the 4-point scale and the 31-point scale. The preliminary conclusion is clear: raters arrive at approximately the same rank order of perceived prominence regardless of the scale used.

It appears from Fig. 1 that the 2-point scale displays somewhat larger variation in values between the scale minimum and maximum than the 4-point scale and especially the 31-point scale. This was in fact a general trend, demonstrating a certain compression of values on the 31-point scale (and to a lesser degree the 4-point scale), while the 2-point scale has more mean values near the scale extremes. Analyses of the distribution of scores (inter-quartile range for each rater and visual inspection of x-y plots) showed that many raters on the 31-point scale assigned most ratings to a restricted – sometimes very restricted – range of the scale, either at the lower, the middle or the higher end. There are therefore no mean values at the scale extremes, although there were many individual scores near the minimum and maximum values.
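The normalisation and correlation analysis above can be sketched as follows; the rating arrays are hypothetical (raters by words), not the study's data.

```python
# Sketch of the normalisation (divide by scale maximum) and the Spearman
# correlation between mean ratings on two scales; data are invented.
import numpy as np
from scipy.stats import spearmanr

def normalised_means(ratings, scale_max):
    """Mean rating per word, scaled to the 0..1 range."""
    return ratings.mean(axis=0) / scale_max

rng = np.random.default_rng(1)
r2 = rng.integers(0, 2, size=(19, 123))     # binary scale (0 or 1)
r31 = rng.integers(0, 31, size=(19, 123))   # 31-point scale (0-30)

m2, m31 = normalised_means(r2, 1), normalised_means(r31, 30)
rho, p = spearmanr(m2, m31)                 # cf. the coefficients in Table 1
print(rho, p)
```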
Obtaining significant differences
One very important aspect of choosing a scale is whether it will affect the ability to obtain statistically significant differences between test items. The hypothesis might be that scales with too few points (most notably the 2-point scale) would mask subtle perceptual differences which could be brought out with more scale points. The suitability of the three scales for quantitative analysis was tested by examining the association between perceived prominence and three linguistic phenomena: part of speech membership, information structure and a specific acoustic correlate, namely F0. The purpose was to see if the data obtained using the three different scales would lead to different conclusions about linguistic structure.

Comment on the statistical procedures
Since it is not possible to compare results directly across scale types, we simply decided to use the statistical procedures which were felt to be most appropriate for each individual scale. This mirrors the choice which researchers would be forced to make once they have settled on a scale type. For all scales we have decided to use nonparametric methods. For significance testing on the 2-point scale we use the Fisher exact test or a chi-square test with correction for continuity (when n > 40), and for the other two scales we use the Wilcoxon-Mann-Whitney test with correction for ties (WMW).

Parts of speech
The mean prominence ratings of nine parts of speech are listed in Table 2, ordered according to their ranking on each scale.

Table 2. Prominence ratings and parts of speech: ranked normalised mean ratings on each scale. A left brace "{" marks a class whose difference from the class ranked immediately above is not significant. Nonadjacent, nonsignificant differences on the 31-pt scale: adv-v, art-prep. (n per class: adjectives 9, nouns 28, interjections 3, adverbs 12, verbs 13, pronouns 16, conjunctions 10, articles 2, prepositions 30.)

  Rank   2-point        4-point        31-point
  1      adj    0.92    adj    0.73    adj    0.67
  2      n      0.78    n      0.66    n      0.63
  3      int    0.60    int    0.50    { int  0.58
  4      { adv  0.58    adv    0.38    adv    0.40
  5      v      0.34    v      0.30    pron   0.35
  6      { pron 0.33    { pron 0.30    { v    0.35
  7      conj   0.17    prep   0.21    prep   0.28
  8      { art  0.13    conj   0.13    conj   0.24
  9      { prep 0.10    { art  0.12    { art  0.22

These rankings are very similar for all three scales. The only difference which can be detected is the relegation of prepositions to ninth place on the 2-point scale, instead of the seventh place they hold on the other two scales. (The different ranking of pronouns and verbs on the 31-point scale is irrelevant.) Most of the differences between the classes are significant: except for two cases on the 31-point scale (see the table caption), all differences between classes which are not adjacent in the rankings are significant, and of the differences between adjacent classes four are nonsignificant on the 2-point scale, two are nonsignificant on the 4-point scale, and three are nonsignificant on the 31-point scale (giving a total of five differences which are not significant for this scale). These figures are quite similar, with a small bias in favour of the 4-point scale, where the highest number of significant differences was found.

Information structure
Chafe (1994) states that new information is more prominent than non-new information. To test the validity of this statement we compared the prominence ratings of all words carrying new information with the most prominent word carrying non-new information in the same phrase (20 cases), thus testing the hypothesis that new information is more prominent than other information (H1). H0 states that the perceived prominence of the new information is less than or equal to that of the given/accessible information. In four cases (three on the 31-point scale) the new information is not more prominent than the non-new information, in which case H0 cannot be dismissed. Of the remaining 16 (17) cases, where the new information had higher prominence ratings than the non-new information, nine were significant on the 2-point scale (Fisher exact test, one-tailed, p < 0.05); 15 were significant on the 4-point scale and 14 on the 31-point scale (WMW, one-tailed, p < 0.05). Here we find a clear difference between the 2-point scale and the 4-point and 31-point scales in the number of significant differences. Our conclusion about the relative prominence levels of new versus non-new information would therefore be affected by our choice of scale, provided that we want to verify observed differences in mean ratings statistically.
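The per-pair tests can be sketched as below, with invented rating vectors standing in for one new/non-new word pair (19 raters per group, as in the experiment).

```python
# Hedged sketch of the significance tests described above; all counts and
# ratings are invented stand-ins for one new/non-new word pair.
import numpy as np
from scipy.stats import fisher_exact, mannwhitneyu

# 2-point scale: 19 binary ratings per word -> 2x2 table for Fisher's test
new_hits, nonnew_hits = 15, 6
table = [[new_hits, 19 - new_hits], [nonnew_hits, 19 - nonnew_hits]]
odds, p_binary = fisher_exact(table, alternative="greater")   # one-tailed

# 4- or 31-point scale: compare the two rating samples directly (WMW)
new_ratings = np.array([3, 2, 3, 3, 1, 2, 3, 2, 3, 3, 2, 3, 1, 3, 2, 3, 3, 2, 3])
nonnew_ratings = np.array([1, 0, 2, 1, 1, 0, 1, 2, 0, 1, 1, 0, 1, 1, 2, 0, 1, 1, 0])
u, p_graded = mannwhitneyu(new_ratings, nonnew_ratings, alternative="greater")
print(p_binary, p_graded)
```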
Correlation with F0
The prominence level of a Danish accented syllable, and of the word in which it occurs, is generally felt to be associated with, among other cues, a rise in F0: the greater the rise, the more prominent the syllable is perceived to be. For this investigation two F0 values were measured for all words in which such a rise occurs: the F0 trough and the F0 peak value within the domain from the onset of the accented vowel to the end of the word (since we were concerned with word-level prominence). The rise is expressed as the difference in semitones between these two values, and the values for the rises were then correlated with the prominence ratings from the three scales. The results are displayed in Table 3.

Table 3. Correlation (Spearman's ρ) between perceived prominence and F0 rise.

  Scale   2-pt     4-pt     31-pt
  ρ       0.593    0.626    0.606

The correlation coefficients are very similar for the three data sets, indicating that the association between prominence and F0 can be described equally well regardless of the scale used. To the (slight) extent that any difference can be detected, it seems that the correlation is better with data obtained on the 4-point scale.
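For reference, the semitone conversion used for the rises follows the standard twelve-semitones-per-octave definition; the F0 values below are illustrative only.

```python
# The semitone computation spelled out; the F0 values are invented.
import math

def rise_in_semitones(f0_trough_hz, f0_peak_hz):
    # 12 semitones per octave, i.e. per doubling of F0
    return 12.0 * math.log2(f0_peak_hz / f0_trough_hz)

print(rise_in_semitones(110.0, 135.0))   # about 3.5 semitones
```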
Rater effort, or level of difficulty
In a few places we have described the 2-point scale, and to some extent the 4-point scale, as "simpler" and less difficult for the rater than the 31-point scale. At least this was our expectation, and as an attempt to capture this we measured the time consumption for each phrase and the number of times the raters listened to each phrase. The hypothesis is that both of these measures will increase with an increase in the number of scale points. This hypothesis was in fact borne out: there is an increase in time consumption of 18% when going from two to four scale points, and an increase of 42% when going from two to 31 points. All pairwise comparisons between the three scales are significant (t-tests, one-tailed, p < 0.05). The pattern is less clear for the number of playbacks, where only the tendency for more playbacks on the 31-point scale compared with the 2- and 4-point scales is statistically significant. It must be concluded, though, that using more scale points results in a somewhat higher "cost".

Discussion and conclusion
Two main questions were asked about the influence of scale type on ratings of perceived prominence: 1) do we get the same prominence relations in utterances, as expressed in mean values and rankings, and 2) does scale type affect our ability to make observations about statistically significant differences between words? The overall conclusion must be that the perceived prominence relations in the utterances are very similar whether expressed on a 2-point scale, a 4-point scale or a 31-point scale. The differences are small and are mostly caused by a tendency for some raters to prefer a restricted range within a multi-level scale. The differences are also relatively small when it comes to statistical testing of observations, but it does seem that raising the number of scale points from two to four yields slightly better results: there are more significant differences between the parts of speech and between words with new versus given/accessible information, and the correlation with F0 is better. No such improvement can be obtained, however, by raising the number of scale points to 31. On the contrary, we find slightly fewer significant differences on this scale. One reason for this finding may be that it is too difficult for untrained listeners to use the 31-point scale. In a parallel experiment (to be reported elsewhere) we had five expert listeners rate the same phrases as in this experiment (with slightly different instructions). The performance of this group was generally better than that of any random group of five untrained listeners (higher Cronbach α coefficient and more significant differences), which indicates that they did in fact do better on this scale. The analysis also showed, however, that five expert listeners cannot replace a larger group of untrained listeners if the objective is to find statistically significant differences – the number of observations becomes too small.

It was shown that "expenses", especially in terms of time consumption, grew with an increase in the number of scale points. Combined with the above observations, this points to a recommendation of using many listeners rating on a scale with relatively few levels. A 2-point scale may then be adequate for most purposes and makes for the simplest and fastest task, but it would appear that increasing the number of levels to four results in slightly better performance. There seems to be no justification for using a 31-point scale, unless the requirement of using many listeners cannot be met. The task becomes more difficult and takes more time, and there is no gain in terms of precision or "discriminatory power" to balance the extra cost.

References
Fant, G. and Kruckenberg, A. (1989). Preliminaries to the study of Swedish prose reading and reading style. STL-QPSR 2/1989, 1-83.
Wightman, C. (1993). Perception of multiple levels of prominence in spontaneous speech. ASA 126th Meeting, Denver (abstract).
Chafe, W. (1994). Discourse, consciousness, and time: the flow and displacement of conscious experience in speaking and writing. Chicago: The University of Chicago Press.

The postvocalic consonant as a complementary cue to the perception of quantity in Swedish – a revisit

Bosse Thorén
Dept. of Linguistics, Stockholm University

Abstract
Is the duration of the post-vocalic consonant in stressed syllables an important property when teaching Swedish as an L2? Is it a cue to the discrimination of /V:C/ and /VC:/ words, or a buffer for proper syllable duration, or both? Four Swedish words, providing two minimal pairs with respect to phonological quantity and containing the vowel phonemes /ɛ/ and /ʉ/, were gradually changed temporally from /V:C/ to /VC:/ and vice versa. Manipulations of durations were made in two series – one changing vowel duration only, and one changing vowel and consonant duration in combination. 30 native Swedish listeners decided whether they perceived the test words as the original quantity type or not. The results show that the duration of the post-vocalic consonant had a substantial influence on how the listeners categorized the test words. The study also includes naturalness judgements of the test words, and here the "proper" post-vocalic consonant duration had a positive influence on the listeners' judgements of naturalness for /ɛ/ but not for /ʉ/.

Introduction
Teaching and learning the pronunciation of a second language comprises many considerations as to which phonetic features are more or less important in order to – on the one hand – make oneself understood, and on the other hand not disturb the listener. Bannert (1984) states: "…to improve pronunciation when learning a foreign language, … linguistic correctness has been the guiding principle. It seems however, that hardly any consideration has been given to the native listener's problems of understanding foreign accent".

In the past 20-25 years, a simplified description of Swedish prosody for pedagogical use has appeared in a number of teaching media (Kjellin 1978, Fasth & Kannermark 1989, Slagbrand & Thorén 1997). The description is based on the temporal organization of Swedish, with stressed syllables having longer duration than unstressed ones, and the well-known quantity contrast in stressed syllables, manifesting itself as either /V:C/ or /VC:/. The simplified prosodic description is henceforth called basic prosody, or BP, and in the teaching situation it can be reduced to a short recommendation: "lengthen the proper speech sound", thus aiming at enhancing word stress as well as the quantity contrast, both of which depend mainly on duration as a perceptual cue for the listener (Fant & Kruckenberg 1994, Thorén 2003). Measurements of Swedish syllable duration have shown that stressed syllables are 50-100% longer than unstressed syllables (e.g. Strangert 1985). If a stressed syllable containing a short vowel is going to be lengthened, increased post-vocalic consonant duration is one way of maintaining the proper duration of the stressed syllable. The importance of Swedish word stress was studied by Bannert (1986), who showed that word stress on the improper syllable can make otherwise familiar words unintelligible to the Swedish listener. The importance of the quantity contrast is evident from the fact that there are numerous minimally contrasting word pairs, also within the same part of speech. The role of duration as an important cue to both word stress and the quantity feature makes it reasonable to assign it great importance in Swedish pronunciation. Thorén (2001) showed that digitally increased duration of phonologically long segments in Swedish spoken with a foreign accent tended to be judged as improved Swedish pronunciation by native Swedish listeners, with a similar effect for lengthening of vowel and of consonant duration. The non-native speaker in that study was Polish, and Polish is a language without phonological quantity. The last-mentioned study deals with the duration feature as a maintainer of euphony rather than phonology, and BP is assumed to enhance naturalness as well as word stress and the quantity contrasts.

In this intricate interplay between sentence stress, word stress and quantity, the perceptual role of the postvocalic consonant is probably the least explored property. The present study is an attempt to evaluate the role of the duration of the postvocalic consonant as a cue to the phonological quantity contrast. In addition to temporal correlates, the phonological quantity contrast is also known to rely on spectral cues to varying degrees, depending on vowel phoneme and regional variety. Studies testing the perceptual roles of these correlates (Hadding-Koch & Abramson 1964, Behne et al. 1997, Thorén 2003) have arrived at somewhat different conclusions, but agree that duration is the overall most important cue to the quantity contrast, and that the /a/ phoneme ([ɑ:]-[a]) and even more the /ʉ/ phoneme ([ʉ:]-[ɵ]) depend more on spectrum than the rest of the Swedish vowel inventory. In a study by Thorén (2003), in which spectral properties were kept intact while both vowel and consonant durations were manipulated in a complementary way, most of the listeners perceived even the /ʉ/ words as the non-original quantity type, which was not the case in the study by Hadding-Koch & Abramson (1964). This indicates that the total timing of the VC sequence, rather than mere vowel duration, can be important for the discrimination of /V:C/ and /VC:/, and that the listeners in Thorén (2003) may have used the post-vocalic consonant as a complementary cue. The conditions are, however, confounded, since Thorén (2003) used a central standard Swedish speaker and listeners from all over the Swedish-speaking area, while Hadding-Koch & Abramson (1964) used a south Swedish speaker and listeners. South Swedish (Skåne dialect) is known for having smaller differences in consonant duration after long and short vowel allophones (Gårding 1974), and moreover for having other spectral properties in the vowel system than central standard Swedish.
The present study compares two series of vowel duration manipulations: one with changes of vowel duration only, and one with vowel and consonant duration changed in combination, in accordance with the complementary VC relation in Swedish. This method makes it possible to evaluate consonant duration as a possible complementary cue to the Swedish quantity distinction.

Hypothesis 1: Complementary vowel + consonant duration change helps the listener perceive the non-original quantity type with less vowel duration change than change of vowel duration only.

Hypothesis 2: Test words with complementary durations – /V:C/ or /VC:/ – will be judged as more natural sounding than words with "correct" vowel duration and "wrong" consonant duration, i.e. /V:C:/ or /VC/.

Method

Stimuli
The test words in the present study are mäta [mɛ:ta] 'to measure', mätta [mɛt:a] 'to satisfy', skuta [skʉ:ta] 'boat' and skutta [skɵt:a] 'to scamper'. These words provide two minimal pairs with respect to phonological quantity: one pair contains the vowel phoneme /ɛ/ and the other pair contains the vowel phoneme /ʉ/. The words were recorded in a fairly damped room in the present author's home, using a Røde NT3 condenser microphone and a Sony MZ-N710 mini-disc player. The speaker was a Swedish male speaking central standard Swedish (Stockholm variety). The test words were pronounced within a carrier phrase: Det var …… jag menade 'It was …… that I meant'. Vowel and consonant durations in the test words were manipulated in Praat (Boersma & Weenink 2001). All stimuli were given stepwise vowel duration changes. Half of the stimuli kept a constant consonant duration, identical with that of the original quantity type, and the other half were given stepwise consonant duration changes based on the original values for the non-original quantity type. The manipulated durations are shown in Table 1.
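A stepwise manipulation of this kind can be sketched as follows, assuming the parselmouth interface to Praat; the file name, the vowel interval and the duration values are hypothetical placeholders, not the exact settings used in the study.

```python
# Sketch of stepwise duration manipulation via Praat's overlap-add
# resynthesis, using the parselmouth library; segment times are assumed.
import parselmouth
from parselmouth.praat import call

def scale_segment(sound, t_start, t_end, factor, eps=0.001):
    """Resynthesize `sound` with [t_start, t_end] stretched by `factor`."""
    manipulation = call(sound, "To Manipulation", 0.01, 75.0, 600.0)
    tier = call(manipulation, "Extract duration tier")
    # Unity scaling just outside the segment, `factor` just inside it.
    call(tier, "Add point", t_start - eps, 1.0)
    call(tier, "Add point", t_start + eps, factor)
    call(tier, "Add point", t_end - eps, factor)
    call(tier, "Add point", t_end + eps, 1.0)
    call([manipulation, tier], "Replace duration tier")
    return call(manipulation, "Get resynthesis (overlap-add)")

snd = parselmouth.Sound("mata.wav")          # hypothetical recording
# e.g. shorten a 188 ms vowel towards 88 ms in 20 ms steps (cf. Table 1)
for step, v_dur in enumerate([188, 168, 148, 128, 108, 88]):
    stim = scale_segment(snd, 0.100, 0.288, v_dur / 188.0)  # assumed vowel span
    stim.save(f"stim_v{step}.wav", "WAV")
```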
Table 1. Vowel and consonant (occlusion) durations (ms) for the manipulated stimuli in the present study. Shaded parts of the original table (not reproduced here) marked original durations for the non-original quantity type.

Series 1 – changing vowel duration only:
  Original [mɛ:ta]:   V 188, 168, 148, 128, 108, 88;   C constant at 153
  Original [mɛt:a]:   V 136, 156, 176, 196, 216, 236;  C constant at 334
  Original [skʉ:ta]:  V 141, 121, 101, 81, 61, 41;     C constant at 166
  Original [skɵt:a]:  V 166, 186, 206, 226, 246, 266;  C constant at 312

Series 2 – changing vowel and consonant duration in combination:
  Original [mɛ:ta]:   V 188, 168, 148, 128, 108, 88;   C 234, 254, 274, 294, 314, 334
  Original [mɛt:a]:   V 136, 156, 176, 196, 216, 236;  C 253, 233, 213, 193, 173, 153
  Original [skʉ:ta]:  V 141, 121, 101, 81, 61, 41;     C 232, 252, 272, 292, 312, 332
  Original [skɵt:a]:  V 166, 186, 206, 226, 246, 266;  C 246, 226, 206, 186, 166, 146

Listeners
30 native speakers of Swedish listened to the 48 stimulus words, marking whether they perceived them as /V:C/ or /VC:/. The listeners were between 23 and 60 years of age and had different regional varieties of Swedish as their L1. None of them had any hearing deficiencies that affected their perception of normal speech.

Presentation
The 48 stimuli were presented in random order, in the carrier phrase, each preceded by the reading of the stimulus number. The test was presented from a CD player via headphones. The listener was first allowed to hear 2-3 stimuli while adjusting the sound level. The response was marked on an answer sheet presenting the number and the pair of words providing the two choices. The listener had to choose one of the two possibilities. Naturalness rating was done in direct connection with each identification task: after hearing the test word a second time, the listener marked a figure (1-10) on a horizontal 10 cm scale, where 1 represented "totally unnatural or unlikely pronunciation for a native speaker of Swedish" and 10 "totally natural pronunciation for a native speaker of Swedish, regardless of regional variety".

Results
In both the vowel lengthening series and the vowel shortening series, the complementary consonant manipulation seems to have a distinct influence on the listeners' perception of /V:C/ or /VC:/ (Figures 1 and 2). Listeners start to perceive stimuli as the non-original quantity type at a lower degree of vowel duration change when the post-vocalic consonant duration follows the complementary pattern. For /ʉ/, the complementary manipulation seems to make less difference than for /ɛ/, both when going from /V:C/ durations to /VC:/ and vice versa. The overall effect of duration change is greater for /ɛ/ than for /ʉ/, which is expected because of the greater difference in formant spectrum between the long and short allophones of /ʉ/.

"Correct" consonant duration gave higher naturalness ratings in the two /ɛ/ series, but had a vague effect in the /ʉ/ series: there was a slight positive effect when going from original skutta [skɵt:a] towards skuta [skʉ:ta], and a small but consistent negative effect when going in the other direction. The observed effect on naturalness of post-vocalic consonant duration in the two series containing the /ʉ/ phoneme has low significance, due to the smaller number of "non-original quantity type" responses.

Figure 1. Number of /VC:/ responses for each value of vowel duration in the original /V:C/ words (mäta, skuta). Filled squares represent manipulations of both vowel and consonant durations; open squares represent manipulations of vowel duration only. (Plots not reproduced.)

Figure 2. Number of /V:C/ responses for each value of vowel duration in the original /VC:/ words (mätta, skutta). Filled squares represent manipulations of both vowel and consonant durations; open squares represent manipulations of vowel duration only. (Plots not reproduced.)
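One way to summarise identification curves like those in Figures 1 and 2 is to estimate the vowel duration at which half of the listeners switch category. The sketch below does this by linear interpolation; the response counts are invented (the paper itself reports the full curves rather than boundary values).

```python
# Illustrative category-boundary estimate from identification counts;
# the counts below are hypothetical, not the study's data.
import numpy as np

v_dur = np.array([188, 168, 148, 128, 108, 88])   # stepwise vowel durations (ms)
vc_counts = np.array([2, 4, 9, 17, 24, 28])       # invented /VC:/ responses of 30

def boundary_ms(durations, counts, n_listeners=30, criterion=0.5):
    p = counts / n_listeners          # proportion rises as the vowel shortens
    return float(np.interp(criterion, p, durations))

print(boundary_ms(v_dur, vc_counts))  # vowel duration at the 50% crossover
```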
Conclusion
The results show that the duration of the post-vocalic consonant is more than a means of assigning the proper length to stressed syllables: it obviously plays a distinctive role in the perception of quantity type in the present material. Since the vowels involved represent the maximal (/ʉ/) and the minimal (/ɛ/) spectral difference between long and short vowel allophones in the Swedish vowel inventory, the result indicates that the duration of the post-vocalic consonant functions as a general complementary cue to the perception of quantity type in Swedish. The ambiguous contribution of "correct" consonant duration to naturalness for /ʉ/ can probably be accounted for by the already damaged naturalness caused by changing durations while leaving spectral properties intact. In the case of /ɛ/, the listeners were probably not disturbed by "incorrect" vowel timbre, and could consequently appreciate the adjusted consonant duration more easily. Since there is already ample evidence for the greater duration of stressed syllables in Swedish, it can be assumed that the duration of the post-vocalic consonant contributes to the perception of quantity, word stress and – in most cases – improved naturalness. This in turn makes it reasonable to regard both vowel and consonant duration as important properties when learning Swedish as a second language.

References
Bannert R. (1984) Prosody and intelligibility of Swedish spoken with a foreign accent. Nordic Prosody III. Acta Universitatis Umensis, Umeå Studies in the Humanities 59, 7-18.
Bannert R. (1986) From prominent syllables to a skeleton of meaning: a model of prosodically guided speech recognition. Proceedings of the XIth ICPhS, Tallinn, 73-76.
Behne D., Czigler P. and Sullivan K. (1997) Swedish quantity and quality: a traditional issue revisited. Phonum 4, Dept. of Linguistics, Umeå University.
Boersma P. & Weenink D. (2001) Praat – a system for doing phonetics by computer. http://www.fon.hum.uva.nl/praat/
Fant G. and Kruckenberg A. (1994) Notes on stress and word accent in Swedish. KTH, Speech Transmission Laboratory, Quarterly Progress and Status Report 2-3, 125-144.
Fasth C. & Kannermark A. (1989) Goda grunder. Kursverksamhetens förlag, Lund.
Gårding E. (1974) Den efterhängsna prosodin. In Teleman & Hultman (eds.) Språket i bruk. Liber, Lund.
Hadding-Koch K. & Abramson A. (1964) Duration versus spectrum in Swedish vowels: some perceptual experiments. Studia Linguistica 18, 94-107.
Kjellin O. (1978) Svensk prosodi i praktiken. Hallgren & Fallgrens studieförlag, Uppsala.
Slagbrand Y. & Thorén B. (1997) Övningar i svensk basprosodi. Semikolon, Boden.
Thorén B. (2001) Vem vinner på längden? Två experiment med manipulerad duration i betonade stavelser. D-uppsats i fonetik, Institutionen för filosofi och lingvistik, Umeå universitet.
Thorén B. (2003) Can V/C-ratio alone be sufficient for discrimination of V:C/VC: in Swedish? A perception test with manipulated durations. Phonum 9, Department of Phonetics, Umeå University, 49-52.

Gender differences in the ability to discriminate emotional content from speech

Juhani Toivanen1, Eero Väyrynen2 and Tapio Seppänen2
1 MediaTeam, University of Oulu and Academy of Finland
2 MediaTeam, University of Oulu

Abstract
In this paper, an experiment is reported which was carried out to investigate gender differences in the ability to infer emotional content from speech. Fourteen professional actors (eight men, six women) produced simulated emotional speech data representing the most important basic emotions (three emotions in addition to neutral). Each emotion was simulated when reading aloud a semantically neutral text. Fifty-one listeners (27 males, 24 females) were asked to listen to the speech samples and choose (among the four options) the most appropriate emotional label describing the simulated emotional state. The female listeners were consistently better at discriminating the emotional state from speech than the male subjects. The results suggest that females are emotionally more sensitive than males, as far as emotion recognition from voice is concerned.

Introduction
Phoneticians, speech scientists and engineers are taking an increasing interest in the role of the expression of emotion in speech communication. In addition to so-called basic emotions, other global speaker states are investigated, for example irritation and trouble in communication (Batliner et al. 2003).
A major approach in basic (phonetic) research has been to investigate the vocal parameters of specific emotions, and these parameters are now understood relatively well. Nowadays, the role of the vocal expression of emotion is gaining increasing importance in the computer speech community, for example in the applied context of the automatic discrimination/classification of emotional content from speech (ten Bosch 2003). It can be argued that, after a long exploratory stage, the study of the vocal expression of emotion is reaching a level of maturity where the main focus is on important applications, particularly those involving human-computer interaction.

In the study of the vocal communication of emotion, an important taking-off point is the base-line data, i.e. the human emotion discrimination performance level. There is now a relatively large literature on the human discrimination of emotions from speech: reviewing over 30 studies of the subject conducted up to the 1980s, Scherer (1989) concludes that an average accuracy of about 60% can be obtained in experiments where listeners are to infer emotional content from vocal cues only (without any help from lexis etc.). In a recent large-scale cross-cultural study (Scherer et al. 2001), an accuracy level of 66% was found, across emotions (neutral, anger, fear, joy, sadness and surprise) and cultural contexts (Europe, Asia and the US). In a western cultural context, vocal recognition of six emotions (neutral, anger, fear, joy, sadness and disgust) was 62%.

Typically, in investigations of the human discrimination of emotions, a standard speech sample (an utterance or a short passage) is used: the same lexical content is produced (often by actors) with different simulated emotions, and test subjects are asked to choose the most appropriate emotional label for each sample (among the intended emotion categories). The emotions investigated in these studies usually represent "basic emotions": it is argued that certain emotions – at least fear, anger, happiness, sadness, surprise and disgust – are the most important or basic emotions (because they are seen to represent survival-related patterns of responses to events in the environment).

Although the vocal expression of emotions has been investigated rather intensively, at least as far as simulated data is concerned (and empirical evidence has accumulated indicating how well basic emotions can be discriminated by human listeners in different cultures), there has been little research on inter-subject differences (within a culture) in emotion discrimination ability. Usually, the emotion recognition performance level of a group of test subjects is reported as a single numerical value, without making any intra-group distinctions. Thus there is very little reported empirical evidence concerning possible differences between female and male subjects, for example, in their ability to infer emotional content from vocal cues only. This paper concentrates on the inter-gender differences in emotion recognition ability in a simulated emotional speech data context.
The research question is: within a speech community, are female listeners better than male listeners at distinguishing between different emotions in speech? And if they are better, are they consistently so – that is, are they better than male listeners also at discriminating emotions from speech produced by male speakers? To our knowledge, these are questions no one has systematically addressed in the literature on the vocal communication of emotion.

Speech data
For the purposes of the present study, simulated emotional speech data was collected. Fourteen professional actors (eight men, six women) produced the speech data. The speakers were aged between 26 and 50 (the average age was 39); all were speakers of the same northern variety of Finnish. The speakers read out a phonetically rich Finnish passage of some 120 words, simulating three basic emotions in addition to neutral: sadness, anger and happiness/joy. The text was emotionally completely neutral, representing matter-of-fact newspaper prose. The recordings were made in an anechoic chamber using a high-quality condenser microphone and a DAT recorder to obtain a 48 kHz, 16-bit recording. The data was stored on a PC as wav format files. Each monologue was divided into five consecutive segments of equal duration for discrimination experiment purposes; thus there was a total of 280 emotional speech samples with an average duration of 13 seconds (five samples for four emotions by fourteen speakers).

Human discrimination experiments
A performance test for human emotion discrimination was carried out in the form of listening tests. The listeners were students in a junior high school, aged between 14 and 15. Fifty-one subjects (27 males, 24 females) participated as volunteers. All were speakers of the same northern variety of Finnish (the actors were speakers of the same variety). The listening tests took place in a classroom where the subjects heard the speech data (280 speech samples) from two high-quality computer speakers. The emotional labels to choose between were limited to the intended emotions, not containing any distracters. The subjects heard the samples in random order in eight consecutive sessions within a period of two months (each session was arranged at the beginning of a lesson).

Results
Tables 1-9 show the results of the experiment: the emotion discrimination performance of the subjects is first presented in toto (female and male subjects listening to female and male speakers), and then the results are broken down into sub-categories (females listening to all speakers, females listening to females only, etc.). Each table is a confusion matrix, where the rows indicate the intended emotions and the columns the recognized emotions. The diagonal percentages (underlined in the original) indicate the discrimination accuracy for each specific emotion. The average emotion recognition performance level in each setting is given as the "TOTAL" percentage.

Table 1. Emotion discrimination from voice: females and males listening to females and males. TOTAL: 76.9 %.

             Neutral   Sad      Angry    Happy
  Neutral    78.4 %    16.9 %   2.6 %    2.1 %
  Sad        12.9 %    85.3 %   1.0 %    0.8 %
  Angry      14.9 %    2.9 %    76.9 %   5.3 %
  Happy      24.3 %    5.4 %    3.3 %    67.0 %

Table 2. Emotion discrimination from voice: females and males listening to males. TOTAL: 76.1 %.

             Neutral   Sad      Angry    Happy
  Neutral    77.6 %    17.2 %   2.9 %    2.2 %
  Sad        14.7 %    83.9 %   1.1 %    0.4 %
  Angry      14.6 %    1.2 %    78.2 %   6.0 %
  Happy      26.2 %    5.3 %    3.6 %    64.9 %
Table 3. Emotion discrimination from voice: females and males listening to females. TOTAL: 77.9 %.

             Neutral   Sad      Angry    Happy
  Neutral    79.5 %    16.3 %   2.2 %    2.0 %
  Sad        10.5 %    87.1 %   0.9 %    1.4 %
  Angry      15.3 %    5.3 %    75.2 %   4.2 %
  Happy      21.7 %    5.5 %    3.0 %    69.8 %

Table 4. Emotion discrimination from voice: males listening to females and males. TOTAL: 74.4 %.

             Neutral   Sad      Angry    Happy
  Neutral    78.4 %    15.9 %   3.3 %    2.4 %
  Sad        14.4 %    83.0 %   1.6 %    1.0 %
  Angry      16.8 %    3.8 %    73.3 %   6.1 %
  Happy      26.8 %    6.0 %    4.3 %    62.8 %

Table 5. Emotion discrimination from voice: females listening to females and males. TOTAL: 79.7 %.

             Neutral   Sad      Angry    Happy
  Neutral    78.4 %    18.0 %   1.8 %    1.8 %
  Sad        11.1 %    87.9 %   0.3 %    0.7 %
  Angry      12.7 %    1.9 %    81.1 %   4.3 %
  Happy      21.4 %    4.6 %    2.2 %    71.7 %

Table 6. Emotion discrimination from voice: males listening to males. TOTAL: 73.9 %.

             Neutral   Sad      Angry    Happy
  Neutral    77.2 %    16.6 %   3.7 %    2.5 %
  Sad        15.9 %    81.8 %   1.8 %    0.5 %
  Angry      16.4 %    1.8 %    74.9 %   6.9 %
  Happy      28.3 %    5.6 %    4.7 %    61.4 %

Table 7. Emotion discrimination from voice: females listening to males. TOTAL: 78.8 %.

             Neutral   Sad      Angry    Happy
  Neutral    78.2 %    18.0 %   2.0 %    1.9 %
  Sad        13.2 %    86.3 %   0.2 %    0.2 %
  Angry      12.5 %    0.5 %    82.0 %   5.1 %
  Happy      23.9 %    4.9 %    2.4 %    68.8 %

Table 8. Emotion discrimination from voice: males listening to females. TOTAL: 75.1 %.

             Neutral   Sad      Angry    Happy
  Neutral    80.1 %    14.9 %   2.7 %    2.3 %
  Sad        12.5 %    84.6 %   1.3 %    1.6 %
  Angry      17.3 %    6.5 %    71.2 %   4.9 %
  Happy      24.9 %    6.6 %    3.9 %    64.7 %

Table 9. Emotion discrimination from voice: females listening to females. TOTAL: 81.1 %.

             Neutral   Sad      Angry    Happy
  Neutral    78.8 %    18.0 %   1.6 %    1.7 %
  Sad        8.2 %     90.1 %   0.5 %    1.3 %
  Angry      12.9 %    3.8 %    80.0 %   3.3 %
  Happy      18.2 %    4.2 %    2.0 %    75.6 %

The average human emotion discrimination ability was approximately 77 %, which can be regarded as a good result in the light of earlier research. What is more interesting from the viewpoint of this paper is the systematic advantage of the female listeners.
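A minimal sketch of the tally behind such confusion matrices is given below; the response pairs are invented. In the tables above, the TOTAL figure appears to equal (up to rounding) the unweighted mean of the diagonal per-emotion accuracies.

```python
# Sketch of a confusion-matrix tally from (intended, recognized) responses;
# the pairs below are invented stand-ins for the listening-test answers.
import numpy as np

EMOTIONS = ["neutral", "sad", "angry", "happy"]

def confusion_matrix(pairs):
    idx = {e: i for i, e in enumerate(EMOTIONS)}
    counts = np.zeros((4, 4))
    for intended, recognized in pairs:
        counts[idx[intended], idx[recognized]] += 1
    # Row-normalise to percentages (rows = intended emotion).
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)

pairs = [("sad", "sad"), ("sad", "neutral"), ("happy", "happy"),
         ("angry", "angry"), ("neutral", "neutral"), ("happy", "neutral"),
         ("neutral", "sad"), ("angry", "angry")]
cm = confusion_matrix(pairs)
print(cm.round(1))
print("TOTAL = %.1f %%" % np.diag(cm).mean())  # mean of per-emotion accuracies
```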
Discussion and conclusion
Looking at the results, it can be seen that the female subjects were better than the males in the emotion discrimination task in each setting: they were better (79 %) than the male listeners (74 %) at inferring emotional content also from the speech data produced by the male speakers. The male-male listening setting (74 %) in fact produced the lowest emotion discrimination performance level among the nine settings. The best results were, unsurprisingly, obtained in the setting involving females listening to female speakers (81 %). However, it cannot be argued that the male listeners were really poor with female speech data (75 %), bearing in mind the general accuracy results reported in the literature. All in all, the female speakers produced vocal portrayals of emotions which were easier to interpret (by both sexes) than those produced by the male speakers (78 % vs. 76 %). The emotional state which was best recognized in the whole data was sadness in the female-female setting (90 %); the most difficult emotion was happiness in the male-male setting (61 %). It would be interesting to speculate about the possible reasons for this: maybe males are not interested in finding happiness in fellow males, while females are generally empathetic towards other females in distress?

That the female listeners/speakers were better with the vocal communication of emotion than the male listeners/speakers is not surprising. A relevant concept in this context may, in fact, be empathy. Psychological research (see e.g. Tannen 1991) has shown that female superiority in empathizing is manifested in interaction by the following trends, for example: females' speech involves much more direct talk about feelings and affective states than "guy talk", females are usually more co-operative and reciprocal in conversation than males, and females are much quicker to respond empathically/emotionally to the distress of other people. It has been shown that, from birth, females look longer at faces, and particularly at people's eyes, while males are more prone to look at inanimate objects (Connellan et al. 2001). The results of this study support the consensus view that, emotionally, females are more sensitive than males; this time concrete evidence is presented for the vocal (prosodic, non-lexical) communication of emotion. To draw more far-reaching conclusions, however, we need more speakers to produce the speech data, so that we can exclude the possible effect of speaker-specific idiosyncrasies on the results of the listening tests.

References
Batliner A., Fischer K., Huber R., Spilker J. and Nöth E. (2003) How to find trouble in communication. Speech Communication 40, 117-143.
Connellan J., Baron-Cohen S., Wheelwright S., Ba'tki A. and Ahluwalia J. (2001) Sex differences in human neonatal social perception. Infant Behavior and Development 23, 113-118.
Scherer K. R. (1989) Vocal correlates of emotion. In Wagner H. and Manstead A. (eds.) Handbook of Psychophysiology: Emotion and Social Behavior, 165-197. London: Wiley.
Scherer K. R., Banse R. and Wallbott H. G. (2001) Emotion inferences from vocal expression correlate across languages and cultures. Journal of Cross-Cultural Psychology 32, 76-92.
Tannen D. (1991) You just don't understand: Women and men in conversation. London: Virago.
ten Bosch L. (2003) Emotions, speech and the ASR framework. Speech Communication 40, 213-225.

Vowel durations of normal and pathological speech

Antonis Botinis1,2, Marios Fourakis3 and Ioanna Orfanidou1
1 Department of Linguistics, University of Athens, Athens, Greece
2 School of Humanities and Informatics, University of Skövde, Skövde, Sweden
3 Department of Speech and Hearing Science, The Ohio State University, Columbus, USA

Abstract
This is an experimental investigation of vowel durations in Greek, produced by normal speakers as well as speakers with cerebral palsy mobility dysfunction. The results indicate that mobility, gender and stress have significant effects on vowel durations. There are also significant interactions between mobility and stress as well as between gender and stress, but not between mobility and gender.
Introduction
This is an experimental investigation of vowel durations in Greek as a function of mobility, gender, stress and vowel category. Two main questions are addressed: (1) what are the effects of the investigated factors? And (2) what are the interactions between the factors?

Considerable research has been carried out on Greek and contrastive prosody with regard to temporal structures and vowel durations (see e.g. Fourakis, 1986, Botinis, 1989). Different vowels have different intrinsic durations, according to which low vowels are longer than high vowels, and back vowels tend to be longer than front vowels (Fourakis et al., 1999). Stress has a temporal effect on vowels, according to which stressed vowels are longer than unstressed vowels (Botinis, 1989, Botinis et al., 2001, 2002). Gender also has a temporal effect on vowels: vowels produced by female speakers are longer than vowels produced by male speakers. The effect of gender is language-specific, as it has been reported for some languages, such as Greek and Albanian, but not for others, such as English and Ukrainian (Botinis et al., 2003). Our knowledge of pathological speech and cerebral palsy mobility dysfunction is very limited, and the main target of the present investigation is thus to produce basic data and initiate research on speech produced by speakers with various pathologies.

Experimental methodology
The speech material under investigation consists of disyllabic nonsense words in the context of a meaningful carrier phrase. The words have a CVCV segmental structure where the first vowel (V) is one of the five Greek vowels, i.e. {i, e, a, o, u}, in the carrier phrase "to kláb sVsa pézi kalí musikí" (The club sVsa plays good music). The nonsense key words were produced with lexical stress either on the first or the second syllable, and the speech material was produced at normal tempo with no prosodic break, on an individual basis. The speakers were six persons with cerebral palsy dysfunction and six persons with no known pathologies (henceforth the "mobility factor"), all with standard Athenian Greek pronunciation. Each group comprised three female and three male speakers. Acoustic analysis was carried out with the use of WaveSurfer, and measurements were made of the vowel durations from the waveform. The results were subjected to statistical analysis with the StatView software package, and ANOVA tests were carried out. In the remainder of this paper, the results are presented next, followed by discussion and conclusions.

Results
The results are presented in Figures 1-6, based on the acoustic analysis and duration measurements of the total speech material in accordance with the experimental methodology.

Figure 1 shows overall vowel durations as a function of mobility and gender. Vowels produced by speakers with cerebral palsy were significantly longer than vowels produced by speakers with no pathologies (F(1,596)=40.08, p<.001). Vowels produced by female speakers were longer than vowels produced by male speakers (F(1,596)=14.18, p<.001). The interaction was not significant.

(Bar charts not reproduced.)
Figure 1. Overall vowel durations (ms) as a function of mobility and gender.
Figure 2. Overall vowel durations (ms) as a function of mobility and stress.
Figure 3. Overall vowel durations (ms) as a function of gender and stress.
Figure 4. Individual vowel durations (ms) as a function of mobility.
Figure 5. Individual vowel durations (ms) as a function of gender.
Figure 6. Individual vowel durations (ms) as a function of stress.
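As a hedged illustration of this kind of two-way analysis (the study itself used StatView), the sketch below runs a mobility-by-gender ANOVA on invented durations with statsmodels.

```python
# Illustrative two-way ANOVA of the kind reported above; the data frame
# holds invented vowel durations, not the study's measurements.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "duration": [150, 138, 132, 120, 118, 110, 98, 92],   # vowel duration (ms)
    "mobility": ["cp", "cp", "cp", "cp", "norm", "norm", "norm", "norm"],
    "gender":   ["f", "f", "m", "m", "f", "f", "m", "m"],
})
model = ols("duration ~ C(mobility) * C(gender)", data=df).fit()
print(anova_lm(model, typ=2))   # F and p for both main effects and interaction
```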
Figure 2 shows overall vowel durations as a function of mobility and stress. As before, vowels produced by persons with cerebral palsy were significantly longer than vowels produced by persons with no pathologies. Stressed vowels were longer than unstressed vowels (F(1,596)=527, p<.001). However, the interaction was significant (F(1,596)=19, p<.001): the difference between the two groups of speakers was greater for the stressed vowels than for the unstressed vowels. Post hoc t-tests revealed that the differences between groups were significant for both stressed and unstressed vowels.

Figure 3 shows overall vowel durations as a function of gender and stress. As before, vowels produced by female speakers were longer than those produced by male speakers, and stressed vowels were longer than unstressed vowels. However, the interaction was significant (F(1,596)=43, p<.001). Post hoc t-tests revealed that the difference was significant for stressed vowels (t(298)=6.714, p<.001) but not for unstressed vowels (t(298)=-1.574, p>.05).

The next set of comparisons examined individual vowel durations, in order to determine whether the three factors (mobility, gender, stress) affected all vowels uniformly or differently for each vowel. Figure 4 shows individual vowel durations for each speaker group. As before, vowels produced by speakers with cerebral palsy were longer than vowels produced by speakers with no pathologies. In addition, vowel category also significantly affected vowel duration (F(1,590)=33.56, p<.001). However, there was no significant interaction (F(4,590)=1.0, p>.05). Thus, the effect of group was uniform for all vowels.

Figure 5 shows individual vowel durations for each gender. As before, vowels produced by female speakers were longer than vowels produced by male speakers. Vowel category also significantly affected vowel duration (F(1,590)=33.56, p<.001). However, there was no significant interaction (F(4,590)=2.364, p>.05). Even though the analysis of variance did not produce a significant interaction (due to large variances), it is evident from the figure that the vowel [i] was longer for male speakers than for female speakers.

Figure 6 shows individual vowel durations for each stress condition. As before, stressed vowels were longer than unstressed vowels. In addition, vowel category also significantly affected vowel duration (F(1,590)=66.44, p<.001). The interaction was also significant (F(4,590)=3.768, p<.01). The effect is due to the different behavior of the high vowels [i, u] versus the nonhigh vowels [e, o, a]: high stressed vowels were on average 42 ms longer than their unstressed counterparts, while nonhigh stressed vowels were on average 56 ms longer than their unstressed counterparts.

Discussion
In the present investigation, some old knowledge has been corroborated and some new knowledge has been produced. The old knowledge concerns the vowel category durations as well as the effects of stress and gender on vowel durations; the new knowledge concerns the effects of cerebral palsy dysfunction on vowel durations. Our results indicate that the investigated factors of mobility, gender and stress have significant effects on vowel durations: vowels produced by speakers with cerebral palsy dysfunction are longer than vowels produced by normal speakers, vowels produced by female speakers are longer than vowels produced by male speakers, and stressed vowels are longer than unstressed vowels. There were also significant interactions between mobility and stress as well as between gender and stress, i.e. both the mobility and the gender temporal effects are mostly correlated with stressed syllables and have hardly any effect on unstressed syllables.
Furthermore, the five Greek vowels have different intrinsic durations, which are mainly determined by the low vs. high dimension: low vowels are significantly longer than mid vowels which, in turn, are significantly longer than high vowels. Intrinsic vowel durations, also referred to as "microprosody", are widely documented in the phonetics literature and reported for many languages, among them Greek (Di Cristo and Hirst, 1986, Fourakis et al., 1999), and the present investigation corroborates earlier reports in this area.

The prosodic correlates of lexical stress have also been studied extensively, and the results indicate that, other phonetic contexts being equal, stressed syllables, and hence their vowels, are usually longer than unstressed syllables in a variety of related as well as unrelated languages (Beckman, 1986, Botinis, 1989, Sluijter, 1995, Fant et al., 1991, 2000, Botinis et al., 2002, de Jong, 2004). Duration has been reported as an invariable acoustic correlate, which also functions as a perceptual correlate of lexical stress distinctions in Greek (Botinis, 1989, Botinis et al., 1999). Lexical stress has variable effects on consonants and vowels in different languages: in some languages the effects of lexical stress are larger on vowels than on consonants, whereas in other languages the effects are more equally distributed over vowels and consonants (Botinis, 2003). The effects of lexical stress in Greek are larger than other prosodic effects such as focus and syllable position (Botinis et al., 2002).

The effects of gender on segmental durations have not drawn particular attention in prosodic research, and thus very little is known in this area. In the present investigation, experimental evidence has been provided that female speakers produce vowels with longer durations than male speakers. This is most probably a sociolinguistic effect: in Albanian and Greek, for example, female vowel productions have longer durations than male vowel productions, whereas in other languages, such as English and Ukrainian, no gender effects on segmental durations have been observed. However, the effects of gender on vowel durations are mostly evident in vowels of stressed syllables and not in vowels of unstressed syllables.

The most important finding, and the main target of the present investigation, concerns the mobility factor and the effects of cerebral palsy dysfunction on vowel durations. Evidently, cerebral palsy has a lengthening effect on vowel durations, which is however confined to the stressed syllables. Thus, speakers with cerebral palsy have satisfactory temporal control of the vowel durations of unstressed syllables, which implies that the temporal effects of cerebral palsy are not evident in speech production in general but are rather confined to particular prosodic and phonetic categories. The results of the present investigation provide a starting point for research in this area, and further work is required before the temporal structure of speech produced by persons with cerebral palsy is basically understood. Beyond the present results, this investigation has led to further immediate questions with reference to consonant and vowel productions.
Thus, on the one hand the effects of mobility on consonant durations and, on the other, the effects of mobility on the quality and formant structure of vowel productions are the most immediate questions to be dealt with in the framework of the present investigation.

Conclusions
In accordance with the results of the present investigation, the following conclusions have been drawn. First, the mobility factor, the gender factor and the stress factor each have a significant effect on vowel durations. Second, there are significant interactions between mobility and lexical stress as well as between gender and lexical stress, but not between mobility and gender. Thus, both the mobility factor and the gender factor have considerably bigger effects on stressed syllables than on unstressed ones.

Acknowledgements
Our sincere thanks to the speakers with cerebral palsy dysfunction as well as to our students at the University of Athens for their participation in the production experiments.

References
Beckman M. E. (1986) Stress and Non-stress Accent. Dordrecht: Foris.
Botinis A. (1989) Stress and Prosodic Structure in Greek. Lund University Press.
Botinis A., Bannert R., Fourakis M., and Dzimokas D. (2003) Multilingual focus production of female and male speakers. 6th International Congress of Greek Linguistics, University of Crete, Greece.
Botinis A., Bannert R., Fourakis M., and Pagoni-Tetlow S. (2002) Crosslinguistic segmental durations and prosodic typology. International Conference on Speech Prosody 2002, 183-186, Aix-en-Provence, France.
Botinis A., Fourakis M., Panagiotopoulou N., and Pouli K. (2001) Greek vowel durations and prosodic interactions. Glossologia 13, 101-123.
Botinis A., Fourakis M., and Prinou I. (1999) Prosodic effects on segmental durations in Greek. 6th European Conference on Speech Communication and Technology, EUROSPEECH 1999, vol. 6, 2475-2478, Budapest, Hungary.
de Jong K. J. (2004) Stress, lexical focus, and segmental focus in English: patterns of variation in vowel duration. Journal of Phonetics 32, 493-516.
Di Cristo A., and Hirst D. J. (1986) Modelling French micromelody: analysis and synthesis. Phonetica 43, 11-30.
Fant G., Kruckenberg A., and Liljencrants J. (2000) Acoustic-phonetic analysis in Swedish. In Botinis A. (ed.) Intonation: Analysis, Modelling and Technology, 55-86. Dordrecht: Kluwer Academic Publishers.
Fant G., Kruckenberg A., and Nord L. (1991) Duration correlates of stress in Swedish, French and English. Journal of Phonetics 19, 351-365.
Fourakis M. (1986) An acoustic study of the effects of tempo and stress on segmental intervals in Modern Greek. Phonetica 43, 172-188.
Fourakis M., Botinis A., and Katsaiti M. (1999) Acoustic characteristics of Greek vowels. Phonetica 56, 28-43.
Sluijter A. (1995) Phonetic Correlates of Stress and Accent. The Hague: Holland Academic Graphics.

Acoustic evidence of the prevalence of the emphatic feature over the word in Arabic

Zeki Majeed Hassan
Department of Linguistics, Gothenburg University, Sweden

Abstract
An acoustic study was carried out to see whether the phenomenon of pharyngalization and/or velarization is confined to the emphatic consonant and the adjacent vowels, or whether it extends over the whole word in Arabic. Measurements in Hz of F1 and F2 of front unrounded vowels in monosyllabic, bisyllabic and trisyllabic words of ISA containing emphatic vs. non-emphatic consonants were made.
They showed significantly greater narrowing between F1 and F2 for vowels in the vicinity of emphatic consonants than for those in the vicinity of non-emphatic consonants. This is attributed to the secondary coarticulatory configuration formed in the pharyngeal region by the projection of the root of the tongue toward the back wall of the pharynx, and a possible lowering of the velum toward a rising tongue dorsum, which prevails, though at different levels of significance, over the other syllables of the word.

Phonetic and phonological background
Arabic emphatic consonants are characterized by two types of articulation. Abercrombie (1967) calls them primary and secondary articulations, describing the secondary articulation as a stricture that involves less constriction of the vocal tract than the primary stricture. The most evident feature characterizing this secondary articulation, as agreed by many phoneticians, is the constriction of the pharynx, whence the term pharyngalization. In their cinefluorographic study, Ali and Daniloff (1972) found that during the articulation of emphatic consonants the tongue exhibits a simultaneous slight depression of the palatine dorsum and a rearward movement of the pharyngeal dorsum toward the posterior pharyngeal wall. They also observed a lowering of the velum toward a rising tongue dorsum. What is more interesting in Arabic phonology is that this secondary articulation determines phonemic distinctions between sounds having the same primary articulation, e.g. /seef/ 'sword' and /sˤeef/ 'summer', where /s/ and /sˤ/ share the same primary articulation, i.e. both are voiceless alveolar fricatives, except that /sˤ/ is velarized and/or pharyngalized.

As for the vowels preceding and/or following emphatic consonants, Hassan (1981) found that they show a closer F1 and F2 than those adjacent to non-emphatic consonants. His acoustic and myodynamic data also showed longer vowel duration before emphatic consonants than before non-emphatic ones, which he attributed to an earlier execution of the secondary constriction. Hence these vowels undergo a quantitative as well as a qualitative difference. He asserts that the phonetic exponents of the emphatic feature in Arabic are not confined to either the consonantal or the vocalic segment but stretch over both, or probably over the whole syllable. In the terms of Firth's (1957) prosodic analysis, the emphatic feature is considered a prosody, which is best seen as extending over units which can encompass more than one segment. McCawley, cited in Hyman (1975), argues that Jakobson's theory of distinctive features fails to account for velarized and pharyngalized vowels adjacent to Arabic velarized and pharyngalized consonants, and suggests new phonological rules to cover these features. In this study we are looking for acoustic evidence of how far this emphatic feature prevails over the syllables of the word in Arabic.

Experimental procedure and measurement criteria
Monosyllabic, bisyllabic and trisyllabic words were recorded by two native speakers of Iraqi spoken Arabic (hereafter ISA) directly to a PC using the Speech Station 2 software (copyright 1995), with a microphone connected to the audio card and background noise cut down. The words were embedded in a carrier sentence: /iktibuu -------- sit mar'raat/ 'Write -------- six times'. The minimal pairs were repeated six times in a continuous recording session.
The two informants (male and female) are in their twenties, have no history of speech pathology, and represent educated Iraqi Arabic (Hassan, 1981). The files were then transferred to the Praat software, version 4.2.21 (copyright 1992-2004), which was found more precise and thorough as far as formant measurements are concerned. The formant measurement criteria are mainly based on our vocoid duration criteria (Hassan, 2002) and on the segmentation criteria of Fant (1958, 1960) and Peterson and Lehiste (1960). Vocoid segments were identified by positioning a cursor at time points in the waveform as well as at the segment onset and offset on spectrograms. Both aspiration and affrication are excluded from the domain of vocoid duration. After identifying the vocoid segment domain, F1 and F2 measurements were obtained by clicking "formants". In this way formant variations are calculated within the specified domain and an average value in Hz is given for the formant concerned. This is believed to give as much precision as possible concerning the formant measurements and their frequency regions. Values of F1 and F2 for each vocoid (V1, V2, ...) were calculated for six tokens, and the value of F1 was then subtracted from the value of F2 for each token. The resultant values, which are indicative of the distance between F1 and F2, for the six tokens of the word concerned were then compared with those of the six tokens of its counterpart using the Mann-Whitney U test, to see how significantly the distances between F1 and F2 differ among the computed tokens. Average values of F2 - F1 for each vocoid of the word concerned, with their probability values, are tabulated in Tables 1, 2 and 3. All values are averages over the tokens of both informants (male and female). Our choice of front unrounded vowels for analysis is due to the fact that the most important parameters affecting the movement and transition of formants are lip rounding and back constriction, as both cause F2 to descend to lower frequency regions and F1 to rise slightly, though to different degrees. Lip rounding is excluded from the analysis in order to see how far the constriction in the velo-pharyngeal region alone affects the formant structure. No measurements of F3 were made, as no conclusive finding has been reported in the literature concerning the relationship between velo-pharyngeal constriction and F3 lowering. Any formant structure and/or formant transitions in the regions of contoid patterns are excluded from the measurements and analysis, as it is generally believed that the acoustic parameters affecting vocoids could be the same as those affecting contoid patterns, particularly where the emphatic feature is concerned, since it is still controversial whether the consonant or the vowel is the main domain of the emphatic feature (Hassan, 1981).
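The measurement pipeline just described (average F1 and F2 over each hand-labelled vocoid, take F2 - F1 per token, then a two-tailed Mann-Whitney U test over the six tokens of each member of a minimal pair) can be sketched in Python. The file names and interval boundaries below are placeholders, and parselmouth is used here only as a scriptable stand-in for the interactive Praat measurements reported in the paper.

# A sketch of the F2 - F1 measurement and test procedure described above.
# File names and vocoid boundaries are placeholders; the study's actual
# measurements were made interactively in Praat 4.2.21.
import numpy as np
import parselmouth                      # Python interface to Praat
from scipy.stats import mannwhitneyu

def mean_f2_minus_f1(wav_path, onset, offset):
    """Average F2 - F1 (Hz) over a hand-labelled vocoid interval."""
    snd = parselmouth.Sound(wav_path)
    formants = snd.to_formant_burg(time_step=0.01, maximum_formant=5500)
    times = [t for t in formants.xs() if onset <= t <= offset]
    f1 = np.array([formants.get_value_at_time(1, t) for t in times])
    f2 = np.array([formants.get_value_at_time(2, t) for t in times])
    return float(np.nanmean(f2 - f1))    # mean F1-F2 distance over the vocoid

# Six tokens per member of a minimal pair (placeholder paths and times).
emphatic = [mean_f2_minus_f1(f"emphatic_seef_{i}.wav", 0.10, 0.25) for i in range(6)]
plain = [mean_f2_minus_f1(f"plain_seef_{i}.wav", 0.10, 0.25) for i in range(6)]

# Two-tailed Mann-Whitney U test on the six F2 - F1 values per word,
# as used for the P values reported in Tables 1-3.
u, p = mannwhitneyu(emphatic, plain, alternative="two-sided")
print(f"U = {u}, p = {p:.3f}")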
Results and conclusions

As can be seen from Table 1, where the vowel is either preceded or followed by an emphatic consonant in monosyllabic words, the distance between F1 and F2 is strongly affected by the velo-pharyngeal constriction: the narrowing between F1 and F2 is highly significant regardless of whether the vowel follows or precedes the emphatic consonant.

Table 1. Average values in Hz of F2 - F1 of vowels in monosyllabic words containing emphatic vs. non-emphatic consonants, with P values (Mann-Whitney U test, 2-tailed).

#   Monosyllabic words   F2 - F1   U-test (2-tailed)
1.  /sˤeef/              1042      0.002, P < 0.01
    /seef/               1571
2.  /faadˤ/              582       0.002, P < 0.01
    /faad/               896

This is very much in agreement with previous studies (Hassan, 1981). Table 2, however, shows that not only the vowels directly preceded or followed by emphatic consonants are affected by the emphatic feature, but also those in the second syllable, following or preceding the syllable containing the emphatic consonant. It is true that V2, where the emphatic consonant is in the first syllable or in word-medial position, showed a lower level of significance, but the influence of the emphatic is still clearly accounted for. That is perhaps why in /'faatˤir/ both V1 and V2 showed an equal level of significance, the emphatic being in word-medial position. Similarly, in the third word in the table, where the emphatic is in word-final position, V1 shows a lower level of significance than V2, as V2 is nearer to the emphatic than V1. Nevertheless, the vowels in all bisyllabic words showed significant narrowing between F1 and F2, and the emphatic feature is seen as prevailing over both syllables of the words.

Table 2. Average values in Hz of F2 - F1 of V1 and V2 in bisyllabic words containing emphatic vs. non-emphatic consonants, with U-test values showing the level of significance between the means of six tokens for each word.

#   Bisyllabic words    V1               V2
1.  /'sˤaaʡib/          665              1492
    /'saaʡib/           1492             1661
    U-test (2-tailed)   0.002, P < 0.01  0.04, P < 0.05
2.  /'faatˤir/          602              690
    /'faatir/           840              1161
    U-test (2-tailed)   0.002, P < 0.01  0.002, P < 0.01
3.  /'raakidˤ/          747              898
    /'raakid/           824              1829
    U-test (2-tailed)   0.03, P < 0.05   0.001, P < 0.01

In Table 3, however, the picture is slightly different for the trisyllabic words. In the first word, /tˤa'baaʃiir/, where the emphatic consonant is in the first syllable in word-initial position, the influence is hardly noticeable in the third syllable, where the difference is statistically insignificant. This is, surprisingly, in total agreement with the findings of Ali and Daniloff (1972), who found no difference in tongue position or articulatory behaviour for /r/ in the two words of the single minimal pair they investigated. However, the picture becomes compatible with Table 2 again for the second and third minimal pairs. In /fa'sˤaaʡil/ and /fa'saaʡil/, for example, all vowels, and hence all syllables, showed a significant difference, though with slightly lower significance in V3.

Table 3. Average values in Hz of F2 - F1 of V1, V2 and V3 in trisyllabic words containing emphatic vs. non-emphatic consonants, with U-test values showing the level of significance between the means of six tokens for each word.

#   Trisyllabic words   V1               V2               V3
1.  /tˤa'baaʃiir/       610              720              1818
    /ta'baaʃiir/        988              850              1824
    U-test (2-tailed)   0.002, P < 0.01  0.009, P < 0.01  0.8, P > 0.05
2.  /fa'sˤaaʡil/        558              627              1643
    /fa'saaʡil/         1008             931              1776
    U-test (2-tailed)   0.002, P < 0.01  0.002, P < 0.01  0.02, P < 0.05
3.  /fa'raaʡidˤ/        536              500              806
    /fa'raaʡid/         695              683              1245
    U-test (2-tailed)   0.002, P < 0.01  0.002, P < 0.01  0.002, P < 0.01

Figure 1. Frequencies in Hz plotted against vocal tract length (glottis-to-lips distance), where the values in Hz of F1 and F2 for V1, V2 and V3 of the six tokens of /fa'raaʡidˤ/ vs. /fa'raaʡid/ are plotted to show how the formants move from a front constriction to a velo-pharyngeal constriction.
What is more interesting is that in /fa'raaʡidˤ/ and /fa'raaʡid/ all vowels showed the same high significance, despite the fact that the emphatic consonant is in the last syllable in word-final position. It is speculated here that the velo-pharyngeal constriction is assumed in anticipation of the emphatic consonant in word-final position. This is in sharp contrast to Odisho's (1975) assertion that the primary stricture must be properly retained while the rearward gesture is executed, and in line with Hassan's (1981) myodynamic finding of an earlier execution of the secondary constriction in vowels preceding an emphatic consonant, which was seen as the main reason behind the longer duration of vowels before emphatic consonants than before non-emphatic ones. This may also clarify Mitchell's assertion, cited in Ali and Daniloff (1972) (personal communication), that emphasis in Arabic has no predetermined domain: it may be referable to one, two or three syllables. It may also bear out Hassan's (1981) assertion that it is still uncertain whether the vowel or the consonant is the main domain of the emphatic feature of Arabic.

Figure 1 shows a clear descent of F2 from higher frequency regions for V1, V2 and V3 in /fa'raaʡid/ to lower regions in /fa'raaʡidˤ/ for all six tokens of each. However, F1 does not show as much elevation as the lowering of F2. This is in line with many studies that confirm the importance of F2 lowering over F1 raising in the identification of the pharyngeal constriction (Watson, 2002). Nevertheless, the figure shows very clear narrowing of F1 and F2 for all the vowels, and hence all the syllables, of /fa'raaʡidˤ/ compared with those of /fa'raaʡid/. Another interesting measurement concerns the word duration of /fa'raaʡidˤ/, which was found to be 613 ms. This is taken to represent the duration of the emphatic feature as well, but the feature may well last longer if the word is followed by another word containing an emphatic consonant. A more extensive study is needed to investigate this aspect in more detail. In consequence, it is plausible to conclude from the above discussion that the emphatic feature extends beyond the emphatic consonant and may well prevail, though at different levels of significance, over one, two or three syllables, no matter whether the emphatic is in word-initial, word-medial or word-final position. This acoustic evidence can be added to other myodynamic, articulatory and perceptual cues that work in synergy to show how far this emphatic feature prevails over a multisyllabic word in Arabic.
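The pattern in Table 3 for the /fa'raaʡidˤ/ vs. /fa'raaʡid/ pair can be visualised with a few lines of matplotlib. This sketch simply re-plots the tabulated F2 - F1 averages; it is not the procedure used to produce Figure 1.

# Re-plotting the Table 3 averages for the /fa'raaʡid/ pair to visualise
# the F2 - F1 narrowing across all three syllables of the emphatic word.
import matplotlib.pyplot as plt

vowels = ["V1", "V2", "V3"]
emphatic = [536, 500, 806]    # F2 - F1 in Hz, /fa'raaʡidˤ/ (Table 3)
plain = [695, 683, 1245]      # F2 - F1 in Hz, /fa'raaʡid/ (Table 3)

plt.plot(vowels, plain, "s--", label="non-emphatic /fa'raaʡid/")
plt.plot(vowels, emphatic, "o-", label="emphatic /fa'raaʡidˤ/")
plt.ylabel("F2 - F1 (Hz)")
plt.title("Emphatic narrowing over a trisyllabic word")
plt.legend()
plt.show()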
Acknowledgements

I would like to thank Mohammed and Day Majeed for acting as informants for the speech material, and Raghad Majeed for her help in bringing Figure 1 to its final shape.

References

Abercrombie D. (1967) Elements of General Phonetics. Edinburgh: Edinburgh University Press.
Ali L. H. and Daniloff R. E. (1972) A cinefluorographic-phonological investigation of emphatic sound assimilation in Arabic. Proceedings of the 7th International Congress of Phonetic Sciences, Montreal 1971, 639-648. The Hague: Mouton.
Fant G. (1958) Modern instruments and methods for acoustic studies of speech. Proceedings of the 8th International Congress of Linguistics, 282-358. Oslo: Oslo University Press.
Fant G. (1960) Acoustic Theory of Speech Production. The Hague: Mouton.
Firth J. R. (1957) Sounds and prosodies. In Papers in Linguistics, 121-138. London: Oxford University Press.
Hassan Z. M. (1981) An experimental study of vowel duration in Iraqi Spoken Arabic. Unpublished Ph.D. thesis, Department of Linguistics and Phonetics, University of Leeds, UK.
Hassan Z. M. (2002) Gemination in Swedish and Arabic with a particular reference to the preceding vowel duration: an instrumental and comparative approach. Proceedings of Fonetik 2002, TMH-QPSR 44, 81-85.
Hyman L. M. (1975) Phonology: Theory and Analysis. New York: Holt, Rinehart and Winston.
Odisho E. Y. (1975) The phonology and phonetics of Neo-Aramaic as spoken by the Assyrians in Iraq. Unpublished Ph.D. thesis, Department of Phonetics, University of Leeds.
Peterson G. E. and Lehiste I. (1960) Duration of syllable nuclei in English. Journal of the Acoustical Society of America 32(6), 693-703.
Watson J. C. E. (2002) The Phonology and Morphology of Arabic. Oxford: Oxford University Press.

Athens 2006 ISCA Workshop on Experimental Linguistics

Antonis Botinis 1,2, Christoforos Charalabakis 1, Marios Fourakis 3 and Barbara Gawronska 2
1 Department of Linguistics, University of Athens, Greece
2 School of Humanities and Informatics, University of Skövde, Sweden
3 Department of Speech and Hearing Science, The Ohio State University, Columbus, USA

Abstract

This paper refers to the forthcoming ISCA (International Speech Communication Association) Workshop "Experimental Linguistics" in Athens, Greece, in 2006. The major objectives of the Workshop are (1) the development of experimental methodologies in linguistic research with new perspectives for the study of language, (2) the unification of linguistic knowledge in relation to linguistic theory based on experimental evidence, (3) the design of multifactor linguistic models and (4) the integration of interdisciplinary expertise in linguistic applications. Key knowledge areas ranging from cognition and neurophysiology to perception and psychology are organised in invited lectures as well as in oral and poster presentations, along with interdisciplinary panel discussions.

Background

The present paper refers to the forthcoming workshop "Experimental Linguistics", which is an ISCA (International Speech Communication Association) interdisciplinary workshop, to be held in Athens, Greece, in 2006 (workshop details along with paper submission procedures will be announced later this year). The workshop is organised under the joint aegis of the University of Athens, Department of Linguistics, Greece, The Ohio State University, Department of Speech and Hearing Science, USA, and the University of Skövde, School of Humanities and Informatics, Sweden.

The scientific study of language is the backbone of a variety of established disciplines, among them theoretical linguistics, experimental phonetics, computational linguistics and language technology. Language is a complex code system, the study of which is related to a variety of knowledge areas ranging from cognition and neurophysiology to perception and psychology. The code system of language consists of functional categories in variable combinations and relations with multiple interactions, which determine the linguistic structure and the communicative function of language. The study of language necessarily involves a variety of scientific fields associated with different aspects of linguistic knowledge.
Thus, although language is commonly studied within the context of specific questions and targets of investigation, it is nevertheless a multifactor complex code with distinctive interactions between factors. Variability in any factor may have a serial effect on other factors which determine both the linguistic structure and the communicative function of language.

Theoretical linguistics is probably the most widespread discipline of linguistic inquiry. Specification of abstract linguistic categories and linguistic structures related to language functions in established theoretical frameworks leads to analytical methodologies and theoretical hypotheses, which define the type of linguistic knowledge. Experimental phonetics runs a course parallel to theoretical linguistics. Specification of discrete phonetic correlates of abstract phonetic categories and phonetic structures related to language functions in sophisticated laboratory environments leads to analytical methodologies and theoretical hypotheses, which define the type of phonetic knowledge. Computational linguistics focuses on data specification and data processing against the background of different theories and experimental methodologies in language applications. Language technology is also a language applications discipline, with interdisciplinary anchoring in, among others, theoretical linguistics, experimental phonetics, computational linguistics and computer science.

In general, linguistic knowledge is related to a variety of established disciplines. This knowledge is, however, fragmented into different areas with different objectives, typically leading to tailor-made specific theories and variable trial-and-error methodologies. Although much progress has been achieved in integrating interdisciplinary research and applications, much still lies ahead, especially with reference to the development and use of experimental methodologies with theoretical impact and the unification of linguistic knowledge.

Objectives and perspectives

The major objectives of the Workshop are (1) the development of experimental methodologies in linguistic research with new perspectives for the study of language, (2) the unification of linguistic knowledge in relation to linguistic theory based on experimental evidence, (3) the design of multifactor linguistic models and (4) the integration of interdisciplinary expertise in linguistic applications.

Knowledge is a fundamental condition of any science, be it theoretical, experimental or applied. How, however, can we define knowledge? And how much do we know about language? Assuming that knowledge is the ensemble of existing hypotheses in a scientific area, we may proceed with the very basic hypothesis about the nature of language and its typical characteristics. Language is the verbal means of communication mainly between speaker(s) or writer(s), i.e. senders, and hearer(s) or reader(s), i.e. receivers. Assuming further that written language is a conventional representation of spoken language, we will hypothesise about the latter. The intent to communicate through language gives rise in the sender to abstract concepts, which are coded and sent through air or compatible transmission means as physical signals. These signals reach the receiver, are decoded and interpreted, and the message (in an ideal case, the message intended by the sender) is perceived; a communicative act thus takes place.
A communicative act may however be a complex process with variable interactions between sender and receiver. What is going on inside the sender's and receiver's heads respectively? And what is the relation between the physical signal and the perception of language? Assuming that the transfer of meanings is the basic function of language, where is meaning derived from? How much do we know about these questions and how can we increase our knowledge? Although intensive research has added to our knowledge regarding these questions, much is still required before we can successfully deal with basic aspects of these and similar questions. Nevertheless, we do have a considerable amount of scientific knowledge about language to consider and from which to make some plausible assumptions.

In order to be perceived, the intended message has to be produced, and the outcome of this production is, in spoken communication, an acoustic signal. This signal is the end result of motor commands from the brain to, and coordinated control of, the speech production mechanism. The acoustic signal is processed and decoded by the auditory system and the perceptual mechanism, which ultimately extract the intended meaning. Consequently, the mutual relation between acoustic signal and intended meaning may be considered the very basic functional anchoring of linguistic structure and language communication.

There are however several discrepancies between acoustic signal and intended meaning, the most characteristic of which are the continuous vs. discrete, as well as the one-to-many, relationships between the two. Thus, the acoustic signal is basically a continuous process, whereas the intended meaning is a structural unit which consists of discrete functional categories. On the other hand, some variations of the acoustic signal, even large ones, may have no functional effect, whereas other variations, even small ones, may have critical effects on the transmission of intended meanings. Also, the same parameter of the acoustic signal may have different effects, whereas different acoustic parameters may have the same effect on intended meanings in different contexts. A typical example is segmental categories, where duration or aspiration parameters may independently or in combination determine variable distinctions, such as in stop consonants, and may have critical effects on produced words and thus meanings in a variety of languages. Duration may determine several distinctions, such as lexical stress and other prosodic categories, which also have critical effects on produced meanings. Sentence types and intonation forms are also typical examples in which, independently or in combination with other linguistic markers, dissimilar intonation forms may define the same sentence type and, inversely, different sentence types may be defined by similar intonation forms in different contexts.

In relation to the question above, "where is meaning derived from?", if we hypothesise that meaning is basically derived from the acoustic signal, we are consequently led to the question "how is meaning derived?". Linguistic theories have historically posited a variety of functional linguistic units, such as phonemes, morphemes, phrases and so forth, into which the acoustic signal is presumably organised. However, how much are these units determined by the acoustic signal and how much by psycholinguistic processes and knowledge of a particular language?
For example, is the segmentation of the continuous acoustic signal into discrete words determined by acoustic effects or by knowledge of the language and the corresponding words? Alternatively, are some functional linguistic units determined by acoustic effects and others by knowledge of the language? Phonetic research on both production and perception aspects of these and similar questions is most indicative, but further progress is definitely needed before we have basic psycholinguistic knowledge in these areas.

Each and every acoustic signal is a unique event, whereas any produced meaning may in principle be infinitely reproduced by any speaker of a given language. Thus, no acoustic signal may be identically reproduced in strict physical terms by different or even the same speaker. In addition to linguistic meaning, the acoustic signal may however be related to a large variety of other functions, such as paralinguistic and extralinguistic ones with reference to e.g. the speaker's emotional state, age or gender. Consequently, relating the acoustic signal to intended meaning is a complex process, during which the acoustic signal is subjected to a variety of functional filterings in relation to distinctive linguistic categories and structures. In this respect, language is presumably organised in different levels of abstraction involving multiple interactions between the acoustic signal and intended meaning.

Linguistic theory studies, as a rule, language with reference to several linguistic components and key areas, such as phonology, morphology and syntax, in isolation rather than in interaction. Although isolation methodology has its own merits in linguistic analysis, linguistic communication functions as a whole, with multiple interactions between categories and structures of different components, which have, in addition, variable overlapping functions. In languages with inflectional morphology, for example, case variability distinctions may have a series of functional effects not only on morphology proper but also on syntax, such as subject and object specifications, and ultimately on the semantic interpretation of produced utterances.

Interaction analysis and integration perspectives between linguistic categories and linguistic structures in the study of language have generally not drawn particular attention, and thus little is known about the contribution of different components and linguistic categories to linguistic communication. Although different components have historically been the object of intensive discussions, basic questions about the effects of morphology or syntax on the language communication process have only cursorily been addressed. In the vein of the functional anchoring between acoustic signal and produced meaning, isolation methodology, despite its theoretical simplicity and practical merits, may lead to serious inadequacies in the study of language and linguistic theory. Thus, although semantics, for example, is related to all linguistic categories and structures and to the variability of the acoustic signal, it is most usually studied with reference to selective functional variability in the lexicon and syntax. On the other hand, although phonetics studies the functional variability of the acoustic signal, it most usually concentrates on the phonetic correlates of phonetic categories and phonetic structures, with no attempt to integrate phonetic functioning in general linguistic theory.
Similarly, linguistic theory does not usually involve production and perception mechanisms, for which phonetics has an established expertise and sophisticated experimental methodologies. Produced meanings are not however defined unless the situational context is specified, and thus the pragmatics of linguistic communication in relation to the acoustic signal is another dimension of the link and interactions between acoustic signal and produced meanings. The unification of linguistic knowledge and the overall integration of theoretical linguistics and experimental phonetics seems the most realistic and promising enterprise in the study of language and the development of linguistic theory.

Coming back to the major objectives of the Workshop, the use of experimental methodologies in linguistic research, similar to the ones already established in phonetic research, may be a promising start on the road towards new methodologies for and new perspectives on the study of language. Standard scientific criteria may thus be applied to linguistic analysis and the development of linguistic theory, based on solid premises provided by experimental evidence. Promotion and unification of linguistic knowledge will provide an interactive background to the forms and functions of different components, which will pave the way to the design of multifactor linguistic models to be applied in a variety of language applications with interdisciplinary dimensions.

Organisation

The Workshop is organized into invited speaker lectures, original research presented in oral and poster presentations, and interdisciplinary panel discussions. Both oral and poster papers will undergo standard peer review by independent reviewers of the International Scientific Committee, and a post-workshop volume is planned with representative papers of key knowledge areas, in addition to the published proceedings, as is common practice for ISCA Workshops. In accordance with the objectives and perspectives of the Workshop, key knowledge areas and language applications are shown in Table 1.

Table 1. Key linguistic areas and language applications of the Athens 2006 ISCA Workshop on Experimental Linguistics.

1. Theory of language
2. Cognitive linguistics
3. Neurolinguistics
4. Speech production
5. Speech acoustics
6. Phonology
7. Morphology
8. Syntax
9. Prosody
10. Speech perception
11. Psycholinguistics
12. Pragmatics
13. Semantics
14. Discourse linguistics
15. Computational linguistics
16. Language technology

Key linguistic areas and language applications are organized into linguistic domains, rather than by theoretical premises and analysis methodologies, in order to allow for interdisciplinary approaches and interaction perspectives. Special attention is paid to the way we see and study language with reference to general linguistic theory, and to the relation and interaction between phonetic production and produced meaning. Extensive discussions of the theoretical assumptions of key linguistic areas as related to models and experimental methodologies will be carried out. Plenty of room will also be provided for discussion of the nature of linguistic training and the requirements for linguistic education in a variety of scientific contexts, in both theory and practice.

Outlook

Language has been studied throughout human history with various, sometimes overlapping, perspectives and objectives, and with various applications in mind.
The pursuit of linguistic knowledge has been the driving force behind the study of language, as one of the most defining characteristics of human beings. Language has been the primary means of communication in human societies, and several language applications have been critical to the development of human societies as we know them. Among these applications, three have marked the route of human societies. First, early insights into basic forms and functions of phonetic systems led to the development of writing systems and the acquisition of reading and writing skills as part of educational systems, probably the most fundamental language application. Second, basic knowledge of voice characteristics and voice signal transmission led to the development of telephone systems and distant voice communication. Third, the growth of information technology and the advent of the Internet paved the way for a variety of language applications. Thus, in our days, linguistic research and language technology set common goals and joint efforts towards multifunctional language applications. However, the main precondition for the achievement of these goals and the fruition of these efforts is that solid theoretical knowledge meets basic scientific criteria and reflects linguistic reality. As every era sets its conditions, our era sets additional requirements and alternative methodologies for the study of language. A new generation of linguists is to be educated and equipped with a rich arsenal of experimental methodologies and new perspectives in linguistic research, and the present Workshop is organised in order to discuss and set the groundwork for this in an interdisciplinary context.

A positional analysis of quantity complementarity in Swedish with comparison to Arabic

Zeki Majeed Hassan 1 and Barry Heselwood 2 *
1 Department of Linguistics, University of Gothenburg, Gothenburg, Sweden
2 Department of Linguistics and Phonetics, University of Leeds, Leeds, UK
* Alphabetical order.

Abstract

The most favoured solution to the problem of quantity complementarity in Swedish has been to claim that vowel length is phonemic and consonant length is predictable (Linell, 1978). Evidence from listeners' perceptual behaviour supports this over the reverse claim that it is only consonant length which is distinctive (cited in Czigler, 1998), a position that has nevertheless been argued for (Eliasson & La Pelle, 1973). However, there is a phonological cost: the vowel inventory must be doubled. We present an analysis based on positional criteria to account for the phonetic facts reported in instrumental studies such as Czigler (1998), Hassan (2002) and Strangert & Wretling (2003), without the cost of additional phonemes. It takes Trubetzkoy's (1969) 'correlation of syllable contact' and develops it according to more recent functionalist principles of phonotactic analysis (Mulder, 1989; Dickins, 1998). Vowel and consonant length are predicted by whether there is a consonant in the phonotactic position immediately following the syllable nucleus. Quantity complementarity in Swedish is compared to vowel and consonant length in Arabic and shown to bear out Hassan's (2003: 48) assertion that the phenomenon of length 'constitutes a systematic difference between the phonological systems of both languages'.

Duration and length in Swedish stressed syllables

In Swedish, stressed syllables¹ can be of three types regarding the content of their rhyme structure.
This can be expressed as 'Danell's formula' (Witting, 1977):

A) long vowel only, e.g. se 'see'
B) short vowel + long consonant, e.g. hatt 'hat'
C) long vowel + short consonant, e.g. hat 'hate'

It is types B) and C) which have intrigued phoneticians and phonologists, because of the relationship between the vowel and its following consonant. From a phonetic point of view, the relationship is one of inverse co-variation of duration. Because the durational differences for both the consonants and the vowels are well above the absolute difference limens (Hassan, 2002), we can say that we are dealing with length differences as well as durational differences. We therefore have inverse co-variation of phonological length: when the vowel is long, the consonant is short, and vice versa, although there is considerable variation in the degree of complementarity cross-dialectally (Strangert & Wretling, 2003). As well as a long consonant, a cluster of two short consonants can follow a short vowel. Examples are given in Table 1. Throughout this paper, discussion and exemplification will be restricted to monosyllabic forms in order to avoid the issue of where the syllable boundary is in forms such as Swedish titta 'to look' or Arabic kattab 'he made somebody write'. Also excluded are forms with a cluster of three consonants, such as svensk, tjänst, on the assumption that the final consonant is outside the domain of quantity complementarity. Morphological boundaries are ignored.

Table 1. Inverse co-variation of vocalic and consonantal length in Swedish stressed syllables.

Long vowel + short consonant    Short vowel + long consonant
hat [hɑːt] 'hate'               hatt [hatː] 'hat'
lam [lɑːm] 'lame'               lamm [lamː] 'lamb'
väg [vɛːɡ] 'road'               vägg [vɛɡː] 'wall'
köp [ɕøːp] 'buy'                köpt [ɕœpt] 'bought'

As implied by the transcriptions, as well as a difference in vowel length there is also a clearly noticeable difference in vowel quality. Although vowel quality has been found not to be robust as a perceptual cue across all short-long pairs (Behne, Czigler & Sullivan, 1996; 1998, cited in Czigler, 1998), Linell (1978) adduces it as evidence that there is a phonemic opposition between short and long vowels. The quality of consonants in terms of place and manner of articulation is, however, not noticeably affected by length. The taxing question is: how should these phonetic facts be analysed phonologically? There is not space in this paper to discuss the relative merits of all possible answers to this question, but, using hatt-hat as examples, they include the following:

1) phonemic vowel length and phonemic consonant length - /hat:/, /ha:t/;
2) phonemic consonant length with vowel length predictable - /hat:/, /hat/;
3) phonemic vowel length with consonant length predictable (Elert, 1964; Witting, 1977; Linell, 1978; Czigler, 1998) - /hat/, /ha:t/;
4) a singleton-geminate consonant opposition with vowel length predictable (Eliasson & La Pelle, 1973; Eliasson, 1978) - /hatt/, /hat/.

Each of the above analyses seems to account equally well for the phonetic facts, so we need to bring the criterion of economy of description to bear. 1) requires doubling both the vowel inventory and the consonant inventory and can be rejected as an uneconomic solution. 2) requires doubling the consonant inventory but not the vowel inventory.
3) requires doubling the vowel inventory but not the consonant inventory, which is to be preferred over 2) because the vowel inventory is smaller in the first place; it is the solution favoured by most writers on the subject, following Elert (1964), although Linell (1978: 125), an advocate of this solution, admits that vowel length is 'distinctive only before single consonants'. 4) does not require any additional phonemes in the inventories, but does lead to an increase in the complexity of some of the phonological forms that take part in quantity complementarity. For example, the phonological form of hatt must comprise four phonemes instead of three, and therefore falls foul of a phonotactic simplicity metric. All the above analyses involve some difference in phonemic content to account for the quantity complementarity. It is worth considering an alternative approach which accounts for quantity phenomena in prosodic rather than phonemic terms, and that is Trubetzkoy's correlation of syllable contact.

Trubetzkoy's 'correlation of syllable contact' analysis

Trubetzkoy (1969: 199-201) provides a prosodic analysis of quantity complementarity in Swedish instead of the kind of phonemic analyses outlined in (1)-(4) above. That is to say, the phonological difference between the pairs in Table 1, and others like them, is due to a prosodic opposition, not a phonemic opposition. According to Trubetzkoy, postvocalic consonants in stressed syllables in Swedish can relate to the preceding vowel with either close contact or open contact. In the case of close contact, the vowel is predictably shortened and the consonant equally predictably lengthened. In the case of open contact, the vowel is predictably long and the consonant equally predictably short. Both vowel length and consonant length are predictable and hence non-distinctive, occurring as a consequence of, respectively, open and close syllable contact. When there is no following consonant, the situation is as for open contact; it is pertinent here to note that short vowels do not occur in open stressed syllables in Swedish. Trubetzkoy's syllable contact analysis has the advantage of descriptive economy over alternatives (1)-(3) above in that it does not necessitate setting up either long vowel or long consonant phonemes in opposition to short ones. Neither does it require increasing the complexity of phonological forms, as is the case in (4). However, while the phonetic lengthening of vowels in open contact is entirely plausible in order to fill what we can think of as a vacated space, Trubetzkoy's analysis fails to give an adequate explanation of why, in close contact, the postvocalic consonant is lengthened, or of why it is not lengthened when it is part of a cluster of two consonants.

A positional analysis of Swedish quantity complementarity

We propose that it is possible to recast Trubetzkoy's analysis in such a way as to retain its insights and advantages while rendering it more explanatory, and that the way to do this is to account for the distribution of phonemes in Swedish stressed syllables in terms of a frame of phonotactic positions known as a phonotagm (Mulder, 1989; Dickins, 1998). Mulder (1989: 444) explains that a phonotagm is the 'minimum type of structure within which the distribution of cenotactic (natural language: phonotactic) entities can be described completely and exhaustively'.
It comprises a set of positions to which the constituent phonemes of a phonological form can be assigned by functional criteria. Unlike the syllable, realisational properties of phonemes are ignored when assigning phonemes to positions. Quantity complementarity in Swedish stressed syllables can be accounted for by setting up a phonotagm with two post-nuclear positions, which we shall call POST-NUCLEAR1 and POST-NUCLEAR2. Together with the nucleus itself (the 'identity element' of a phonotagm), this is somewhat analogous to the three rhyme 'X-slots' set up for English by Giegerich (1992). The nucleus is always occupied by a vowel phoneme, while the post-nuclear positions can either be phonologically empty or occupied by a consonant phoneme. Examples are given in Table 2.

Table 2. Positional analysis of quantity complementarity (a dash marks a phonologically empty position).

        onset   nucleus   POST-NUCLEAR1   POST-NUCLEAR2
se      /s/     /e/       -               -
hat     /h/     /a/       -               /t/
hatt    /h/     /a/       /t/             -
köpt    /ɕ/     /ø/       /p/             /t/

A generalised realisation statement to the effect that an empty position is filled by the phonetic material from the preceding position accounts for the phonetic facts. Positional analysis gives a coherent phonotactic interpretation to Trubetzkoy's 'syllable contact' analysis. Close contact is equivalent to the consonant occupying POST-NUCLEAR1, and open contact is equivalent to its being in POST-NUCLEAR2. The advantage over Trubetzkoy's analysis is that we can explain the lengthening of the vowel in rhyme types A and C, and the lengthening of the consonant in rhyme type B, in the same terms. It also accounts for why a consonant in POST-NUCLEAR1 does not lengthen if it is part of a cluster, i.e. if there is another consonant in POST-NUCLEAR2. It shares with Trubetzkoy's analysis the advantage of not having to set up additional phonemes in the inventory of Swedish, because it renders both vowel and consonant length predictable and therefore non-distinctive. The opposition between pairs such as hat-hatt etc. is set up purely as a phonotactic difference: the phonological forms comprise the same phonemes, but distributed differently in the phonotactic frame.
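The realisation statement can be made concrete with a small sketch. This is only an illustration of the analysis under the Table 2 assumptions (None stands for an empty position, and an empty position is realised by lengthening the material of the preceding position); it is not part of the authors' formal apparatus.

# Toy rendering of the positional analysis: an empty post-nuclear position
# (None) is "filled" by lengthening the material of the preceding position.
LENGTH = "\u02d0"  # IPA length mark

def realise(onset: str, nucleus: str, pn1, pn2) -> str:
    """Map a phonotagm (onset, nucleus, post-nuclear1, post-nuclear2)
    to a rough phonetic string with length marks."""
    if pn1 is None and pn2 is None:     # rhyme type A: se -> [se:]
        return onset + nucleus + LENGTH
    if pn1 is None:                      # type C, open contact: hat -> [ha:t]
        return onset + nucleus + LENGTH + pn2
    if pn2 is None:                      # type B, close contact: hatt -> [hat:]
        return onset + nucleus + pn1 + LENGTH
    return onset + nucleus + pn1 + pn2   # cluster: both positions filled, all short

frames = {"se": ("s", "e", None, None),
          "hat": ("h", "a", None, "t"),
          "hatt": ("h", "a", "t", None),
          "köpt": ("ɕ", "ø", "p", "t")}
for word, frame in frames.items():
    print(f"{word:5} -> [{realise(*frame)}]")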
Comparison to vowel and consonant length in Arabic

The situation regarding vowel and consonant length in Arabic is rather different and does not lend itself to the analysis proposed above for Swedish. While there is evidence of inverse co-variation of duration between consonants and vowels in the Iraqi Arabic data examined by Ghalib (1984) and Hassan (1981, 2002), whether this extends to inverse co-variation of length is not so clear. According to Hassan (2002), vowel duration differences before singleton and geminate consonants do not significantly exceed difference limen values. In this he is in agreement with Ghalib (1984), who also concluded that such vowel duration differences are negligible. The really important difference between Arabic and Swedish in respect of quantity concerns predictability. Above we presented a positional analysis in which Swedish quantity is predictable on the basis of the contrastive distribution of phonemes within the phonotagm. The reason this analysis works is that there are no stressed syllables in Swedish of the type CV:C:² or CVC. Arabic presents a different picture because these types do exist, in opposition to CVC: and CV:C. In fact, in Arabic it appears that all combinations of short and long vowels and consonants can occur in stressed syllables. Mitchell (1990: 65) gives the minimal pair example /ʕaam/ 'year' and /ʕaamm/ 'public', showing that consonant length is distinctive after a long vowel, and Hassan (2003: 45-6) provides /samm/ 'poison' and /saamm/ 'poisonous' to show that vowel length is distinctive before a long consonant. There is therefore no inverse co-variation of quantity in Arabic: vowel quantity and consonant quantity vary independently, and neither can be predicted from knowing the other. This, we suggest, bears out Hassan's (2003: 48) contention that quantity in Swedish and Arabic 'constitutes a systematic difference between the phonologies of both languages'.

Notes

1. By 'stressed syllable' we mean one that is not unstressed, i.e. it includes secondary stress (or what has been called 'reduced main stress') as well as primary stress.
2. Witting (1977) cites moln 'cloud' as an example of CV:CC to argue that vowel length is not predictable when followed by a cluster, but the pronunciation [mo:ln] is described as a 'regional exception' by Czigler (1998: 23) and marked as an exception by Linell (1978: 123); we therefore discount it.

References

Behne, D.M., Czigler, P.E. & Sullivan, K.P.H. (1996) Acoustic characteristics of perceived quantity in Swedish vowels. Speech Science and Technology '96 (Adelaide), 49-54.
Behne, D.M., Czigler, P.E. & Sullivan, K.P.H. (1998) Perceived vowel quantity in Swedish: effects of postvocalic voicing. Proceedings of the 16th International Congress of Acoustics and the 135th Meeting of the Acoustical Society of America (Seattle), 2963-64.
Czigler, P.E. (1998) Timing in Swedish VC(C) sequences. PHONUM 5, Dept of Phonetics, Umeå University.
Dickins, J. (1998) Extended Axiomatic Linguistics. Berlin: Mouton de Gruyter.
Elert, C.-C. (1964) Phonologic Studies of Quantity in Swedish. Uppsala: Almqvist & Wiksell.
Eliasson, S. (1978) Swedish quantity revisited. In Gårding, E., Bruce, G. & Bannert, R. (eds) Nordic Prosody. Dept of Linguistics, Lund University, 111-122.
Eliasson, S. & La Pelle, N. (1973) Generativa regler för svenskans kvantitet. Arkiv för nordisk filologi 88, 133-148.
Ghalib, G.B.M. (1984) An experimental study of consonant gemination in Iraqi Spoken Arabic. Unpublished PhD thesis, University of Leeds.
Giegerich, H.J. (1992) English Phonology. Cambridge: Cambridge University Press.
Hassan, Z.M. (1981) An experimental study of vowel duration in Iraqi Spoken Arabic. Unpublished PhD thesis, Dept of Linguistics & Phonetics, University of Leeds, UK.
Hassan, Z.M. (2002) Gemination in Swedish and Arabic with a particular reference to the preceding vowel duration: an instrumental and comparative approach. Proceedings of Fonetik 2002, TMH-QPSR 44, 81-85.
Hassan, Z.M. (2003) Temporal compensation between vowel and consonant in Swedish & Arabic in sequences of CV:C & CVC: and the word overall duration. PHONUM 9, 45-48, Dept of Phonetics, Umeå University.
Linell, P. (1978) Vowel length and consonant length in Swedish word level phonology. In Gårding, E., Bruce, G. & Bannert, R. (eds) Nordic Prosody. Dept of Linguistics, Lund University, 123-136.
Mitchell, T.F. (1990) Pronouncing Arabic. Oxford: Clarendon Press.
Mulder, J.W.F. (1989) Foundations of Axiomatic Linguistics. Berlin: Mouton de Gruyter.
Strangert, E. & Wretling, P. (2003) Complementary quantity in Swedish dialects. PHONUM 9, 101-104, Dept of Phonetics, Umeå University.
Trubetzkoy, N.S. (1969) Principles of Phonology.
Berkeley: University of California Press.
Witting, C. (1977) Studies in Swedish Generative Phonology. Stockholm: Almqvist & Wiksell.

Author index

Asu, Eva Liina 29
Ayusawa, Takako 103
Bannert, Robert 75
Bjursäter, Ulla 55
Blomberg, Mats 51
Bodén, Petra 37
Bonsdroff, Lina 59
Botinis, Antonis 95, 99, 123, 131
Cerrato, Loredana 41
Charalabakis, Christoforos 131
Edlund, Jens 107
Eklund, Petra 63
Elenius, Daniel 51
Engstrand, Olle 59, 63, 67
Fourakis, Marios 123, 131
Fransson, Linnéa 79
Ganetsou, Stella 95
Gawronska, Barbara 131
Griva, Magda 95
Gunnarsdotter Grönberg, Anna 5
Gustafsson, Kerstin 63
Gustavsson, Lisa 83
Hassan, Zeki Majeed 127, 135
Heselwood, Barry 135
Hincks, Rebecca 45
House, David 107
Huber, Dieter 49
Ivachova, Ekaterina 63
Jande, Per-Anders 25
Jensen, Christian 111
Kadin, Germund 67
Karlsson, Fredrik 71
Karlsson, Åsa 63
Kim, Yuni 9
Klintfors, Eeva 83
Kostopoulos, Yannis 99
Krull, Diana 33
Lacerda, Francisco 55, 83
Lindh, Jonas 17, 21
Nagano-Madsen, Yasuko 103
Nikolaenkova, Olga 99
Nolan, Francis 29
Oppelstrup, Linda 51
Orfanidou, Ioanna 123
Schaeffler, Felix 1
Schötz, Susanne 87
Segerup, My 13
Seppänen, Tapio 119
Skantze, Gabriel 107
Strangert, Eva 79
Stölten, Katrin 91
Sundberg, Ulla 55
Themistocleous, Charalabos 99
Thorén, Bosse 115
Toivanen, Juhani 119
Tøndering, John 111
Väyrynen, Eero 119