MTRANS: A multi-channel, multi
Transcription
MTRANS: A multi-channel, multi
M TRANS: A multi-channel, multi-tier speech annotation tool Julián Villegas1 , Martin Cooke1,2 , Vincent Aubanel1 , and Marco A. Piccolino-Boniforti3 1 2 Ikerbasque (Basque Science Foundation), Spain Language and Speech Laboratory, Universidad del Pais Vasco, Spain 3 Dept. Linguistics, Univ. Cambridge, UK [email protected],[email protected],[email protected],[email protected] Abstract M TRANS, a freely available tool for annotating multi-channel speech is presented. This software tool is designed to provide visual and aural display flexibility required for transcribing multi-party conversations; in particular, it eases the analysis of speech overlaps by overlaying waveforms and spectrograms (with controllable transparency), and the mapping from media channels to annotation tiers by allowing arbitrary associations between them. M TRANS supports interoperability with other tools via the Open Sound Control protocol. Index Terms: speech tools, multi-channel speech annotation, conversation analysis. 1. Introduction The analysis of multi-channel speech material such as that available in the ICSI [1] and AMI [2] corpora, or recordings which result from exploring the effect of background conversations on foreground conversations [3] often requires the transcription and annotation of recordings of more than two talkers, here denoted ‘multi–channel’ conversations. While some tools can handle multi-channel media, they rarely possess the flexibility in audio output and visual display required when transcribing and annotating multi-channel conversational speech. Overlaps, frequent in natural conversations, are particularly interesting, and their analysis would benefit from support for visual signal separation in both time and frequency domains. A further issue is the mapping from annotation tiers to channels, where typical one-to-one assumptions are too restrictive in cases where the tier corresponds to events, such as inter-turn gaps, which are only meaningful with reference to more than one channel. To support rapid and flexible annotation of multi-channel speech with many annotation levels, we have developed MTRANS, a free and open source tool available from www.laslab.org/tools/mtrans. 2. Tools and Frameworks Tools such as Praat [5] (probably the most widely used software for speech analysis), Speech Filing System [6], or arguably ® MATLAB [4] (via its Signal Processing Toolbox), offer solutions to speech-signal analysis including modules for capturing, displaying, editing, annotating, and analysing media. These powerful tools also benefit from user-contributed plug-ins and libraries such as the ‘voicebox’ collection of MATLAB functions [7]. In some cases, these applications are forced to compromise between flexibility and generality yielding them somewhat cumbersome for tasks required in the analysis of multichannel speech. Numerous software applications for editing signals can be used for conversational analysis. A complete review of these applications is beyond the scope of this article: only the most relevant tools will be discussed.1 Anvil [8], a video-oriented annotation tool, integrates multimodal information including motion capture data. The CLAN programs [9] adhere to the CHAT transcription convention, and are useful for frequency counts, co-occurrence and interactional analyses. E LAN [10] is a multimedia (video and audio) annotation tool supporting multiple channel video. The EMU Speech Database System [11] allows signal editing, hierarchical annotation handling and automatised signal modification via batch processes. Being database oriented, large corpora can be analysed in detail and through its interface with R it is possible to perform sophisticated statistical analyses and create figures in an integrated environment. E X MAR a LDA [12], through its Partitur-Editor allows multiple level annotation of single audio and video files. Folker [13] shares many similarities with E XMARaLDA’s Partitur-Editor. LaBB CAT (a.k.a. ONZE Miner) [14], offers monochromatic spectrogram display (indirectly) for single channel analysis. The commercial application Transana [15] is another video-annotating oriented tool which is simple to use and works with single sources. Transcriber [16], designed specifically for the annotation of broadcast news recordings, supports multiple media channels via the ‘channelTrans’ extension.2 Note that there is a homonymous version of Transcriber3 currently not maintained, which allows monophonic transcription of audio files, with spectrogram, signal, and energy plots. Wavesurfer [17] supports multi-channel audio analysis and spectrogram view; its development, stopped for about six years, has been recently resumed. This tool is frequently used, e.g., EMU signal editing functionality is built upon Wavesurfer. The aforementioned applications provide the user with common features such as time-alignment, basic signal processing (intensity control, lateralisation, etc.), basic editing (cut, copy, and paste), interfaces to other programs and several import/export formats. Table 1 summarises some of these software features. Handling multi-channel audio is achieved in some applications (e.g., EMU and Praat) by synchronously rendering several media files in independent windows. Such a mechanism creates flexibility and scalability problems. 1 A representative list of annotation tools can be found at annotation.exmaralda.org/index.php/ Linguistic_Annotation 2 www.icsi.berkeley.edu/Speech/mr/channeltrans 3 www.isip.piconepress.com/projects/speech/ software/legacy/transcriber Figure 1: M TRANS interface. This example has four audio channels of which two spectrograms and 14 tiers are shown. Table 1: Some applications related to MTRANS (all are freely available, multi-platform, and feature waveform representation.) Anvil C LAN E LAN E MU E XMARaLDA Folker LaBB - CAT M TRANS Praat S FS Transcriber Wavesurfer spectrogram other media multi- open channel source scriptable No No No Yes No No Yes Yes Yes Yes No Yes Video, motion Video Video jaw, tongue Video No No Partial No No No No Yes No Yes No No No No Yes Partial No Yes Yes No Yes Yes Yes No No No Yes Yes Yes Yes Yes No Yes Yes Yes No No Yes Yes Yes Yes Yes Yes 3. M TRANS philosophy One of the goals in creating MTRANS was to cover the gap left by existing applications in the analysis of natural, real-world, multi-party conversations (rich in overlaps) by allowing the visual inspection of overlaid signals in different representations (spectrographic and waveform). In its design, the following features were considered important: Ease of experimentation: As an example, spectrogram parameters (e.g., maximum frequency, frequency resolution, energy range, and individual signal transparency) can be adjusted continuously in real-time. This fine control facilitates visual speech analysis which often demands tuning of display parameters dependent on the speaker, task and segment of speech material currently under analysis. Flexibility: Arbitrary tier grouping and ordering allows a logical relationship between tier type and channels (e.g., overlaps between channels) and the determination of which signals are required for playback. Tier type and tier association provides useful semantic information in subsequent processing of annotation files. Information filtering: To better focus on a particular task, allows the user to display/hide information such as individual tiers, tier groups, waveforms, annotation, and spectrograms. Aurally, one can lateralize each audio channel as well as increase its relative sound intensity (reflected in its spectrogram transparency). MTRANS Interoperability: M TRANS specialises in creating annotations from multiple audio channels but it is capable of sending real-time transport information (e.g., playback starting point, segment duration, and speed rate) to other applications via the Open Sound Control (OSC) protocol.4 Provided that OSCenabled clients exist, there is no need to switch between tools for a particular task nor to overload MTRANS with peripheral features used infrequently. The use of OSC allows MTRANS to focus on its core functionality while recruiting other specialised tools for tasks such as video playback (see later) with no additional coding. Speed and reactiveness: Regardless of media length, By implementing resizable panels, users can examine speech in more detail, or conveniently shrink them for improved performance. Since only the visible spectrogram pixels are computed at given time, reducing the size of this panel increases overall performance. MTRANS is designed to be fast. 4 opensoundcontrol.org 4. Main Features M TRANS runs on platforms supporting MATLAB (such as Windows, Mac OS X and Linux) and it is extensible via MATLAB scripts. It consists of two windows: the main display and a control & search panel. Table 2 summarises the main features of M TRANS. Table 2: M TRANS features Browsing Audio channel lateralization and volume control, transparency control, several levels of zooming, resizable panels, overlaid hue-encoded spectrograms and waveforms, text and regular expression search, help via tool-tips. Interoperation Import/Export tiers to Praat or HTK annotation files. real-time inter-operation with OSC-enabled software. Supports tier lexica to ensure uniformity amongst transcribers. Automatic generation of overlap and turn tiers as the starting point for manual correction. Navigation Several ways to move between annotations (skip to the previous/next non-empty annotation, or with the same label, etc.), access annotations through search results. Playback Audio streams are processed with a vocoder to speed control the reproduction speed with no significant formant artefacts. real-time Spectrogram and waveforms are rendered online according to the current parameters, e.g., panel size, gain, transparency, zoom, frequency resolution. (a) monochromatic version (b) hue-encoded Figure 2: Hue-encoded spectrograms facilitate visual source separation at overlaps and f0 contour visualisation. This feature is especially useful in combination with transparency control in the analysis of overlaps when the signals are very similar. formation improves performance and usage of the visual space. Display control of tiers and tier types is possible via the bottom panel. 4.1. Main Window The resizable panels inside the main window (shown in Fig. 1) correspond to spectrogram, waveform, and tier view of the multiple channels to analyse. Hue-encoded spectrograms, with controlled transparency, facilitate visual source separation at overlaps as shown in Fig. 2. Identifying crosstalk between audio channels (common even when the audio is captured using closetalking microphones) improves the analysis quality, and controlling each channel transparency favours crosstalk detection. Annotations can be created, deleted, or edited in place using standard GUI conventions; their contents can be typed in or selected from user-defined lexica specific to each tier type. In fast annotation mode, a single click inserts annotation boundaries and, optionally, a default symbol. Tiers map to multiple or individual media channels, e.g., in Fig. 1, overlap tiers are associated with two audio channels. By clicking on a tier one can add, remove, and edit it. Support for large numbers of tiers is achieved via grouping, flexible ordering, hiding/display, and colouring options. 4.2. Control and Search Window The control tab (highlighted in left window in Fig. 1) provides a mechanism for selecting what and how to display. The top panel allows online adjustments to, for example, frequency resolution, upper frequency limit, and playback speed. These parameters can be adjusted to permit visual analysis of formants and f0 contours. Through the middle panel users can select what information is displayed for each audio channel including waveform, annotation, spectrogram, etc. Deselecting unwanted in- Figure 3: Search & Browse window (fragment). The resizable bottom panel shows the result tree. Through the Search tab (shown in Fig. 3), users can navigate a set of retrieved annotations by freely selecting them from the result tree or consecutively by using arrow-keys. 4.3. O SC functionality To demonstrate the benefits of this feature, a simple fourchannel video player (shown in Fig. 4) was created. This OSCclient performs synchronised video playback corresponding to the annotation currently being played in MTRANS. For instance, the application can reproduce the video segment at the playback speed specified in MTRANS. M TRANS users can specify server, port, and OSC-bundling (by default localhost, 2012, and individual messages, respectively). If media and OSC-applications are available, one could for example enrich the analysis of conversations by also looking at participants’ video, motion capture, articulatory settings, as well as other synchronous data such as EEG, etc. without necessarily harming each individual application’s performance and Acknowledgements. The authors thank EU Future and Emerging Technology (FET– OPEN) Project LISTA (The Listening Talker) and EU Marie Curie Research Training Network S2S (Sound to Sense) for support and encouragement. 7. References [1] N. Morgan, D. Baron, J. Edwards, D. Ellis, D. Gelbart, A. Janin, T. Pfau, E. Shriberg, and A. Stolcke, “The Meeting Project at I CSI,” 2001. [2] J. Carletta, “Announcing the A MI Meeting Corpus,” The E LRA Newsletter, vol. 11, no. 1, pp. 3–5, Jan. 2006. [3] J. C. Webster and R. G. Klumpp, “Effects of Ambient Noise and Nearby Talkers on a Face-to-Face Communication Task,” J. Acoust. Soc. Am., vol. 34, no. 7, pp. 936–941, 1962. [4] Mathworks, “M ATLAB,” Software, available [Mar. 2011] from www.mathworks.com. Figure 4: Multi-video player synchronised with MTRANS via the OSC protocol (Videos from the AMI corpora [2]). crucially without additional coding. 5. Experience with MTRANS The need for a detailed analysis of multi-channel speech data motivated the development of MTRANS . Speech material came from an experiment on speaking in the presence of competing speech. MTRANS was used to annotate pairs of simultaneous conversations recorded in three sessions of 20 minutes each. Five channels of synchronous audio data (four closetalking microphones and an omnidirectional table microphone) were available during annotation. Annotation involved more than 20 tiers grouped into seven tier types (session structure, speech/non-speech, orthography, interactional events, inter-turn gaps, turn-types, miscellaneous). More than 7000 individual annotation symbols were created, including nearly 1300 events (dysfluencies, alignments, etc.) and more than 2000 turn components (back-channels, interruptions, smooth change, etc.). 6. Further developments M TRANS currently saves transcriptions in plain text files for ease of use in subsequent text processing pipelines, but they could be converted to XML files or stored in a database. Retrieving corpora from central databases via a database connection is also an attractive feature to implement given the current trends in data storage, retrieval and shared documents in the speech community. Much of the information necessary to generate a conversation analysis style of transcription (e.g. following Jefferson’s conventions [18]) is available in MTRANS. This output format will be automatically generated in future versions. LATEX commands can be directly inserted in annotations to produce the most common diacritics and digraphs used with the Latin alphabet, Greek letters, and some phonetic symbols. A full support of other languages and IPA symbols would be available via Unicode encoding (UTF-8) but is yet to be fully implemented in MATLAB. The current version of MTRANS is a hybrid of MATLAB and Java code. Full migration to Java is planned for future versions. So far, MTRANS has OSC server capabilities. Implementing client capabilities is currently under consideration. [5] P. Boersma and D. Weenink, “Praat: doing phonetics by computer,” available [Mar. 2011] from www.praat.org. [6] M. Huckvale, D. Brookes, L. Dworkin, M. Johnson, D. Pearce, and L. Whitaker, “The S PAR Speech Filing System,” in Proc. of European Conf. on Speech Technology, Edinburgh, 1987, pp. 305–308. [7] M. Brookes, “VOICEBOX: Speech Processing Toolbox for M ATLAB ,” Software, available [Mar. 2011] from www.ee.ic.ac.uk/hp/ staff/dmb/voicebox/voicebox.html. [8] M. Kipp, “Spatiotemporal coding in A NVIL,” Proc. Sixth Int. Conf. on Language Resources and Evaluation (LREC-08), 2008. [9] B. MacWhinney, The C HILDES project: tools for analyzing talk, 3rd ed. Mahwah, NJ, USA: Lawrence Erlbaum, 2000. [10] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes, “E LAN: a professional framework for multimodality research,” in Proc. of Language Resources and Evaluation Conference (LREC), 2006. [11] S. Cassidy and J. Harrington, “Multi-level annotation in the Emu speech database management system,” Speech Commun., vol. 33, pp. 61–77, Jan 2001. [12] T. Schmidt, “E XMARaLDA: un système pour la constitution et l’exploitation de corpus oraux ,” in Pour une épistémologie de la sociolinguistique. Actes du colloque Int. de Montpellier 10–12 déc. 2009. Lambert-Lucas, 2010, pp. 319–327. [13] T. Schmidt and W. Schütte, “F OLKER: An Annotation Tool for Efficient Transcription of Natural, Multi-party Interaction,” in Proc. Seventh Conf. on Int. Language Resources and Evaluation (LREC’10), Valletta, Malta, 2010. [14] R. Fromont and J. Hay, “O NZE Miner: the development of a browser-based research tool,” Corpora, vol. 3, no. 2, pp. 173–193, 2008. [15] D. Woods and C. Fassnacht, Transana. Madison, WI, USA: The Board of Regents of the U. of Wisconsin System [Computer software], 2007. [16] C. Barras, E. Geoffrois, Z. Wu, and M. Liberman, “Transcriber: Development and Use of a Tool for Assisting Speech Corpora Production,” Speech Communication, vol. 33, no. 1–2, pp. 5–22, 2001. [17] K. Sjölander and J. Beskow, “WaveSurfer - an open source speech tool,” in Proc. of ICSLP 2000: Sixth Int. Conf. on Spoken Language Processing, B. Yuan, T. Huang, and X. Tang, Eds., vol. 4, Beijing, Oct. 2000, pp. 464–467. [18] G. Jefferson, Structures of Social Action: Studies in Conversation Analysis. New York, USA: Cambridge University Press, 1984, ch. Transcript notation.