D3.1-Report on available cutting-edge tools
FP7-ICT-287723 - REVERIE
Information and Communication Technologies (ICT) Programme
WP3 - D3.1 Report on Available Cutting-Edge Tools

Leading Author(s): Qianni Zhang, Julie Wall (QMUL)
Status - Version: Version 1.0
Contractual Date: 29 February 2012
Actual Submission Date: 29 February 2012
Distribution - Confidentiality: Public
Code: REVERIE_D3_1_QMUL_V01_20120124.docx

Copyright by the REVERIE Consortium

Disclaimer
This document contains material, which is the copyright of certain REVERIE contractors, and may not be reproduced or copied without permission. All REVERIE consortium partners have agreed to the full publication of this document. The commercial use of any information contained in this document may require a license from the proprietor of that information.

The REVERIE Consortium consists of the following companies:

No  Participant name                                Short name  Role          Country
1   STMicroelectronics                              ST          Co-ordinator  Italy
2   Queen Mary University of London                 QMUL        Contractor    UK
3   CTVC Ltd                                        CTVC        Contractor    UK
4   Blitz Games Studios                             BGS         Contractor    UK
5   Alcatel-Lucent Bell N.V.                        ALU         Contractor    Belgium
6   Disney Research Zurich                          DRZ         Contractor    Switzerland
7   Fraunhofer Heinrich Hertz Institute             HHI         Contractor    Germany
8   Philips Consumer Lifestyle                      PCL         Contractor    Netherlands
9   Stichting Centrum voor Wiskunde en Informatica  CWI         Contractor    Netherlands
10  Institut Telecom - Telecom ParisTech            TPT         Contractor    France
11  Dublin City University                          DCU         Contractor    Ireland
12  Synelixis Solutions Ltd                         SYN         Contractor    Greece
13  CERTH/Informatics and Telematics Institute      CERTH       Contractor    Greece

The information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.

Contributors
Name                 Company
Qianni Zhang         QMUL
Julie Wall           QMUL
Sigurd van Broeck    ALU
Fons Kuijk           CWI
Aljoscha Smolic      DRZ
Petros Daras         CERTH
Rufael Mekuria       CWI
Catherine Pelachaud  IT/TPT
Angélique Drémeau    IT/TPT
Philipp Fechteler    HHI
George Mamais        SYN
Philip Kelly         DCU
Menelaos Perdikeas   SYN
Gaël Richard         IT/TPT
Tamy Boubekeur       IT/TPT

Internal Reviewers
Name             Company
Daniele Alfonso  ST
Petros Daras     CERTH

Document Revision History
Date        Issue  Author/Editor/Contributor  Summary of main changes
24/01/2012  1.0    Q. Zhang                   Document structure
22/02/2012  2.0    J. Wall                    First integrated version
29/02/2012  2.1    D. Alfonso, P. Daras, all  Revised version
29/02/2012  3.0    J. Wall                    Final version
05/03/2012  4.0    J. Wall                    Final version with the new logo

Table of contents
Abbreviations
Executive Summary
1. Related Tools to WP4: Multi-modal and multi-sensor signal acquisition
   1.1. Content creation
   1.2. Multi-modal Capturing
   1.3. Performance Capture
   1.4. Activity recognition
   1.5. 3D reconstruction
   1.6. 3D User-Generated Worlds from Video/Images
2. Related Tools to WP5: Networking for immersive communication
   2.1. Network Architecture
   2.2. Naming
   2.3. Resource Management
   2.4. Streaming
   2.5. Signaling for 3DTI real-time transmission
   2.6. MPEG-V Framework
3. Related Tools to WP6: Interaction and autonomy
   3.1. 3D Avatar Authoring Tools
   3.2. Animation Engine
   3.3. Autonomous Agents
   3.4. Audio and speech tools
   3.5. Emotional Behaviours
   3.6. Virtual Worlds
   3.7. User-system interaction
4. Related Tools to WP7: Composition and visualisation
   4.1. Rendering of human characters
   4.2. Scene recomposition with source separation
   4.3. 3D audio rendering of natural sources
   4.4. Composition and synchronization
   4.5. Stereoscopic and autostereoscopic display
5. References
Abbreviations
3DTI    3D Tele-Immersion
3GPP    3rd Generation Partnership Project
AAC     Augmentative and Alternative Communication
AEGIS   Accessibility Everywhere: Groundwork, Infrastructure, Standards
AI      Artificial Intelligence
API     Application Programming Interface
BA      Bundle Adjustment
BIFS    Binary Format for Scene
BML     Behaviour Markup Language
CCN     Content Centric Networking
CDN     Content Distribution Network
CGAL    Computational Geometry Algorithms Library
CISS    Coding-Based Informed Source Separation
CPM     Componential Process Model
CPU     Central Processing Unit
CUDA    Compute Unified Device Architecture
DASH    Desktop and mobile Architecture for System Hardware
DDC     3D Digital Content
DESAM   Décomposition en Eléments Sonores et Applications Musicales
DIVE    Distributed Interactive Virtual Environment
DMIF    Delivery Multimedia Interface
DNS     Domain Name Server
DoW     Description of Work
E2E     End to End
ECA     Embodied Conversational Agents
EPVH    Exact Polyhedral Visual Hull
ER      Early Reflections
FACS    Facial Action Coding System
FIPS    Federal Information Processing Standard
FML     Function Markup Language
FMTIA   Future Media Internet Architecture Think Tank
FOV     Field of View
GPL     GNU Public License
GPU     Graphics Processing Unit
GSM     Global System for Mobile Communications
GUI     Graphical User Interface
GUID    Globally Unique Identifier
HCI     Human Computer Interaction
HMM     Hidden Markov Model
HRIR    Head-Related Impulse Response
HRTF    Head Related Transfer Functions
IETF    Internet Engineering Task Force
IMS     IP Multimedia Subsystem
IMU     Inertial Measurement Unit
IP      Internet Protocol
IR      Infrared
ISP     Internet Service Provider
ISS     Informed Source Separation
ITU     International Telecommunication Union
LGPL    Lesser General Public License
LM      Levenberg-Marquardt
LR      Late Reverberation
MD      Message Digest
MFCC    Mel-Frequency Cepstral Coefficients
MM      Man Months
MPEG    Moving Picture Experts Group
MPLS    Multiprotocol Label Switching
MST     Minimum Spanning Tree
MTU     Maximum Transmission Unit
MVC     Multi-view Video Coding
NI      Natural Interaction
NP-hard Non-deterministic Polynomial-time hard
NPC     Non-Player Character
NVBG    Non-verbal Behaviour Generator
PAD     Pleasure, Arousal and Dominance model
PCI     Peripheral Component Interconnect
PCL     Point Cloud Library
PHB     Per Hop Behaviour
PMVS    Patch-based Multi-View Stereo
PSTN    Public Switched Telephone Network
QoS     Quality of Service
RC      Reverberation Chamber
REPET   REpeating Pattern Extraction Technique
RFC     Request for Comments
ROS     Robot Operating System
RSVP    Resource Reservation Protocol
RTCP    Real-Time Transport Control Protocol
RTP     Real-Time Transport Protocol
RTSP    Real Time Streaming Protocol
RTT     Round Trip Time
SAL     Sensitive Artificial Listener
SBA     Sparse Bundle Adjustment
SC      Steering Committee
SCAPE   Shape Completion and Animation of People
SDC     Sensory Devices Capabilities
SDP     Session Description Protocol
SE      Sensory Effects
SEC     Sequential Evaluation Checks
SEDL    Sensory Effect Description Language
SEM     Sensory Effect Metadata
SEV     Sensory Effect Vocabulary
SfM     Structure from Motion
SHA     Secure Hash Algorithm
SI      Sensed Information
SIP     Session Initiation Protocol
SLA     Service Level Agreement
SMI     SensoMotoric Instruments
SMIL    Synchronized Multimedia Integration Language
SMILE   Speech & Music Interpretation by Large-space Extraction
STFT    Short-Time Fourier Transform
TCP     Transmission Control Protocol
TMC     Technical Management Committee
TUM     Technische Universität München
UDK     Unreal Development Kit
UDP     User Datagram Protocol
URC     Uniform Resource Characteristics
URI     Uniform Resource Identifier
URL     Uniform Resource Locator
URN     Uniform Resource Name
USEP    User's Sensory Effect Preferences
USS     Underdetermined Source Separation
UUID    Universally Unique Identifiers
VCEG    Video Coding Experts Group
VOIP    Voice Over IP
VR      Virtual Reality
VRML    Virtual Reality Modeling Language
VTLN    Vocal Tract Length Normalization
WP      Work Package
WWW     World Wide Web
W3C     WWW Consortium
YAAFE   Yet Another Audio Features Extractor
Executive Summary

The main objective of REVERIE is to develop an advanced framework for immersive media capturing, representation, encoding and semi-automated collaborative content production, as well as transmission and adaptation to heterogeneous displays, as a key instrument to push social networking towards the next logical step in its evolution: immersive collaborative environments that support realistic inter-personal communication. The targeted framework will exploit technologies and tools to enable end-to-end processing and efficient distribution of 3D, immersive and interactive media over the Internet. REVERIE envisages an ambient, content-centric Internet-based environment, highly flexible and secure, where people can work, meet, participate in live events, socialise and share experiences, as they do in real life, but without time, space and affordability limitations. In order to achieve this goal, the enhancement and integration of cutting-edge technologies related to 3D data acquisition and processing, sound processing, autonomous avatars, networking, real-time rendering, and physical interaction and emotional engagement in virtual worlds is required.

The project consists of eight WPs. The division between WPs has been chosen to group the activity types and skills required to implement the work programme. WP1 includes all management and overall project coordination activities. The 'integration, prototyping and validation' cluster embraces WP2 and WP3, which contain all activities related to requirement analysis, the design of the REVERIE framework architecture and the integration of REVERIE prototype applications. The next cluster, 'adaptation of cutting-edge tools and R&D', contains four WPs (WP4-WP7). The activities in this cluster focus on fundamental research leading to the implementation of the tools needed to instantiate the REVERIE prototypes. The last cluster contains WP8 and is dedicated to activities related to business models, exploitation and dissemination of REVERIE's outcomes.

This deliverable focuses on the cutting-edge technologies related to 3D data acquisition and processing, sound processing, autonomous avatars, networking, real-time rendering, and physical interaction and emotional engagement in virtual worlds.

1. Related Tools to WP4: Multi-modal and multi-sensor signal acquisition

1.1. Content creation

Interactive 3D virtual environments comprise digital content designed or adapted and implemented for use within the intended environment. Whether specially created for the targeted platform or adapted for use, at some point in the process the digital content has been created from a blank canvas.
In “To Infinity and Beyond: The Story of Pixar Animation Studios”, Paik and Iwerks (2007) illustrate the magnitude of the task: “In the computer, nothing is free. A virtual world begins as the most purely blank slate one can imagine. There are no sets, no props, no actors, no weather, no boundaries, not even the laws of physics. The computer is not unlike a recalcitrant genie, prepared to carry out any order you give it – but exactly and no more than that. So not only must every item be painstakingly designed and built and detailed from scratch, every physical law and property must be spelled out and applied.”

Realisation of interactive virtual environments is thus both technically complex and artistically challenging. Moreover, no single set of definitive, all-encompassing tools currently exists to produce high-quality, meaningful results without significant technical expertise and substantial time expenditure. This leaves ample room for the improvements that WP4 is attempting to address. WP4 focuses on all aspects related to the capturing of real world objects, scenes and environments and to the interaction between real and virtual worlds (REVERIE, 2011). This section reports on the current state-of-the-art cutting-edge tools available to create content for interactive virtual worlds.

The digital content (assets) necessary to populate virtual environments can include the background environment, set, scene objects, characters and props; some require animation, and some are simply static objects. A common approach for content creation is to first create assets in a 3D Digital Content Creation (DDC) package and optimise them for real-time use, and then pass them on to an interactive 3D development tool or framework to add interactivity and behaviours and to implement the interactive world itself. Figure 1 illustrates an overview of the initial steps involved.

The stages shown in Figure 1 begin with creating a 3D mesh and texturing the mesh to provide realistic or stylised visual detail. Depending on whether animation is required for the asset, the mesh is prepared for animation by adding a control structure to the mesh (rigging), and then animated as explained in more detail in the following paragraphs. Sound (usually edited in a sound editing package) can be added to enable lip-synching or other synchronisation of animation to sound. Techniques for creating the 3D model can include modelling and sculpting using dedicated digital content creation software packages, or creating the mesh through the use of scanning or other 3D reconstruction techniques. Low polygon meshes are generally used within the actual real-time environment, although high polygon models are often used in the production stages to improve appearance: the more detailed normal maps obtained from the high polygon model are applied to the lower polygon model. This enables the lower polygon model to mimic its higher polygon counterpart without the extra polygon count burden (Cignoni et al. 1998; Cohen et al. 1998).

Once the model is created, shaders (also known as materials) are applied to define the surface shading of the mesh. This is normally achieved through a combination of the basic attributes of the shader, such as transparency and specularity, plus any applied textures.
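As a purely illustrative aside (not part of the deliverable), the NumPy sketch below shows how a simple material might combine a basic shader attribute (specularity) with a colour sampled from an applied texture, using a Lambert diffuse term plus a Blinn-Phong specular term; all function names and values are hypothetical.

```python
import numpy as np

def sample_texture(texture, uv):
    """Nearest-neighbour lookup of an RGB texel for a UV coordinate in [0, 1]^2."""
    h, w, _ = texture.shape
    x = min(int(uv[0] * (w - 1)), w - 1)
    y = min(int(uv[1] * (h - 1)), h - 1)
    return texture[y, x]

def shade(normal, light_dir, view_dir, uv, texture, specularity=0.5, shininess=32):
    """Combine a diffuse texture lookup with basic material attributes
    (Lambert diffuse + Blinn-Phong specular)."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    base_colour = sample_texture(texture, uv)              # albedo from the UV-mapped texture
    diffuse = max(np.dot(n, l), 0.0) * base_colour         # Lambertian term
    half_vec = (l + v) / np.linalg.norm(l + v)             # Blinn-Phong half vector
    specular = specularity * max(np.dot(n, half_vec), 0.0) ** shininess
    return np.clip(diffuse + specular, 0.0, 1.0)

# Example: a flat 2x2 texture and a surface facing the light.
tex = np.ones((2, 2, 3)) * [0.8, 0.2, 0.2]
print(shade(np.array([0, 0, 1.0]), np.array([0, 0, 1.0]),
            np.array([0, 0, 1.0]), (0.5, 0.5), tex))
```

In a real engine this evaluation happens per fragment on the GPU; the sketch only makes the "attributes plus textures" combination explicit.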
Textures can provide added realism or stylised detail such as complex colour, reflections, coarse and fine displacements, specular intensity and surface relief. Textures can be procedurally driven, or be bitmaps created from photographs or from a 2D/3D paint package. The process of applying texture maps usually involves creating UV maps to map the 2D texture information onto the 3D mesh. Packages like Autodesk Maya and Autodesk 3ds Max provide facilities for UV mapping but, like many other parts of the content creation process, creating UV maps can be complex and time consuming and requires a high degree of skill. Researchers at Walt Disney Animation Studios and the University of Utah developed an alternative solution called PTex: Per-Face Texture Mapping for Production Rendering (Burley and Lacewell, 2008). PTex eliminates the need for UV assignment by storing texture information per quad face of the subdivision mesh, along with a per-face adjacency map, in a single texture file per surface. The adjacency data is used to perform seamless anisotropic filtering of multi-resolution textures across surfaces of arbitrary topology. Although originally targeted at off-line rendering, adaptations have since been made for use in real time (McDonald and Burley 2011).

Figure 1: Content Creation Pipeline Stage 1: Creating Assets for real-time use

Realistic lighting can prove computationally expensive, and therefore a common practice is to pre-light scenes, incorporating view-independent light information back into the textures. This process is known as baking texture light maps. Lighting is then removed from the scenes, and the original materials are replaced with the baked textures containing the pre-calculated lighting solution, to provide the appearance of realistic light without adding the computational burden of calculating the lighting solution in real time (Myszkowski and Kunii, 1994; Moller, 1996). Figure 2 shows an example of a baked texture, with the original tiled texture on the left, and the tiles with pre-calculated lighting baked into the texture on the right.

Once the asset has been modelled, if animation is required then the asset must normally be prepared through the design and implementation of a rig: control structures set up to articulate the mesh. Rigging requires associating the vertices of the geometric mesh of the asset with hierarchical structures, such as internal virtual skeletons, iconic controls, blend shapes and other deformers, to facilitate articulation of the mesh without the animator having to resort to complex direct manipulation of the individual vertices that make up the mesh; see Figure 3. When a virtual skeleton joint system is used, the vertices of the mesh are associated with joints through a process known as skinning. Different joints will have weighted influences on the vertices depending on their location.

Figure 2: Original texture (left), baked texture with pre-calculated light (right)

Rigging is another time-consuming task in the pipeline that requires a high degree of skill, and therefore attempts have been made to provide auto-rigging tools. For example, Mixamo (http://www.mixamo.com/) is a tool to auto-rig characters which are then compatible for use in the popular game engine Unity (http://unity3d.com/).

Figure 3: Stylised character & rig (available from http://www.creativecrash.com/maya/downloads/character-rigs/c/blake)
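To make the skinning step described above concrete, here is a minimal NumPy sketch (illustrative only, not taken from any of the packages mentioned) of linear blend skinning: each vertex is deformed by a weighted combination of the transforms of the joints it is bound to. The matrices and weights are toy values.

```python
import numpy as np

def skin_vertices(vertices, weights, joint_transforms):
    """Linear blend skinning: deform each rest-pose vertex by the
    weighted sum of its joints' 4x4 skinning matrices.
    vertices:          (V, 3) rest-pose positions
    weights:           (V, J) per-vertex joint weights, rows sum to 1
    joint_transforms:  (J, 4, 4) current-pose * inverse-bind matrices
    """
    v_h = np.hstack([vertices, np.ones((len(vertices), 1))])        # homogeneous coords (V, 4)
    blended = np.einsum('vj,jab->vab', weights, joint_transforms)   # blend matrices per vertex
    skinned = np.einsum('vab,vb->va', blended, v_h)                 # apply blended transform
    return skinned[:, :3]

# Toy example: one vertex influenced equally by a static joint and a joint
# translated one unit along x; the skinned vertex moves half a unit.
rest = np.array([[0.0, 1.0, 0.0]])
w = np.array([[0.5, 0.5]])
T0 = np.eye(4)
T1 = np.eye(4); T1[0, 3] = 1.0
print(skin_vertices(rest, w, np.stack([T0, T1])))   # -> [[0.5, 1.0, 0.0]]
```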
Rig compatibility challenges can also arise when dealing with motion capture data. One skeleton may be needed for the actual motion capture, another for the 3D DCC package, and another to deal with refining the animation in a package like Autodesk's MotionBuilder, a real-time 3D character animation software which provides an industry-leading toolset for manipulating motion capture data. Zhang et al. (2010) have taken on the task of simplifying the process of dealing with motion capture data across rigs, by proposing a software pipeline that converts the skeleton rigs to adhere to the motion capture rig.

All stages in the pipeline discussed so far need to take into account optimising for the end goal, i.e. implementation as an interactive non-linear presentation in real time. This can include double-checking geometry for flaws, optimising textures, baking animation, and removing rig elements that are incompatible with the interactive 3D development tool. Furthermore, DCC packages save the file data in their own proprietary formats. In order to be passed on through the process into a framework, game engine, or virtual environment development tool, the assets normally have to be exported into a common format. COLLADA (https://collada.org/mediawiki/index.php/COLLADA_-_Digital_Asset_and_FX_Exchange_Schema) and FBX (http://usa.autodesk.com/) are two popular digital asset exchange formats.

Popular engines to create interaction and implement the environments include Unreal (http://www.unrealengine.com/), Crytek's CryENGINE (http://www.crytek.com/), Unity (http://unity3d.com/), Vision (http://www.trinigy.net/) and BlitzTech (http://www.blitzgamesstudios.com/), which work optimally with hand-crafted 3D assets (REVERIE, 2011). Such engines provide toolsets to deal with scripting and imported assets, incorporate effects, real-time shaders and sophisticated lighting, and integrate sound; see Figure 4. Given that no definitive solution exists to produce content for virtual worlds, a wide variety of different solutions currently exist to create such environments, including sophisticated off-the-shelf tools, proprietary software and bespoke development frameworks. Some well-established virtual worlds such as SecondLife (http://secondlife.com/) enable users to populate virtual worlds with content created with incorporated proprietary toolsets, in addition to importing geometric meshes from standard off-the-shelf 3D DDC packages such as Autodesk Maya and Autodesk 3ds Max.

Figure 4: Content Creation Pipeline Stage 2: Interaction Pipeline

Semertzidis et al. (2011) have devised a system that takes advantage of a simpler and more natural interface. Rather than handcrafting the 3D models, the researchers have devised an innovative system for creating 3D environments from existing multimedia content. The user sketches the desired scene in 2D, and the system conducts a search over content-centric networks to fetch 3D models that are similar to the 2D sketches. The 3D scene is then automatically constructed from the retrieved 3D models. Although innovative, it has two main limitations: one is that it focuses on existing models, therefore ruling out the possibility of populating the world with new creations; the second is that no mention is made of handling animation.

Autodesk's 123D Catch (http://usa.autodesk.com/) minimises the expertise needed to create 3D models.
123D Catch enables users to take photographs of desired objects from multiple viewpoints, and have them stitched together in the cloud to generate high quality 3D models incorporating the photographs as textures.

Many other attempts have been made to facilitate creating virtual environments. Catanese et al. (2011) have devised an open-source framework which adapts existing platforms to facilitate the design of 3D virtual environments. The framework includes OGRE (3D rendering engine), OgreOggSound (audio library which acts as a wrapper for the OpenAL API), PhysX (physics simulator) and NxOgre (physics connector library), combined with Blender (cross-platform open-source graphics and modelling application) extended with customised plugins. Fitting into the pipeline discussed earlier, the scene itself is designed and created in Blender and then passed on to the appropriate OGRE-based managers. Liu et al. (2009) use a similar customised approach based on OGRE, RakNet (cross-platform, open-source, C++ networking engine) and an Autodesk 3ds Max pipeline. Varcholik et al. (2009) propose using the Bespoke 3DUI XNA Framework as a low-cost platform for prototyping 3D spatial interfaces in video games. Their framework extends the XNA Framework, Microsoft's set of managed libraries designed for game development based on the Microsoft .NET Framework.

Adhering to the format illustrated by Figure 1 and Figure 4, other researchers have utilised a pipeline that includes creating the assets and then passing them on to an interactive development kit to add further functionality and implementation. For example, pipelines involving 3ds Max and 3DVIA Virtools (a virtual environment development and deployment platform) have been used by researchers investigating anxiety during navigation tasks within virtual environments (Maiano et al. 2011). The 3ds Max and Virtools pipeline was also used to create virtual environments to treat social phobia (Roy et al. 2003). In other studies focusing on anxiety, Robillard et al. (2003) created therapeutic virtual environments derived from computer games. Half-Life (1998-2000) was used to custom-make arachnophobia environments and to populate them with animated spiders of different shapes and sizes. Acrophobia and claustrophobia environments were based on Unreal Tournament (2000). Using a dedicated virtual environment development tool, Slater et al. (1999) created virtual environments to provide research participants with a simulation of a public speaking environment for research into fear of public speaking. VRML (Virtual Reality Modeling Language) assets were created, and DIVE (Distributed Interactive Virtual Environment) was used to implement the environments (Slater et al. 1999; Pertaub et al. 2001; Pertaub et al. 2002). The Unity 3D development platform was recently used to realise an immersive journalism experience, presented at the 2012 Sundance Film Festival (Sand Castle Studios LLC, 2012), consisting of a virtual recreation of an eyewitness account of a real food bank line incident. Participants can use a head-mounted display to walk around and interact with the virtual reproduction.

As this section has discussed, given that there is no ideal way of creating content, a wide variety of approaches have been adopted to create content for virtual environments. The next section provides more detail on the cutting-edge tools currently available.
Available tools from the literature and REVERIE partners

Autodesk Maya / 3ds Max
Description: Autodesk Maya and 3ds Max are industry-standard 3D content creation and animation software for animation, modelling, simulation, visual effects, rendering, etc. Both are widely used for film production and real-time games. Maya is Windows, Linux and Mac OS X compatible. 3ds Max is primarily Windows based, but Max 2012 can be run on a Mac using a Windows partition.
http://usa.autodesk.com/

Blender
Description: Blender is a free open source 3D content creation suite. Blender is compatible with Windows, Linux and Mac OS X.
http://www.blender.org/

Autodesk 123D Catch
Description: Autodesk's 123D Catch enables users to take photographs of desired objects from multiple viewpoints, and have them stitched together in the cloud to generate high quality 3D models incorporating the photographs as textures.
http://usa.autodesk.com/

Autodesk MotionBuilder
Description: Autodesk MotionBuilder is a real-time 3D character animation software widely used in film and game animation pipelines. Complex character animation, including motion capture data, can be edited and played back in a highly responsive, interactive environment. MotionBuilder is ideal for high-volume animation and also includes stereoscopic toolsets.
http://usa.autodesk.com/

Autodesk Mudbox
Description: Autodesk Mudbox is a 3D digital sculpting and digital painting software that enables creation of production-ready 3D digital artwork for ultra-realistic 3D character modelling, engaging environments, and stylised props. Available for Mac, Microsoft Windows, and Linux operating systems.
http://usa.autodesk.com/

Pixologic ZBrush
Description: ZBrush is a digital sculpting and painting program that offers highly advanced tools for digital artists. ZBrush provides the ability to sculpt up to a billion polygons.
http://www.pixologic.com/home.php

Unity
Description: Unity is a development platform for creating games and interactive 3D for multi-platform deployment on the web, iOS (iPhone/iPod Touch/iPad), Mac, PC, Wii, Xbox 360, PlayStation 3, and Android. Its scripting language support includes JavaScript, C#, and a dialect of Python named Boo. Unity also contains the NVIDIA PhysX physics engine. All major tools and file formats are supported, and it has the added advantage of being able to import Autodesk 3ds Max or Autodesk Maya files directly, converting them to FBX format automatically. The assets in a Unity project can continue to be updated in their creation software, and Unity will update automatically upon save, even while the game is being played inside the Editor.
http://unity3d.com/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

Unreal Development Kit (UDK)
Description: UDK is a complete professional development framework which produces advanced visualisations and detailed 3D simulations on the PC and iOS.
http://www.udk.com/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

OGRE (Open Source 3D Graphics Engine)
Description: OGRE is a scene-oriented, flexible 3D engine designed to produce 3D interactive applications.
http://www.ogre3d.org/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

Crytek
Description: CryENGINE is an advanced development solution that enables the creation of games, movies, high-quality simulations and interactive applications. It provides an all-in-one game development solution for the PC, Xbox 360 and PlayStation 3.
http://www.crytek.com/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

Vision
Description: The Vision game engine enables the creation of games on most major platforms (PC, consoles, browsers, handhelds, mobiles), as well as services such as XBLA, PlayStation Network, and WiiWare.
http://www.trinigy.net/products/vision-engine
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

BlitzTech
Description: The BlitzTech game engine offers cross-platform runtime code which provides all hardware-specific and common code for game titles, supporting PC, PSP, Xbox 360, PS3, Wii, Mac OS X, browsers, PlayStation Vita, iOS, Android and emerging platforms. Highly integrated with the BlitzTech Tools, it facilitates rapid prototyping and development by providing a common code framework right out of the box.
http://www.blitzgamesstudios.com/blitztech/engine/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

1.2. Multi-modal Capturing

3D motion capture

Optical motion capture systems, such as Vicon or Codamotion, are typically considered the gold standard techniques for capturing the movement of human subjects. Both these systems are optical Infrared (IR)-based depth capture techniques that use a number of cameras to track the temporal positions of a number of markers fixed directly onto a human body or onto a tight-fitting suit. Passive reflective markers are used in Vicon, whereas active IR LED-based markers are used in the Codamotion system. The results obtained from either system are highly accurate, typically reconstructing human motion to within a few millimeters of the ground truth motion, and have been used extensively in the computer animation, sports science and bio-mechanics communities. However, these systems tend to be highly expensive and typically require trained expert users to capture data. Chai and Hodgins (2005) proposed a smaller scale optical capture system that needed far fewer markers and only two cameras. A database of poses was used to do lazy learning and infer the person's pose using only the locations of a few tags on the body. However, the accuracy of this approach is significantly reduced if the captured subject is not directly facing the capture device. The recent emergence of the Kinect has provided controller-free full-body gaming in home environments using image processing techniques coupled with IR-based depth capture hardware. As opposed to Vicon, the Kinect acquires a depth map of the individual it is surveying and infers human motion and actions directly from this depth map. These devices are cheap, easy to use, and can track the limb motions of a tracked individual if they are facing the capture device (the device's accuracy can be significantly reduced if the captured subject takes up alternative orientations). These advantages have resulted in significant adoption of the device in the computer gaming community.
Finally, Shiratori et al. (2011) take an interesting alternative approach, instrumenting an actor with multiple outward-looking cameras and acquiring the pose and global position of the subject using image processing techniques.

However, capturing motion with optical systems can be impractical and cumbersome in some scenarios. Performance capture within large spatial volumes, in particular, is challenging because it may be difficult to densely and safely populate the capture area with enough cameras to ensure adequate coverage (Kinect-based capture tends to decrease significantly in accuracy after 8 metres). Additional complications arise during outdoor data capture sessions, where there is little control over other variables such as lighting conditions, occlusions and even a subject's movements in or out of capture areas, all of which decrease the robustness of optical marker tracking algorithms. As such, the use of optical motion capture systems tends to be limited to indoor settings. Furthermore, the effectiveness of optical motion capture systems can be further hampered when strict constraints on reconstruction time leave little scope for manual intervention to correct artifacts in the captured motions. This is especially true when motion from high-velocity movements is required and tracking (of markers or limbs) can easily be lost. In addition, Kinect reconstruction accuracy can be hampered for body orientations that are not perpendicular to the camera's principal axis (such as when a user lies on the ground) and for fine-grained movements that do not significantly deform either the depth or shape silhouette (such as rotations around the longitudinal axis of bones, e.g. twisting a wrist or foot). A review of the technologies available for motion capture is provided in (Welch and Foxlin, 2002), while a review of over 350 publications on computer vision based approaches to human motion capture is provided in (Moeslund et al. 2006).

Since lighting changes, coverage and occlusion introduce robustness issues for most optical systems, acquiring motion from alternative sensor technologies has been investigated. Sensor data from accelerometers (Slyper and Hodgins, 2008; Tautges et al. 2011), Inertial Measurement Units (IMUs) (Vlasic et al. 2007) and pressure sensors (Ha et al., 2011) have been employed to capture the biomechanical motion of a human performer in real time. These techniques have been employed in a variety of applications where traditional motion capture techniques are unfeasible, including real-time game interaction and biomechanical analysis of injury rehabilitation exercises in the home, for example.

Techniques that use wearable sensor data can be split into two general groups. The first category acquires temporal limb and body positions of a subject by directly interpreting the acquired sensor data; a well-known use of this technology is the Nintendo Wii controller. When held in a static position, a tri-axial accelerometer outputs a gravity vector that points towards the earth. This information alone is enough to determine the sensor's pitch and roll. In prior work, the assumption of low-acceleration motion has been used to determine body limb positions using wearable accelerometers (Lee and Ha, 2001; Tiesel and Loviscach, 2006). Combining an accelerometer with a magnetometer or a digital compass additionally allows sensor yaw to be determined. A gyroscope can be used to account for high-acceleration motions.
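As an illustration of the point above, the following minimal sketch (not from the deliverable) estimates pitch and roll from a static tri-axial accelerometer reading; the axis convention and sign choices are assumptions.

```python
import numpy as np

def tilt_from_accelerometer(ax, ay, az):
    """Estimate pitch and roll (in degrees) from a static tri-axial
    accelerometer sample, assuming the only measured acceleration is gravity
    and an x-forward, y-left, z-up axis convention."""
    pitch = np.degrees(np.arctan2(-ax, np.hypot(ay, az)))   # rotation about the y axis
    roll = np.degrees(np.arctan2(ay, az))                   # rotation about the x axis
    return pitch, roll

# A sensor lying flat reads roughly (0, 0, +1 g): no tilt.
print(tilt_from_accelerometer(0.0, 0.0, 1.0))      # -> (0.0, 0.0)
# Tilted 45 degrees nose-down about the y axis.
print(tilt_from_accelerometer(0.707, 0.0, 0.707))  # -> (~-45.0, 0.0)
```

Yaw cannot be recovered this way, which is why a magnetometer (and a gyroscope for fast motion) is added, as the text notes.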
The combination of all three sensors into one unit is generally referred to as an IMU and has been incorporated into commercial products (Xsens). However, purely inertial systems are prone to drift in accuracy over time, especially if fast movements with high accelerations are performed, although incorporating even further sensors on the body can reduce the negative impact of this drift (Vlasic et al. 2007). In addition, although many of these approaches can acquire the motion of subjects' limbs, they are unable to accurately synthesise the temporal location and movement of a person through an environment. However, a trade-off exists, as these devices can be cheap, can operate outside of the lab, and free-form motion can be obtained in some scenarios in real time. In such scenarios, if human actions are required as well as human motion, then machine learning techniques are also used to infer the actions directly from the motion (e.g. if in tennis a significant forward motion of the right arm is determined over a period of time, then a forehand shot action can be inferred).

The second category of techniques is data-driven, typically using accelerometer sensor data to index into a motion database and animate motion segments that exhibit similar inertial motion data to those acquired by the worn sensors (Slyper and Hodgins, 2008; Kumar, 2008; Tautges et al. 2011). Each of these techniques animates an avatar in near-real time using readings from wearable accelerometers and a database of pre-recorded motion of the target activities. These techniques are typically not affected by drift, even when only accelerometer sensors are used, making them applicable to athletes or on-camera actors, who can tolerate only a lightweight and hidden form of instrumentation. The trade-off for obtaining high quality capture from lightweight instrumentation is that these approaches require a pre-captured database of short segments of motion that cover the basic behaviours expected from the subject during the capture. As such, an inherent limitation of these techniques is their inability to reconstruct free-form motion that is significantly different from the movements stored in the motion database. However, if the database of motions is labeled with specific actions, then not only motion but also actions can be acquired without the need for machine learning techniques.

Audio capture

Sound capturing is performed by means of microphones. A large variety of microphones exist. They are classified either by their specific mode of transduction of the sound wave into an electric signal or by their directivity characteristics. For example, omnidirectional directivity is obtained if the sensor captures the overall sound pressure. To obtain a specific directivity, a pressure gradient needs to be measured (Rossi, 1986).
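The directivity patterns discussed later in this section (omnidirectional, cardioid and more directional variants) can be summarised by the standard first-order family p(θ) = a + (1 − a)·cos θ. The sketch below (illustrative only; frequency dependence is ignored) evaluates that family for a few pattern parameters.

```python
import numpy as np

def first_order_pattern(theta, a):
    """Normalised sensitivity of a first-order microphone at angle theta
    (radians) for pattern parameter a: a=1 omnidirectional, a=0.5 cardioid,
    a=0 figure-of-eight (pure pressure-gradient).  Real capsules deviate
    from this at very high frequencies."""
    return np.abs(a + (1.0 - a) * np.cos(theta))

angles = np.radians([0, 90, 180])
print(first_order_pattern(angles, 1.0))   # omni:     [1, 1, 1]
print(first_order_pattern(angles, 0.5))   # cardioid: [1, 0.5, 0]
print(first_order_pattern(angles, 0.0))   # figure-8: approx. [1, 0, 1]
```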
In practice, the use of microphones in a closed space is directly associated with two essential concepts in room acoustics: reverberation and critical distance. The sound field at the microphone is composed of a direct sound field (i.e. the direct path between microphone and sound source) and a diffuse field, also called the reverberation field (composed of the signals produced by the sound source after having been altered by reflection, diffusion and diffraction). The principal goal of sound capturing is to capture the direct sound field. However, capturing the diffuse sound field may be either desirable or problematic, depending on the situation or use case. Indeed, it is desirable to capture part of the diffuse field to reproduce the ambiance in which the sound sources are placed or to reproduce specific room characteristics (this is often desired for classical music or live pop music recordings). In other situations, however, the diffuse sound field may principally contain sources of perturbation (such as noises and concurrent voices), which are problematic and lower the speech intelligibility or the music recording quality.

Another concept of interest is the capability to record an audio scene with the spatial information of each source. This is usually done by using multiple microphones. It is finally worth noting that a large number of aspects of audio recording are influenced by the capturing process itself; thus it is possible to adopt a number of sound processing techniques that would improve the sound recording (such as noise reduction and source separation).

Conventional stereophony

Conventional stereophony encodes the relative position of the sound sources (during the recording) in terms of intensity and delay differences between the two signals that would be emitted by the two loudspeakers. This is commonly obtained, during the recording, by using two microphones of different directivity which are either placed at different positions or oriented in different directions. When the microphones are placed in different locations we refer to a "couple of non-coincident microphones". In the other case, when the two microphones are co-located, we refer to a "couple of coincident microphones".

Figure 5: Schematic diagram of coincident microphones

Figure 6: Stereophonic AB-ORTF couple
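As a toy illustration of the coincident-pair case (not taken from the deliverable), the sketch below models two cardioid capsules angled at ±45° and computes the inter-channel level difference they encode for a source at a given azimuth; the capsule angle and cardioid pattern are assumptions.

```python
import numpy as np

def cardioid(theta):
    """First-order cardioid sensitivity at angle theta (radians)."""
    return 0.5 + 0.5 * np.cos(theta)

def coincident_pair_gains(source_azimuth_deg, capsule_angle_deg=45.0):
    """Channel gains of a coincident (XY) cardioid pair for a source at the
    given azimuth.  Coincident capsules produce only intensity differences;
    a non-coincident pair (e.g. AB/ORTF) would additionally encode delays."""
    az = np.radians(source_azimuth_deg)
    cap = np.radians(capsule_angle_deg)
    g_left = cardioid(az - cap)    # left capsule aimed at +45 degrees
    g_right = cardioid(az + cap)   # right capsule aimed at -45 degrees
    return g_left, g_right

gl, gr = coincident_pair_gains(30.0)            # source 30 degrees to the left
print(gl, gr, 20 * np.log10(gl / gr), "dB")     # inter-channel level difference
```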
Surround sound recording

When an improved surround sound experience is desired, it is common practice to increase the number of microphones. This is indeed the case for sound recording in the Ambisonics format (called UHJ or B-Format for the 1st order approximation), which can be obtained, for example, through a sound-field microphone with 4 sensors. More information on Ambisonics recordings and synthesis can be found at http://www.ambisonic.net/ and http://en.wikipedia.org/wiki/Ambisonics. Using the sound-field microphone, we obtain four signals from the microphone capsules; this is often referred to as A-Format. The resulting signals are then further processed and encoded in B-Format; for example, see (Craven and Gerzon, 1975).

Figure 7: Schematic diagram of a 4-sensor sound-field microphone

Figure 8: Plug-and-Play setup for Surround Ambience Recording: ORTF Surround Outdoor Set with wind protection (http://www.schoeps.de/en/products/ortf-outdoor-set)

Microphone directivity

As briefly outlined above, microphones can be classified according to their directivity characteristics. We provide below typical directivity patterns for specific microphones from three different categories: omnidirectional, cardioid, and shotgun. The following directivity pattern is typical of an omnidirectional microphone, where the sensitivity is similar in all directions except for very high frequencies above 10 kHz.

Figure 9: Directivity pattern for microphone U89 in omnidirectional position

The directivity pattern of cardioid microphones is clearly asymmetric.

Figure 10: Directivity pattern for microphone U89 in cardioid position

The directivity pattern of shotgun microphones is close to hypercardioid. Shotgun microphones are known to be quite robust to wind interference in outdoor recordings.

Figure 11: Directivity pattern for shotgun microphone KMR 82

Sound capturing and critical distance

When dealing with sound capturing, it is of primary importance to assess the impact of the microphone selectivity on the recording. This aspect is particularly important in outdoor recordings, where noise sources may be prominent, and in indoor situations, where the room has a strong influence on the sound characteristics. For indoor recordings, the concept of critical distance is particularly relevant. The room impulse response clearly depends on the location of the sound source and on the microphone (or sensor) location. The sound captured at the microphone is the sum of several components: the direct sound field, the early reflections (ER) and the diffuse sound field (or reverberation). A key aspect of sound recording is how the sound field is modified when the sources or sensors are moving. It is possible to show that the diffuse sound field is relatively independent of the source and microphone locations, while the direct sound field component largely depends on those positions. The critical distance is defined as the distance at which the energy of the direct sound field is equal to the energy of the diffuse sound field. It is therefore possible to compare the critical distance of several microphone types, obtaining an indication of their appropriate use in real recordings.

Figure 12: Influence of directivity on the critical distance

Noise reduction

Noise reduction has been a field of intense research for many decades. The research domain is often subdivided depending on the number of microphones or sensors used. Spatial information is not available when only one microphone is employed; therefore, the resulting performance is of lower quality than what can be attained when multiple sensors are employed. In most situations, it is possible to collect an example of the background noise (without the signal of interest), and thus to estimate the background noise spectral properties. For audio signals, an excerpt of 0.5 seconds is typically sufficient to obtain a satisfactory estimate of the background noise spectral properties. It is, however, necessary to automatically detect such portions of the signal, where only the background noise is present. This is usually done by considering that the noise has different statistical properties than the signal of interest. This type of approach is well suited for stationary noises (noises whose first- and second-order statistical properties are independent of time). In practice, the stationarity hypothesis is only valid over finite time horizons, and it is thus necessary to rely on adaptive algorithms which adaptively estimate the noise spectral properties. While in some cases noise reduction can benefit from a source production model (in particular, this is possible for noise reduction of speech signals), in general it is not possible to exploit such models due to the high variability of audio sources. However, methods initially introduced for speech signals, which rely on the fact that the signal of interest is periodic or quasi-periodic, will also work to a certain extent for general audio signals. Such methods are usually based on the principle of spectral subtraction, which consists in lowering or even suppressing the low-energy components (or those which lie between harmonics) of the short-term Fourier transform. The interested reader may consult the following references: (Cappé, 1994; Ephraim and Malah, 1984; Ephraim and Malah, 1985; Ephraim and Malah, 1998; Godsill and Rayner, 1998).
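A compact NumPy sketch of the single-channel approach just described is given below (illustrative only, not any specific published algorithm): a background-noise magnitude spectrum is estimated from a signal-free excerpt and then subtracted, frame by frame, from the short-term spectrum of the noisy recording. The frame length, overlap, flooring rule and the omission of window normalisation are arbitrary simplifications.

```python
import numpy as np

def spectral_subtraction(noisy, noise_excerpt, frame=512, hop=256, floor=0.05):
    """Single-channel noise reduction by magnitude spectral subtraction.
    The noise spectrum is estimated from a signal-free excerpt (e.g. ~0.5 s
    of background noise) and subtracted from each STFT frame."""
    win = np.hanning(frame)

    def stft_frames(x):
        n = 1 + max(0, len(x) - frame) // hop
        return np.stack([np.fft.rfft(win * x[i * hop:i * hop + frame])
                         for i in range(n)])

    noise_mag = np.abs(stft_frames(noise_excerpt)).mean(axis=0)   # average noise spectrum
    frames = stft_frames(noisy)
    mag, phase = np.abs(frames), np.angle(frames)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)          # subtract, keep a small floor
    clean_frames = clean_mag * np.exp(1j * phase)                 # reuse the noisy phase

    out = np.zeros(len(noisy))
    for i, f in enumerate(clean_frames):                          # overlap-add resynthesis
        out[i * hop:i * hop + frame] += win * np.fft.irfft(f, frame)
    return out

# Toy usage: a sine buried in white noise, with a noise-only excerpt for estimation.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
noise = 0.3 * rng.standard_normal(t.size)
enhanced = spectral_subtraction(np.sin(2 * np.pi * 440 * t) + noise, noise[:8000])
```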
When multiple microphones are available, the spatial information can also be exploited. The most widely used principle consists of re-synchronising the signals from the multiple microphones and then adding the different contributions. As a result, the signal of interest is amplified, while the undesired noise signals will not, in general, add constructively. An example of such a setting for a linear array of sensors is shown in Figure 13.

Figure 13: Schematic representation of beamforming in a linear microphone array

Improved noise-reduction performance can be obtained with such arrays. Note that a wide range of combinations and structures can be designed for microphone arrays. The interested reader may in particular consult (Brandstein and Ward, 2001).

Source separation

This brief overview of source separation is extracted from (Mueller et al. 2011). The goal of source separation is to extract all individual sources from a mixed signal. In a musical context, this translates to obtaining the individual track of each source or instrument (or individual notes for polyphonic instruments, such as a piano). A number of excellent overviews of source separation principles are available; see (Virtanen, 2006; Comon and Jutten, 2010). In general, source separation refers to the extraction of full-bandwidth source signals, but it is interesting to mention that several polyphonic music processing systems rely on a simplified source separation paradigm. For example, a filter bank decomposition (splitting the signal into adjacent, well-defined frequency bands) or a mere harmonic/noise separation (Serra, 1989) (as used for drum extraction (Gillet and Richard, 2008) or tempo estimation (Alonso et al. 2007)) may be regarded as instances of rudimentary source separation.

Three main situations occur in source separation problems: the determined case corresponds to the situation where there are as many mixture signals as different sources in the mixtures; conversely, the overdetermined/underdetermined cases refer to situations where there are more/fewer mixtures than sources. Underdetermined Source Separation (USS) is obviously the most difficult case. The problem of source separation classically includes two major steps that can be realised jointly: estimating the mixing matrix and estimating the sources. Let X = [x_1(n), ..., x_N(n)]^T be the N mixture signals, S = [s_1(n), ..., s_M(n)]^T the M source signals, and A = [a_1, a_2, ..., a_N]^T the N×M mixing matrix with mixing gains a_i = (a_i1, a_i2, ..., a_iM). The mixture signals are then obtained by X = AS. This corresponds to the instantaneous mixing model (the mixing coefficients are simple scalars).
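A small NumPy sketch of the instantaneous model X = AS, with made-up gains, is given below; it also shows the determined case, where the sources are recovered by inverting the mixing matrix as described in the following paragraphs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two sources, two mixtures: the determined, instantaneous case.
S = rng.standard_normal((2, 1000))            # source signals s_1(n), s_2(n)
A = np.array([[1.0, 0.6],                     # made-up mixing gains a_ij
              [0.4, 1.0]])
X = A @ S                                     # instantaneous mixing: X = A S

# If A is known (or has been estimated), the sources follow directly:
S_hat = np.linalg.inv(A) @ X                  # determined case: S = A^-1 X
print(np.allclose(S_hat, S))                  # True

# In the underdetermined case (more sources than mixtures) A is not square,
# no inverse exists, and priors or time-frequency masking are needed instead.
```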
The more general convolutive mixing model considers that a filtering occurs between each source and each mixture (see Figure 14).

Figure 14: Convolutive mixing model

In this case, if the filters are represented as N×M finite impulse response filters with impulse responses h_ij(t), the mixing matrix is given by A = [h_1(t), h_2(t), ..., h_N(t)]^T with h_i(t) = [h_i1(t), ..., h_iM(t)], and the mixing model corresponds to X = A * S.

A wide variety of approaches exist to estimate the mixing matrix, relying on techniques such as independent component analysis, sparse decompositions or clustering approaches (Virtanen, 2006). In the determined case, it is straightforward to obtain the individual sources once the mixing matrix is known: S = A^-1 X. The underdetermined case is much harder, since it is an ill-posed problem with an infinite number of solutions. Again, a large variety of strategies exists to recover the sources, including heuristic methods, minimisation criteria on the error ||X − AS||^2, or time-frequency masking approaches. One popular approach, termed adaptive Wiener filtering, exploits soft time-frequency masking. Music signal separation is a particularly difficult example of USS of convolutive mixtures (many concurrent instruments, possibly mixed down with different reverberation settings, many simultaneous musical notes and, in general, a recording limited to, at best, 2 channels). The problem is often tackled by integrating prior information on the different source signals. For music signals, different kinds of prior information have been used, including timbre models (Ozerov, 2007), harmonicity of the sources (Vincent et al. 2010), and temporal continuity or sparsity constraints (Virtanen, 2007). In some cases, by analogy with speech signal separation, it is possible to exploit production models; see (Durrieu et al. 2010). Concerning evaluation, the domain of source separation of audio signals is now quite mature, and regular evaluation campaigns exist along with widely used evaluation protocols (Vincent et al. 2005).

As examples of multimodal capture set-ups, we can mention the dance-class scenario considered in the 3DLife/Huawei ACM MM Grand Challenge (http://perso.telecomparistech.fr/~essid/3dlife-gc-11/), which uses a set of microphones, inertial sensors, cameras and Kinects, and the IDIAP Smart Meeting Room (Moore, 2002), consisting of a set of microphones and cameras.

Available tools from the literature and REVERIE partners

Figure 15: Kinect depth maps and OpenNI skeleton tracking. (a) Kinect depth maps; (b) skeleton tracking from Kinect depth data using the OpenNI library.

Microsoft Kinect RGB+Depth sensor
Description: The Kinect sensor features (1) a regular RGB camera and (2) a depth scanner, consisting of a stereo pair of an IR projector and an IR camera (monochrome CMOS sensor), with a baseline of approximately 7.5 cm. The depth values are inferred by measuring the disparity between the received IR pattern and the emitted one, which is a fixed pattern of light and dark speckles. The Kinect driver outputs an Nx × Ny = 640 × 480 depth grid with a precision of 11 bits at 30 frames/sec. The RGB image is provided at the same resolution and frame rate as the depth data. According to the Kinect-ROS (Robot Operating System) wiki, the RGB camera's Field of View (FOV) is approximately 58 degrees, while the depth camera's FOV is approximately 63 degrees.
Microsoft officially specifies that the Kinect's depth range is 1.2-3.5 meters, but it can be experimentally verified that the minimum distance can be as low as 0.5 meters and the maximum distance can reach 4 meters. Essentially, the IR projector and camera constitute a stereo pair, hence the expected precision of the Kinect's depth measurements is proportional to the square of the actual depth. The experimental data presented in the ROS wiki confirm this precision model, showing a precision of approximately 3 mm at a distance of 1 meter and 10 mm at 2 meters. The accuracy of a calibrated Kinect sensor can be very high, of the order of ±1 mm. The Kinect sensor will be useful mainly in Task 4.3 (Capturing and Reconstruction) and may be used in Task 4.4 (Multimodal User Activity Analysis) for capturing human body motion (exploiting the OpenNI API, see below).
http://www.ros.org/wiki/kinect_calibration/technical
http://www.ros.org/wiki/openni_kinect/kinect_accuracy
http://www.xbox.com/kinect
Dependencies on other technology: PrimeSense Kinect driver; OpenNI API or the Microsoft Kinect SDK (MS Windows 7 or newer).
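The quadratic precision model quoted above can be made concrete with a short sketch (illustrative only; the 3 mm reference value at 1 m is taken from the figures cited above, the remaining values are extrapolations of that model).

```python
def kinect_depth_precision(z_m, sigma_at_1m_mm=3.0):
    """Expected random error of a disparity-based depth sensor at range z,
    using the quadratic model sigma(z) ~ sigma_1 * z^2 (z in meters)."""
    return sigma_at_1m_mm * z_m ** 2

for z in (0.5, 1.0, 2.0, 3.5):
    print(f"depth {z:.1f} m -> ~{kinect_depth_precision(z):.1f} mm expected precision")
# With a 3 mm reference at 1 m this predicts ~12 mm at 2 m, the same order of
# magnitude as the ~10 mm reported in the ROS wiki measurements cited above.
```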
Nvidia's Compute Unified Device Architecture (CUDA) SDK
Description: CUDA™ is a parallel computing programming model that enables parallel computations on modern GPUs (Nvidia's CUDA-enabled GPUs) without the need for mapping them to graphics APIs. The use of the CUDA SDK can dramatically increase computing performance, which is needed in Task 4.3 (Capturing and Reconstruction) to realize real-time reconstructions. http://www.nvidia.com/object/cuda_home_new.html
Dependencies on other technology: Nvidia's CUDA-enabled GPUs.

ROS Kinect
Description: The ROS Kinect open-source project focuses on the integration of the Microsoft Kinect sensor with ROS. The Kinect ROS stack contains some packages/components that may be useful in the 3D reconstruction task from Kinect data. http://www.ros.org/wiki/kinect
Dependencies on other technology: OpenNI Kinect drivers and API, PCL, OpenCV

Bundler: Structure from Motion (SfM) for Unordered Image Collections
Description: Bundler is an SfM system for unordered image collections (for instance, images from the Internet), written in C and C++. Bundler takes a set of images, image features, and image matches as input, and produces a 3D reconstruction of the camera and (sparse) scene geometry as output. Bundler has been successfully run on many Internet photo collections, as well as on more structured collections. http://phototour.cs.washington.edu/bundler/
Dependencies on other technology: Sparse Bundle Adjustment (SBA) package of Lourakis and Argyros

Generic SBA C/C++
Description: A C/C++ package for generic SBA, distributed under the GPL. Bundle Adjustment (BA) is almost invariably used as the last step of every feature-based multiple-view reconstruction vision algorithm to obtain optimal 3D structure and motion (i.e. camera matrix) parameter estimates. Provided with initial estimates, BA simultaneously refines motion and structure by minimizing the reprojection error between the observed and predicted image points. The minimization is typically carried out with the aid of the Levenberg-Marquardt (LM) algorithm. http://www.ics.forth.gr/~lourakis/sba/
Dependencies on other technology: SBA relies on LAPACK (http://www.netlib.org/lapack/) for all linear algebra operations arising in the course of the LM algorithm.

Patch-based Multi-View Stereo (PMVS) Software
Description: PMVS is multi-view stereo software that takes a set of images and camera parameters and reconstructs the 3D structure of an object or a scene visible in the images. Only rigid structure is reconstructed; in other words, the software automatically ignores non-rigid objects such as pedestrians in front of a building. The software outputs a set of oriented points instead of a polygonal (or mesh) model, where both the 3D coordinate and the surface normal are estimated at each oriented point. http://grail.cs.washington.edu/software/pmvs/

Multicore BA
Description: This project considers the design and implementation of new inexact Newton-type BA algorithms that exploit hardware parallelism for efficiently solving large-scale 3D scene reconstruction problems. This approach overcomes severe memory and bandwidth limitations of current-generation GPUs and leads to more space-efficient algorithms and to surprising savings in processing time. http://grail.cs.washington.edu/projects/mcba/
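To make the objective that these BA packages minimize concrete, the sketch below computes the squared reprojection error of one 3D point under a simple pinhole camera; a bundle adjuster sums this residual over all points and cameras and minimizes it with Levenberg-Marquardt. The camera parametrization chosen here is illustrative only and not the internal representation of any of the packages above.

// Squared reprojection error of a single 3D point in a single pinhole camera.
// A bundle adjuster sums this residual over all observations and minimizes it
// (typically with Levenberg-Marquardt). The parametrization is illustrative.
#include <Eigen/Dense>

struct PinholeCamera {
    Eigen::Matrix3d R;      // world-to-camera rotation
    Eigen::Vector3d t;      // world-to-camera translation
    double fx, fy, cx, cy;  // intrinsics: focal lengths and principal point
};

double squaredReprojectionError(const PinholeCamera& cam,
                                const Eigen::Vector3d& X,      // 3D point (world frame)
                                const Eigen::Vector2d& x_obs)  // observed pixel position
{
    const Eigen::Vector3d Xc = cam.R * X + cam.t;               // camera coordinates
    const Eigen::Vector2d x_pred(cam.fx * Xc.x() / Xc.z() + cam.cx,
                                 cam.fy * Xc.y() / Xc.z() + cam.cy);
    return (x_pred - x_obs).squaredNorm();                      // ||predicted - observed||^2
}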
Vicon
Description: Motion capture, or mocap, refers to the process of recording movement and translating that movement onto a digital model. It is used in military, entertainment, sports and medical applications, and for the validation of computer vision. The Vicon infra-red motion capture system is the industry gold standard for passive-tag motion capture. It is a semi-automated optical motion capture system that tracks the 3D position of infra-red reflective markers in 3D space using specialized camera hardware. Each of the marker tags can be tracked with a high degree of accuracy. A performer wears markers near each joint so that the motion can be identified from the positions of, or the angles between, the markers. The motion of the subject can then be reconstructed to a high degree of accuracy using these tracked markers. http://www.vicon.com

Codamotion
Description: A technology similar to Vicon, but one that uses active sensors, which are powered to emit their own light. The power to each marker can be provided sequentially, in phase with the capture system, providing a unique identification of each marker for a given capture frame, at a cost to the resultant frame rate. http://www.codamotion.com

Xsens MVN - Inertial Motion Capture
Description: Xsens MVN performs motion capture using inertial sensors attached to the body by a lycra suit. This approach gives the performer freedom of movement because MVN uses no cameras. It is a flexible and portable motion capture system that can be used indoors and outdoors. Xsens MVN requires minimal clean-up of captured data, as there is no occlusion or marker swapping. http://www.xsens.com

Xsens MVN – BIOMECH
Description: MVN BIOMECH is an ambulatory, full-body, 3D human kinematics, camera-less measurement system. It is based on inertial sensors, biomechanical models and sensor fusion algorithms. MVN BIOMECH can be used indoors and outdoors, regardless of lighting conditions. The results of MVN BIOMECH trials require minimal post-processing, as there is no occlusion and there are no lost markers. MVN BIOMECH is used in many applications, such as biomechanics research, gait analysis, human factors and sports science.

Kinect SDK Dynamic Time Warping Gesture Recognition
Description: An open-source project allowing developers to include fast, reliable and highly customisable gesture recognition in Microsoft Kinect SDK C# projects. This project is currently in setup mode and only available to project coordinators and developers. http://kinectdtw.codeplex.com/
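The project above is C#-based, but the underlying dynamic time warping idea is simple to illustrate. The sketch below (in C++, for consistency with the other examples in this section) computes the classic DTW distance between two gesture trajectories represented as sequences of feature vectors, e.g. joint positions from a skeleton tracker; the feature representation is an assumption for illustration.

// Dynamic time warping distance between two gesture sequences, each a series
// of feature vectors (e.g. 3D joint positions per frame). A recorded template
// gesture is matched against a live sequence; the smaller the DTW distance,
// the better the match. Plain O(n*m) dynamic programming.
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>

using Frame = std::vector<double>;          // one feature vector per frame

static double frameDistance(const Frame& a, const Frame& b)
{
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(d);
}

double dtwDistance(const std::vector<Frame>& s, const std::vector<Frame>& t)
{
    const std::size_t n = s.size(), m = t.size();
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> D(n + 1, std::vector<double>(m + 1, INF));
    D[0][0] = 0.0;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
            D[i][j] = frameDistance(s[i - 1], t[j - 1]) +
                      std::min({D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]});
    return D[n][m];                         // lower values mean a better alignment
}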
Recovering Articulated Pose of 3D Point Clouds (Fechteler and Eisert, 2011)
Description: An efficient optimization method to determine the 3D pose of a human from a point cloud. The well-known ICP algorithm is adapted to fit a generic articulated template to the 3D points. Each iteration jointly refines the parameters for rigid alignment and uniform scale, as well as all joint angles. Experimental results demonstrate the effectiveness of this computationally efficient approach.
Figure 16 (from left to right): Input scan, template model, both meshes superimposed on each other initially and after pose adaptation.

Alcatel-Lucent Background Removal Software
Description: The proprietary Alcatel-Lucent background removal software uses 2D video input. It uses a non-disclosed set of algorithms to efficiently and qualitatively remove the background from the user's moving image in the foreground of a 2D video stream. http://www.alcatel-lucent.com

Digital Y-US1 Yamaha Piano
Description: Digital Y-US1 Yamaha Piano with the Mark III Disklavier system, which enables the simultaneous recording of MIDI-type information.

Large anechoic room
Description: The large anechoic room (125 m3) has a reverberation time lower than 30 msec at 125 Hz. Mounted on trapezes with high-quality material cones, this anechoic room possesses excellent anechoic characteristics which permit high-quality recordings and 3D audio rendering experiments.

Audio recording studio
Description: The recording studio allows for 16 tracks of professional-quality recordings. It is fully equipped for high-quality sound recording using multiple sensors (software, mixing tables, directional and omnidirectional microphones, KEMAR® headset and/or cameras).

Loudspeakers for 3D audio
Description: 3D audio system with a large number of dedicated loudspeakers, including a set of 12 passive TANNOY® loudspeakers for realistic 3D sound rendering.

1.3. Performance Capture
A lot of effort has been made by the computer graphics community to find efficient and accurate methods to reconstruct and capture dynamic 3D objects. The methods range from global performance capture to more constrained facial performance capture. The branches of industry most interested in this kind of method are the movie and video game studios. This kind of application needs very high quality synthesis that is editable and coherent over time. Real-time operation is not the main target of these methods; however, they give interesting insight into what can be done with typical sensors, such as cameras and structured lighting, and a lot of processing power. From a fairly global point of view, some research focuses on reconstructing a consistent 4D surface (3D space and time) from time-varying point clouds (Wand et al., 2007). Others try to estimate a consistent surface directly with a smooth or simplified template; in this category, some use a skeleton-based meshed template and others a fully general meshed template.
Figure 17: Side-by-side comparison of input and reconstruction of a dancing girl wearing a skirt.
Figure 18: Input scans (left) and the reconstruction result (right).
Figure 19: 1st image: the articulated template model. 2nd image: using the estimated surface of the previous frame, the pose of the skeleton (1st image) is optimized such that the deformed surface fits the image data. 3rd image: since skeleton-based pose estimation is not able to capture garment motion, the surface is refined to fit the silhouettes. 4th image: the final reconstruction with the fitted skeleton and the refined surface.
Figure 20: From left to right, the system starts with a stream of silhouette videos and a rigged template mesh. At every frame, it fits the skeleton to the visual hull, deforms the template using Linear Blend Skinning and adjusts the deformed template to fit the silhouettes. The user can then easily edit the geometry or texture of the entire motion.
Some methods address the full task of reconstructing a consistent surface without any prior, i.e. without a template (Wand et al., 2009), or with the only prior that the observed surface comes from an articulated object.
Some results are shown in the following figures.
Figure 21: Left: input range scans. Right: poseable, articulated 3D model. The articulated global registration algorithm can automatically reconstruct articulated, poseable models from a sequence of single-view dynamic range scans.
Figure 22: Results from (Wand et al., 2009). Top: input range scans. Middle: reconstructed geometry. Bottom: the hand is mapped to show the consistency of the reconstructed surface.
Vlasic et al. (2009) estimate the normals along with the position of the surface by using an active lighting setup that permits the extraction of normal information. While providing very vivid geometry (with a lot of detail, thanks to the normal estimation), they do not output a consistent geometry throughout the entire sequence, since each mesh is computed independently at each time step. Results are shown in Figure 23.
Figure 23: Results from (Vlasic et al., 2009). Top and middle: photometric inputs, normal maps and reconstructed geometry. Bottom: the acquisition setup consists of 1200 individually controllable light sources; eight cameras are placed around the setup, aimed at the performance area.
These methods also differ by their input data type: 3D point clouds for (Wand et al., 2009, 2007), and multi-view cameras (with a strong use of the visual hull) for (Vlasic et al., 2009). If we look at more constrained setups, such as those used for facial performance capture, we end up with ultra-high-quality reconstructions. Some research (Wilson et al., 2010) tries to tackle one of the caveats of (Vlasic et al., 2009), namely the high video/photometric throughput necessary to perform the normal estimation. Others rely on passive multi-view stereo only, by tracking the very fine details of the skin such as pores (which give a tremendous amount of information about how the surface of the face deforms). Some results are shown in the following figures.
Figure 24: Results from (Wilson et al., 2010). Smiling sequence captured under extended spherical gradient illumination conditions (top row), synthesized intermediate photometric normals (center row), and high-resolution geometry (bottom row) of a facial performance as reconstructed with the same temporal resolution as the data capture.
Figure 25: Top row: one reference frame, the reconstructed geometry (1 million polygons), final textured result. Bottom row: two different frames from the final textured result, acquisition setup.
Some methods use a learning stage to extrapolate high-dimensional synthesis from low-dimensional capture. This involves the use of a sparse set of markers on the face of an actor, to learn how the skin folds and deforms according to the displacement of the markers, in order to synthesize the deformation of a template densely in time and space. In this category we find Weise et al. (2011), who compute the most probable actor face pose from previously computed ones in the blendshape space (which is a low-dimensional representation of facial expressions). A learning stage permits the learning of a probability model (the prior) in the blendshape space, which enables tracking a full expression sequence in real time from very noisy on-line input (dynamic 3D scans from a Kinect device). Results are shown in the following figures.
Figure 26: New high-resolution geometry and surface detail are synthesized from sparse motion capture markers using deformation-driven polynomial displacement maps. Top row, from left to right: motion capture markers, deformed neutral mesh, deformed neutral mesh with added medium-frequency displacements. Bottom row, from left to right: deformed neutral mesh with added medium- and high-frequency displacements, ground-truth geometry.
Figure 27: Results from (Weise et al., 2011). For each row, from left to right: input color image (from a Kinect sensor), input depth map (from a Kinect sensor), tracked facial expression, retargeting on a virtual avatar (thanks to the convenient blendshape framework).
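As a concrete illustration of the blendshape representation used by Weise et al. (2011), the sketch below evaluates a face mesh as the neutral shape plus a weighted sum of blendshape offsets; the tracker's job is essentially to estimate the weight vector for every frame. The data layout is an assumption for illustration, not the representation used in the cited work.

// Evaluate a blendshape face model: neutral mesh plus a weighted sum of
// per-blendshape vertex offsets. Facial tracking in the blendshape space
// amounts to estimating the weight vector w for every frame.
#include <vector>
#include <array>

using Vertex = std::array<float, 3>;
using Mesh   = std::vector<Vertex>;          // fixed topology, vertex positions only

Mesh evaluateBlendshapes(const Mesh& neutral,
                         const std::vector<Mesh>& blendshapeOffsets, // offsets from neutral
                         const std::vector<float>& w)                // one weight per blendshape
{
    Mesh result = neutral;
    for (std::size_t b = 0; b < blendshapeOffsets.size(); ++b)
        for (std::size_t v = 0; v < result.size(); ++v)
            for (int c = 0; c < 3; ++c)
                result[v][c] += w[b] * blendshapeOffsets[b][v][c];
    return result;   // the facial expression corresponding to the weights w
}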
Usefulness to the project
Performance capture is becoming a standard pipeline for high-quality movie SFX production with animated characters. Developing a specific methodology approximating the result of such offline procedures will provide a high-end version of the REVERIE framework, for users running REVERIE solutions on powerful computers with a high-quality capture setup.

Available tools from the literature and REVERIE partners

Dense 3D Motion Capture for Human Faces (Furukawa and Ponce, 2009b)
Description: A novel approach to motion capture from multiple, synchronized video streams, specifically aimed at recording dense and accurate models of the structure and motion of highly deformable surfaces such as skin, which stretches, shrinks and shears in the midst of normal facial expressions. Solving this problem is a key step toward effective performance capture for the entertainment industry, but progress so far has been hampered by the lack of appropriate local motion and smoothness models. This work contributes a novel approach to regularization adapted to non-rigid tangential deformations. Concretely, the non-rigid deformation parameters are estimated at each vertex of a surface mesh, smoothed over a local neighborhood for robustness, and used to regularize the tangential motion estimation. To demonstrate the power of the proposed approach, the performance of the algorithm was tested on three extremely challenging face datasets that include highly non-rigid skin deformations, wrinkles, and quickly changing expressions. Additional experiments with a dataset featuring fast-moving cloth with complex and evolving fold structures demonstrate that the adaptability of the proposed regularization scheme to non-rigid tangential motion does not hamper its robustness, since it successfully recovers the shape and motion of the cloth without overfitting, despite the absence of stretch or shear in this case.
Figure 28 (from left to right): Sample input image, reconstructed mesh model, estimated motion, and a texture-mapped model for one frame with interesting structure/motion. The right two columns show the results for another interesting frame.
As performance capture is still a rather novel and very active research field, there is almost no software resource available and specific development will be required. Still, the Microsoft Kinect SDK can be used as a basic tool chain to perform capture from multiple Kinects.

1.4. Activity recognition
Activity recognition aims to recognize the actions and goals of one or more agents from a series of observations of the agents' actions and the environmental conditions.
This is a very challenging problem for traditional optical approaches, which attempt to track and understand the behaviour of agents in videos using computer vision. Action recognition is a very active research topic in computer vision with many important applications, including user interface design, robot learning and video surveillance, among others. In this work we specifically address techniques and applications for human-computer interaction. Historically, action recognition in computer vision has been sub-divided into topics such as gesture recognition, facial expression recognition and movement behaviour recognition. Typically, generic activity recognition algorithms involve the extraction of some features, the segmentation of possible actions from a sequence, and the classification of these actions from the extracted features using machine-learning techniques. A thorough overview of computer vision techniques for activity recognition is beyond the scope of this document; interested readers are directed to three recent publications (Aggarwal and Ryoo, 2011; Weinland et al. 2011; Poppe, 2010), which together provide a thorough overview and survey of the state-of-the-art research into human body motion activity recognition using computer vision techniques, both in terms of feature extraction and classification techniques. In these papers, the authors tend to focus on determining the actions and activities of a single human subject's body, whose motion appears in the FOV of a single standard visual-spectrum camera. Although most of the existing work on human body activity recognition involves the use of computer vision, gesture recognition has recently been investigated using non-visual sensors. One approach spots the temporal locations of sporadically occurring gestures in a continuous data stream from body-worn inertial sensors; the spotted gestures are subsequently classified using Hidden Markov Models (HMMs). In (Benbasat and Paradiso, 2001) an additional step is added, where data from six-axis IMUs are first categorized on an axis-by-axis basis as simple motions (straight line, twist, etc.) with magnitude and duration. These simple gesture categories are then combined, both concurrently and consecutively, to create specific composite gestures, which can then be set to trigger output routines. The work of Ward et al. (2005) extends the use of solely inertial data by combining a wrist-worn 3-axis accelerometer with a wrist-worn microphone for continuous activity recognition. In this scenario, characteristic movements and sounds were used to classify hand actions in a workshop scenario, such as sawing and drilling.
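Several of the works above classify a segmented gesture by evaluating its likelihood under a set of trained HMMs and selecting the best-scoring model. The sketch below shows the standard forward-algorithm likelihood computation for one discrete-observation HMM; it is a generic textbook formulation, not the specific models used in the cited papers.

// Forward algorithm: likelihood of an observation sequence under a discrete
// HMM. Gesture classification can evaluate this for each trained model and
// pick the model with the highest likelihood. Generic textbook formulation.
#include <vector>

struct HMM {
    std::vector<double> pi;                       // initial state probabilities [N]
    std::vector<std::vector<double>> A;           // state transition probabilities [N][N]
    std::vector<std::vector<double>> B;           // emission probabilities [N][K]
};

double sequenceLikelihood(const HMM& hmm, const std::vector<int>& obs)
{
    const std::size_t N = hmm.pi.size();
    std::vector<double> alpha(N), next(N);
    for (std::size_t i = 0; i < N; ++i)            // initialization with the first symbol
        alpha[i] = hmm.pi[i] * hmm.B[i][obs[0]];
    for (std::size_t t = 1; t < obs.size(); ++t) { // induction over the sequence
        for (std::size_t j = 0; j < N; ++j) {
            double s = 0.0;
            for (std::size_t i = 0; i < N; ++i)
                s += alpha[i] * hmm.A[i][j];
            next[j] = s * hmm.B[j][obs[t]];
        }
        alpha = next;
    }
    double likelihood = 0.0;                      // termination: sum over final states
    for (double a : alpha) likelihood += a;
    return likelihood;                            // (log-scaling omitted for brevity)
}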
Until now, this review has focused on human body motion activity recognition. Prior literature has also investigated analyzing the motion of multiple users for activity recognition. This work has particular application in sport, where computer vision techniques are applied to analyze complex and dynamic sport scenes for the purpose of team activity recognition, for use in applications such as tactic analysis and team statistic evaluation (useful for coaches and trainers), video annotation and browsing, automatic highlight identification, automatic camera control (useful for broadcasting), etc. For group activity recognition, a specific group action may involve multiple people performing significantly different actions individually. This makes it difficult to find effective descriptors for group actions. Some of the techniques in this area described in the literature are of interest to researchers in REVERIE, as they make use of multi-camera systems to overcome the limitations of single moving or static camera systems, such as occlusions or inaccurate subject localization. Interesting papers in this area include the work of Intille and Bobick (2001), who propose Bayesian belief networks for probabilistically representing and recognizing agent goals from noisy trajectory data of American football players. Blunsden et al. (2006) present both global and individual model approaches for classifying coordinated group activities in handball games. Perse et al. (2010) and Hervieu et al. (2009) propose trajectory-based approaches to the automatic recognition of multi-player activity behaviour in basketball games and team activities in handball games, respectively. Finally in this section, we review state-of-the-art approaches for the automatic recognition of facial expressions and actions using computer-vision-based approaches. For human beings, facial expression is one of the most powerful and natural ways to communicate emotions and intentions. As such, it is important to review such approaches within REVERIE. Although a human being can detect facial expressions without effort, automatic facial expression recognition using computer vision techniques is a challenging problem. There are three sub-problems in designing an automatic facial expression recognition system: face detection, extraction of the facial expression information, and classification of the expression. A system that performs these operations accurately and in real time would be crucial to achieving human-like interaction between man and machine. Many techniques adopt the Facial Action Coding System (FACS) to encode expressions, which are then used as features for expression classification. Using the FACS system, facial expressions can be decomposed in terms of 46 component movements, which roughly correspond to the individual facial muscles. (Bartlett et al. 2006) and (Braathen et al. 2002) present machine learning approaches, using support vector machines, to automatically assign FACS action unit intensities to input faces. Although most research focuses on the extraction of features from visual-spectrum cameras, Busso et al. (2004) discuss an alternative approach where multi-modal data including speech is employed and four main emotions are determined (anger, sadness, happiness, neutral). Finally, a review of works in this area is presented in (Patil et al. 2010).

Usefulness to the project
Activity recognition is integral to two distinct aspects of the REVERIE framework. Firstly, determining human activity from full-body motion in a virtual environment: as opposed to accurately rendering the volumetric representation of a subject, accurately determining the human activity behind the motion is essential in order to correctly interpret a user's movements, desires and requirements in the system. Determining human activity and requirements significantly increases the ability of characters to interact efficiently with each other and with the environment surrounding them. In addition, this data can be fed into an animation module to allow each character in the environment to be animated in a realistic manner.
Secondly, this area also covers extracting and determining the affect, emotion and interest of users from their face and body motion. This characterization of users' emotional states and expressive intents is critical for WP5, as this information will be required as an essential input for the artificial intelligence (AI) modules of autonomous avatars. As an input to other modules in the framework, it will also allow a significant increase in the accuracy of the rendering of virtual characters' facial features, so that the determined emotional states can be efficiently rendered on virtual user avatars.

1.5. 3D reconstruction
The algorithms for 3D reconstruction can be classified according to the following diagrams. In the first classification approach, see Figure 29(a), existing 3D reconstruction methodologies are classified based on the target application. The target application specifies the requirements on reconstruction accuracy and time. For example, the reconstruction of cultural-heritage objects requires very high accuracy. In REVERIE Task 4.3 (Capturing and Reconstruction), the 3D reconstruction of moving humans and foreground objects for tele-immersive applications demands real-time processing. In a second categorization approach, see Figure 29(b), the 3D reconstruction methodologies can be classified based on the employed sensing equipment. This can range from inexpensive passive sensors, such as multiple RGB cameras, to more expensive active devices (laser scanners, TOF cameras, range scanners, etc.). In Task 4.3 (Capturing and Reconstruction), multiple RGB cameras, as well as inexpensive active sensors such as Kinect sensors, will be employed. Thirdly, see Figure 29(c), reconstruction methods can be classified with respect to the resulting reconstruction type into a) volumetric and b) surface-based approaches. Volumetric methods (Curless and Levoy, 1996; Kutulakos and Seitz, 2000; Matsuyama et al. 2004) are based on a discretization (voxelization) of the 3D space. They are robust, but they are either prone to aliasing artifacts (with low-resolution discretizations) or require increased computational effort (with high-resolution discretizations). On the other hand, surface-based methods (Turk and Levoy, 1994; Matusik et al. 2001; Franco and Boyer, 2009) explicitly calculate the surfaces from the given data. A more detailed classification scheme of reconstruction algorithms is given in Figure 29(d). Multi-view based 3D reconstruction approaches are classified into a) methods that are not based on optimization (Kutulakos and Seitz, 2000; Matsuyama et al. 2004; Matusik et al. 2001; Franco and Boyer, 2009), and b) optimization-based methods (Boykov and Kolmogorov, 2003; Paris et al. 2006; Zeng et al. 2007; Pons et al. 2007; Vogiatzis et al. 2007; Kolev et al. 2009), i.e. methods that minimize a certain cost function that incorporates photo-consistency and smoothness constraints. In the former category, i) silhouette-based approaches extract the foreground objects' silhouette in each image and construct the objects' "visual hull" (the intersection of the silhouettes' visual cones) (Matsuyama et al. 2004; Matusik et al. 2001; Franco and Boyer, 2009). They are simple, fast and robust, but they lack reconstruction accuracy (they are not able to reconstruct fine details and especially concavities). In another subcategory, ii) voxel-coloring or space-carving techniques (Kutulakos and Seitz, 2000; Yang et al. 2003) recover the objects' "photo hull" (the shape that contains all possible photo-consistent reconstructions) by sequentially eroding photo-inconsistent voxels in a plane-sweeping framework. Space-carving techniques are generally more accurate than silhouette-based approaches, but are slower and less robust.
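A minimal sketch of the volumetric silhouette-intersection idea behind visual-hull methods is given below: every voxel of a bounding volume is projected into each camera's silhouette image and is kept only if all projections fall on the foreground. The camera model and silhouette layout are simplified placeholder assumptions; real systems (e.g. Matsuyama et al. 2004) use a more efficient plane-based intersection and GPU implementations.

// Voxel-based visual hull: a voxel survives only if its projection lies inside
// the foreground silhouette of every camera. Simplified illustrative sketch.
#include <vector>
#include <cstdint>

struct Silhouette {                       // binary foreground mask for one camera
    int width, height;
    std::vector<std::uint8_t> mask;       // 1 = foreground, row-major
    bool isForeground(int u, int v) const {
        if (u < 0 || v < 0 || u >= width || v >= height) return false;
        return mask[v * width + u] != 0;
    }
};

struct Camera {                           // simple pinhole model (world -> pixel)
    float R[3][3]; float t[3];            // world-to-camera rotation and translation
    float fx, fy, cx, cy;                 // intrinsics
    bool project(float x, float y, float z, int& u, int& v) const {
        const float xc = R[0][0]*x + R[0][1]*y + R[0][2]*z + t[0];
        const float yc = R[1][0]*x + R[1][1]*y + R[1][2]*z + t[1];
        const float zc = R[2][0]*x + R[2][1]*y + R[2][2]*z + t[2];
        if (zc <= 0.f) return false;      // point behind the camera
        u = static_cast<int>(fx * xc / zc + cx);
        v = static_cast<int>(fy * yc / zc + cy);
        return true;
    }
};

std::vector<std::uint8_t> carveVisualHull(const std::vector<Camera>& cams,
                                          const std::vector<Silhouette>& sils,
                                          int res, float voxelSize, const float origin[3])
{
    std::vector<std::uint8_t> occupied(static_cast<std::size_t>(res) * res * res, 1);
    for (int k = 0; k < res; ++k)
        for (int j = 0; j < res; ++j)
            for (int i = 0; i < res; ++i) {
                const float x = origin[0] + (i + 0.5f) * voxelSize;
                const float y = origin[1] + (j + 0.5f) * voxelSize;
                const float z = origin[2] + (k + 0.5f) * voxelSize;
                for (std::size_t c = 0; c < cams.size(); ++c) {
                    int u, v;
                    if (!cams[c].project(x, y, z, u, v) || !sils[c].isForeground(u, v)) {
                        occupied[(static_cast<std::size_t>(k) * res + j) * res + i] = 0;
                        break;            // outside one visual cone: carve the voxel
                    }
                }
            }
    return occupied;   // Marching Cubes can then extract a surface from this grid
}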
Figure 29: Classification of 3D reconstruction techniques: (a) based on the target application; (b) based on the sensing technology; (c) based on the resulting reconstruction type; (d) based on the algorithmic approach.
In the second category, optimization-based approaches introduce an objective function to be minimized/maximized, which (apart from photo-consistency) incorporates additional constraints, mainly smoothness constraints on the reconstructed surface. Various mathematical tools have been employed for optimization, such as i) active contour models (e.g. snakes and level-sets (Pons et al. 2007; Jin et al. 2005)) and ii) graph-cut methods (Boykov and Kolmogorov, 2003; Paris et al. 2006; Vogiatzis et al. 2007). Furthermore, these methods can be classified based on whether they apply i) global (Boykov and Kolmogorov, 2003; Vogiatzis et al. 2007; Tran and Davis, 2006) or ii) local optimization (Zeng et al. 2007; Furukawa and Ponce, 2009a). Optimization-based approaches can produce highly accurate reconstructions. However, they are unsuitable for real-time reconstruction applications, such as tele-immersion, because the required computation time is very high, ranging from several minutes to hours. In another category, reconstruction is achieved by the fusion of dense depth maps, which are produced either by i) active direct-ranging sensors (Curless and Levoy, 1996; Turk and Levoy, 1994; Soucy and Laurendeau, 1995) or ii) by passive stereo camera pairs (Kanade et al. 1999; Mulligan and Daniilidis, 2001; Merrell et al. 2007; Vasudevan et al. 2011). Most of the methods in (i) present relatively high accuracy, but they work off-line to combine range data generated at different time instances. In (ii), some methods achieve fast, near real-time reconstruction: Kanade et al. (1999), using a large number of distributed cameras, achieved very fast, but not real-time, full-body reconstruction (less than a single frame per second). Mulligan and Daniilidis (2001) used multiple camera triplets for image-based reconstructions of the upper human body, in one of the first tele-immersion systems. In a more recent work (Merrell et al. 2007), multiple depth maps are computed in real-time from a set of images captured by moving cameras, and a viewpoint-based approach is used for quick fusion of the stereo depth maps. In a state-of-the-art work (Vasudevan et al. 2011), a method for the creation of highly accurate textured meshes from a stereo pair is presented. Exploiting multiple stereo pairs, multiple meshes are generated in real-time, which are combined to synthesize high-quality intermediate views for given viewpoints. However, the authors in (Vasudevan et al. 2011) do not address the way the separate meshes could be merged to produce a single complete 3D mesh, rather than intermediate views.
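Whatever the sensor, depth-map fusion starts by lifting each depth map into a common 3D frame. The sketch below back-projects a depth image into a 3D point cloud with a pinhole model; the intrinsics and the depth encoding (metres in a row-major float image) are illustrative assumptions rather than the conventions of any particular system above.

// Back-project a depth map into a 3D point cloud using a pinhole camera model;
// this is the first step of any depth-map fusion pipeline.
#include <vector>
#include <array>

struct Intrinsics { float fx, fy, cx, cy; };

std::vector<std::array<float, 3>>
backProject(const std::vector<float>& depth, int width, int height,
            const Intrinsics& K)
{
    std::vector<std::array<float, 3>> points;
    points.reserve(static_cast<std::size_t>(width) * height);
    for (int v = 0; v < height; ++v) {
        for (int u = 0; u < width; ++u) {
            const float z = depth[v * width + u];
            if (z <= 0.f) continue;                     // no measurement at this pixel
            const float x = (u - K.cx) * z / K.fx;      // pinhole back-projection
            const float y = (v - K.cy) * z / K.fy;
            points.push_back({x, y, z});                // point in the camera frame
        }
    }
    return points;   // apply the camera-to-world transform before fusing views
}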
The Microsoft Kinect sensor, released in November 2010, has attracted the attention of many researchers due to its ability to produce accurate depth maps at a low price. During 2011, many Kinect-based applications appeared on the internet, including 3D reconstruction applications. However, most Kinect-based 3D reconstruction approaches combine depth data captured by a single Kinect from multiple view-points to produce off-line reconstructions of 3D scenes. Only a few official Kinect-based works have been published so far (Zollhoefer et al. 2011; Izadi et al. 2011). In (Zollhoefer et al. 2011), personalized avatars are created from a single RGB image and the corresponding depth map, captured by a Kinect sensor. In (Izadi et al. 2011) (which is more relevant to Task 4.5), the problem of dynamic 3D indoor scene reconstruction is addressed by the fast fusion of multiple depth scans, captured by a single hand-held Kinect sensor. Currently, IT/TPT works on real-time 3D reconstruction of moving humans and foreground objects from multiple Kinects and has obtained promising results (http://www.reveriefp7.eu/resources/demos/). The relevant work is described briefly later in this document. IT/TPT aims at optimizing the approach during the project. Below, the most promising state-of-the-art approaches for fast 3D reconstruction are given.

Silhouette-based approach for 3D reconstruction and efficient texture mapping (Matsuyama et al. 2004)
Description: A fast volumetric silhouette-based approach for 3D reconstruction and efficient texture mapping. The method is described in Figure 30 and consists of the following stages: i) silhouette extraction (background subtraction), as shown in the upper row of Figure 30(a); ii) silhouette volume intersection: each silhouette is back-projected into the common 3D volumetric space and the visual cones are intersected to produce the "visual hull" volumetric data, which is accomplished using a fast plane-based intersection method (see Figure 30(b)); iii) application of the Marching Cubes algorithm (Lorensen and Cline, 1987) to convert the voxel representation to a surface representation; iv) texture mapping. The proposed intersection method ensures real-time generation of the volumetric data, but the authors state that their Marching Cubes implementation is time-consuming and that their overall method cannot run in real-time. However, IT/TPT is aware of and has already used a real-time Marching Cubes implementation in NVidia's CUDA. With this implementation, IT/TPT expects to realize very fast executions of the underlying method.
Dependencies on other technology: Multiple RGB cameras
Figure 30: Outline of the silhouette-based method from Matsuyama et al. 2004; images taken from Matsuyama et al. 2004.

Exact Polyhedral Visual Hulls (EPVHs) (Franco and Boyer, 2009)
Description: A fast and efficient silhouette-based approach for 3D reconstruction, depicted in Figure 31(a). The methodology is surface-based rather than volumetric. It computes the visual-hull surface directly, in the form of a polyhedral mesh. It relies on a small number of geometric operations to compute the visual hull polyhedron in a single pass. The algorithm combines the advantages of being fast, producing pixel-exact surfaces, and yielding manifold and watertight polyhedra. Some results are given in Figure 31(c).
These results were produced from 19 views. The algorithm is very fast, but still not real-time; the average computation time is 0.7 seconds per frame on a single 3 GHz PC with 3 GB RAM. The source code of the EPVH algorithm was freely distributed until 2007 (http://perception.inrialpes.fr/~Franco/EPVH). However, the authors established a company (4D View Solutions - http://www.4dviews.com) producing solutions that are based on the EPVH software, see Figure 31(b).
Dependencies on other technology: Multiple RGB cameras
Figure 31: EPVHs (Franco and Boyer, 2009): (a) EPVH; (b) 4D View Solutions product, using EPVH; (c) published results of EPVHs from 19 views (average computation time: 0.7 seconds per frame on a single 3 GHz PC with 3 GB RAM).

Viewpoint-based approach for quick fusion of multiple stereo depth maps (Merrell et al. 2007)
Description: A viewpoint-based approach for quick fusion of multiple stereo depth maps. Depth maps are computed in real-time from a set of multiple images, using plane-sweeping stereo. The depth maps are fused within a visibility-based framework and the methodology is applied to viewpoint rendering of scenes. The method was also tested for 3D reconstruction on the Multi-View Stereo Evaluation dataset (http://vision.middlebury.edu/mview/), presenting high accuracy, yet requiring an execution time of a few seconds per frame, even with a GPU implementation.
Dependencies on other technology: Multiple RGB cameras

3D Tele-Immersion (3DTI) system (Vasudevan et al. 2011)
Description: A high-quality full system for 3DTI. A method for the creation of highly accurate textured meshes from a stereo pair, in real-time. Exploiting multiple stereo pairs, multiple high-quality meshes are generated in real-time, which are combined to synthesize intermediate views for given viewpoints. However, the authors do not address the way that the separate meshes could be merged into a single, complete 3D mesh.
Dependencies on other technology: Multiple stereo cameras
Figure 32: Synthesis of high-quality intermediate views for given viewpoints; image taken from (Vasudevan et al. 2011). (a) The intermediate-view composition approach, given multiple triangular meshes. (b) Result of the approach.

Kinect Fusion (Izadi et al. 2011)
Description: The problem of dynamic 3D indoor scene reconstruction is addressed by the fast fusion of multiple depth scans, captured with a single hand-held Kinect sensor. It is one of the highest-quality works with the Kinect. However, it mainly addresses the problem of 3D indoor scene reconstruction, by capturing the scene from multiple view-points with a single device, and is more relevant to Task 4.5 (3D User-Generated Worlds from Video/Images). http://research.microsoft.com/en-us/projects/surfacerecon/
Dependencies on other technology: Kinect sensor

Fast silhouette-based 3D reconstruction (visual hulls)
Description: IT/TPT has experience in silhouette-based 3D reconstruction and has implemented relevant algorithms in C++, exploiting CUDA to achieve (near) real-time reconstruction.
Dependencies on other technology: Multiple cameras
Figure 33: Near real-time silhouette-based 3D reconstruction of humans.

Real-time 3D reconstruction of humans from multiple Kinects
Description: IT/TPT currently works on the real-time construction of full 3D meshes of humans and foreground objects, and has obtained promising results.
The approach is based on a) capturing RGB and depth images from multiple Kinects (30 fps with a single Kinect, ~20 fps with 4 Kinects), b) construction of separate meshes from the separate depth maps, c) alignment of the meshes, and d) fusion (zippering) of the meshes to produce a single combined mesh. http://www.reveriefp7.eu/resources/demos/
Dependencies on other technology: Multiple Kinect sensors, OpenNI API
Figure 34: Real-time 3D reconstruction from 4 Kinects

A Global Optimization Approach to High-detail Reconstruction of the Head (Schneider et al. 2011)
Description: An approach for reconstructing head-and-shoulder portraits of people from calibrated stereo images with a high level of geometric detail. In contrast to many existing systems, these reconstructions cover the full head, including hair. This is achieved using a global intensity-based optimization approach, which is stated as a parametric warp estimation problem and solved in a robust Gauss-Newton framework. A computationally efficient warp function for mesh-based estimation of depth is formulated, which is based on a well-known image-registration approach and adapted to the problem of 3D reconstruction. The use of sparse correspondence estimates for initializing the optimization is addressed, as well as a coarse-to-fine scheme for reconstructing without specific initialization. Issues of regularization and brightness-constancy violations are discussed, and various results demonstrate the effectiveness of the approach.
Figure 35: Very high detail reconstructions (rendered depth maps), computed with the coarse-to-fine scheme without an initial shape estimate. Fine structures become visible on the face as well as on clothing.

Multiple View Segmentation and Matting (Kettern et al. 2011)
Description: A robust and fully automatic method for extracting a highly detailed, transparency-preserving segmentation of a person's head from multiple-view recordings, including a background plate for each view. Trimaps containing a rough segmentation into foreground, background and unknown image regions are extracted, exploiting the visual hull of an initial foreground-background segmentation. The background plates are adapted to the lighting conditions of the recordings, and the trimaps are used to initialize a state-of-the-art matting method adapted to a setup with precise background plates available. From the alpha matte, foreground colours are inferred for realistic rendering of the recordings onto novel backgrounds.
Figure 36 (from left to right): Background, object, bi-map, tri-map, alpha matte, realistic foreground rendering of the object.

Model-Based Camera Calibration Using Analysis by Synthesis Techniques (Eisert, 2002)
Description: A new technique for the determination of extrinsic and intrinsic camera parameters. Instead of searching for a limited number of discrete feature points of the calibration test object, the entire image captured with the camera is exploited to robustly determine the unknown parameters. The shape and texture of the test object are described by a 3D computer graphics model. With this 3D representation, synthetic images are rendered and matched with the original frames in an analysis-by-synthesis loop. Therefore, arbitrary test objects with sophisticated patterns can be used to determine the camera settings. The scheme can easily be extended to deal with multiple frames for higher intrinsic parameter accuracy.
Figure 37: Two different calibration test objects (left) and visualization of the perspective projection.

Alcatel-Lucent Stereo Matching Software
Description: The proprietary Alcatel-Lucent stereo matching software uses two video cameras as input. It uses a non-disclosed set of algorithms to efficiently and qualitatively compute the depth map of the scene captured by the video cameras. http://www.alcatel-lucent.com

Telecom ParisTech CageR System
Description: The proprietary Telecom ParisTech CageR system allows the reconstruction of an animated cage from an animated mesh to provide editing, deformation transfer, compression and motion manipulation on raw performance capture data. http://www.tsi.telecom-paristech.fr/cg
Dependencies on other technologies: Telecom ParisTech space deformation tool set.

1.6. 3D User-Generated Worlds from Video/Images
The computer vision community has put a lot of effort into developing new approaches for modelling and rendering complex scenes from a collection of images/videos. While a few years ago the main applications were dedicated to robot navigation and visual inspection, potential fields of application for 3D modelling and rendering now include computer graphics, visual simulation, VR, computer games, telepresence, communication, art and cinema. Additionally, there is an increasing demand for rendering new scenes from (uncalibrated) images acquired by ordinary users with simple devices. SfM techniques are used extensively to extract the 3D structure of the scene as well as the camera motion by analyzing an image sequence. Most of the existing SfM techniques are based on the establishment of reliable correspondences between two or multiple images, which are then used to compute the corresponding 3D scene points by using a series of computer vision techniques such as camera calibration (Curless and Levoy, 1996), structure reconstruction and BA (Kutulakos and Seitz, 2000). In most cases, it is impossible to detect image correspondences by comparing every pixel of one image with every pixel of the next, because of the high combinatorial complexity and computational cost. Hence, local-scale image features are extracted from the images and matched at this scope. Feature matching techniques can be divided into two categories: narrow- and wide-baseline. Though dense short-baseline stereo matching is well understood (Triggs et al. 1999; Pollefeys et al. 1999), its wide-baseline counterpart is much more challenging, since it must handle images with large perspective distortions, extended occluded areas and lighting changes. On the other hand, it can yield more accurate depth estimates while requiring fewer images to reconstruct a complete scene. The main idea of wide-baseline two-image matching is to extract local invariant features independently from the two images, characterise them by invariant descriptors and finally build up correspondences between them. The most influential local descriptor is SIFT (Lowe, 2004), where key locations are defined as maxima and minima of the result of difference-of-Gaussian functions applied in scale-space to a series of smoothed and re-sampled images. Dominant orientations are assigned to the localized keypoints. Finally, SIFT descriptors are obtained by considering pixels within a radius of the key location, and blurring and resampling local image orientation planes. GLOH (Mikolajczyk and Schmid, 2004) is an extension of the SIFT descriptor, designed to increase its robustness and distinctiveness. RIFT (Lazebnik et al. 2004), a rotation-invariant generalization of SIFT, is constructed using circular normalized patches divided into concentric rings of equal width; within each ring a gradient orientation histogram is computed. The more recent SURF descriptor (Bay et al. 2006) is computationally efficient with respect to computing the descriptor's value at every pixel, but it introduces artefacts that degrade the matching performance when used densely. Recently, the DAISY descriptor (Tola et al. 2010) was proposed, which retains the robustness of SIFT and GLOH, while it can be computed quickly at every single image pixel, like SURF.
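To build correspondences from such descriptors, a common baseline (used, for example, in Lowe, 2004) is brute-force nearest-neighbour matching with a distance-ratio test that rejects ambiguous matches before any geometric verification. The sketch below is a generic illustration over fixed-length descriptor vectors, not a specific library's matcher; the 0.8 threshold is a typical choice rather than a prescribed value.

// Brute-force nearest-neighbour matching of feature descriptors with a
// distance-ratio test: a match is kept only if the best candidate is clearly
// better than the second best. Generic sketch over plain descriptor vectors.
#include <vector>
#include <limits>
#include <utility>

using Descriptor = std::vector<float>;

static float sqDist(const Descriptor& a, const Descriptor& b)
{
    float d = 0.f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Returns index pairs (i in A, j in B) of accepted matches.
std::vector<std::pair<int, int>>
matchDescriptors(const std::vector<Descriptor>& A,
                 const std::vector<Descriptor>& B,
                 float ratio = 0.8f)
{
    std::vector<std::pair<int, int>> matches;
    for (std::size_t i = 0; i < A.size(); ++i) {
        float best = std::numeric_limits<float>::max(), second = best;
        int bestIdx = -1;
        for (std::size_t j = 0; j < B.size(); ++j) {
            const float d = sqDist(A[i], B[j]);
            if (d < best)        { second = best; best = d; bestIdx = static_cast<int>(j); }
            else if (d < second) { second = d; }
        }
        if (bestIdx >= 0 && best < ratio * ratio * second)   // ratio test on squared distances
            matches.emplace_back(static_cast<int>(i), bestIdx);
    }
    return matches;   // remaining outliers are then removed with RANSAC-style verification
}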
In order to establish reliable correspondences between image descriptors and remove outliers, feature matching algorithms are exploited. Traditional approaches are mainly based on extensions (Philbin et al. 2007; Chum and Matas, 2008) of the RANSAC algorithm (Fischler and Bolles, 1981) or extensions (Lehmann et al. 2010) of the Hough transform (Ballard, 1981). The sets of image correspondences are exploited by SfM algorithms in order to recover the unknown 3D scene structure and retrieve the poses of the cameras that captured the images. So far, the most common data sources for large SfM problems have been video and structured survey datasets, which are carefully calibrated and make use of surveyed ground control points. Recently, there has been growing interest in using community photo collections such as Flickr, which are unstructured and weakly calibrated, since they are captured by different users. As the number of photos in such collections continues to grow into the hundreds or even thousands, the scalability of SfM algorithms becomes a critical issue. With the exception of a few efforts (Mouragnon et al. 2009; Ni et al. 2007; Agarwal et al. 2009a; Agarwal et al. 2009b; Li et al. 2008; Schaffalitzky and Zisserman, 2002; Snavely et al. 2006), the development of large-scale SfM algorithms has not yet received significant attention. Existing techniques performing SfM from unordered photo collections (Agarwal et al. 2009a; Agarwal et al. 2009b; Li et al. 2008; Schaffalitzky and Zisserman, 2002; Snavely et al. 2006) make heavy use of nonlinear optimization, which depends strongly on initialization. These methods run iteratively, starting with a small set of photos, then incrementally adding photos and refining 3D points and camera poses. Though these approaches are generally successful, they suffer from two significant disadvantages: a) they have increased computational cost, and b) the final SfM result depends on the order in which the photos are considered. This sometimes leads to failures because of mis-estimation of the camera poses or drifting into bad local minima. Recent methods have exploited clustering and graphs to minimize the number of images considered in the SfM process (Agarwal et al. 2009a; Agarwal et al. 2009b; Li et al. 2008; Bajramovic and Denzler, 2008; Snavely et al. 2008; Verges-Llahi et al. 2008). However, the graph algorithms can be costly and may not provide a robust solution. Factorization methods (Soucy and Laurendeau, 1995; Verges-Llahi et al. 2008) that attempt to solve the SfM problem in a single batch optimization are difficult to apply to perspective cameras with significant outliers and missing data (both of which are present in Internet photo collections).
Available tools from the literature and REVERIE partners
Below, the most promising state-of-the-art approaches for generating 3D data from video/image input are provided.

Large-scale multi-view stereo matching (Tola et al. 2011)
Description: A novel approach for large-scale multi-view stereo matching, designed to exploit ultra-high-resolution image sets in order to efficiently compute dense 3D point clouds. Based on the robust DAISY descriptor (Tola et al. 2010), it performs fast and accurate matching between high-resolution images, which limits the computational cost compared to other algorithms. Experimental results showed that this algorithm is able to produce 3D point clouds containing virtually no outliers. This makes it exceedingly suitable for large-scale reconstruction.
Figure 38: Lausanne Cathedral aerial reconstruction (Tola et al. 2011).

Structure-from-Motion method for unstructured image collections (Crandall et al. 2011)
Description: An innovative SfM method for unstructured image collections, which considers all the photos at once rather than incrementally building up a solution. This method is faster than current incremental BA approaches and more robust to reconstruction failures. The approach computes an initial estimate of the camera poses using all available photos, and then refines that estimate and solves for scene structure using BA.
Figure 39: Central Rome reconstruction (Crandall et al. 2011).

Reconstruction of accurate surfaces (Jancosek and Pajdla, 2011)
Description: A novel method for accurately reconstructing surfaces that are not photo-consistent but represent real parts of a scene (e.g. low-textured walls, windows, cars), in order to achieve complete large-scale reconstructions. The importance of these surfaces for complete 3D reconstruction is exhibited on several real-world data sets.
Figure 40: A 3D reconstruction before (left) and after (right) applying the method in (Jancosek and Pajdla, 2011).

Fast automatic modeling of large-scale environments from videos (Frahm et al. 2010)
Description: This approach tackles the active research problem of fast automatic modeling of large-scale environments from videos with millions of frames and from collections of tens of thousands of photographs downloaded from the Internet. The approach leverages recent research in robust estimation, image-based recognition and stereo depth estimation; it achieves real-time reconstruction from video and reconstructs from tens of thousands of downloaded images within less than a day on a single commodity computer.

2. Related Tools to WP5: Networking for immersive communication
2.1. Network Architecture
Many types of network architectures for added services in the internet exist. Research is conducted on overlay networking, content-based networking and P2P structures, all aiming to make the internet handle real-time multimedia traffic streams in a more optimal way. REVERIE will be the first overlay network architecture that allows real-time interactive 3D video over the internet, mixed with traditional media delivery techniques. REVERIE will provide the overlay and synchronization mechanisms for multiple correlated streams that can serve as a reference for future internet architectures that enable 3DTI.
Usefulness to the project
This task helps to create an overview of the WP5 tasks and facilitates their integration. The architectural considerations themselves are also a research area, as the deployment of a proper architecture can greatly enhance system performance.

Available tools from the literature and REVERIE partners
This first version focuses on the state of the art for Tasks 5.1, 5.4 and 5.6. The issue of resource management is also important to determine the overall network architecture (5.1) and the overall REVERIE architecture (2.2). The proposed architecture is shown in Figure 41. The architecture consists of two clouds that serve two different types of communication purposes. The left cloud emphasizes real-time interactive communication, where the delay from sender to receiver should be minimal. The Content Distribution Network (CDN) based media distribution cloud provides enhanced ways to distribute pre-generated media content to different receivers. The first cloud will encompass Tasks 5.1, 5.4 and 5.6, while the second cloud is more related to Tasks 5.1, 5.2 and 5.5 and will rely on previous approaches from the COAST project.

Figure 41: System architecture. The real-time interactive communication cloud contains the media composition server (audio and real-time video scene composition), the avatar engine, signalling and synchronization for active participants; the CDN-based media distribution cloud contains the REVERIE upload tool and publishing front-end, media cloud, QoS monitor, federated cache and admission control, serving cached media, audio, video and images to active participants and passive viewers.

Figure 41 shows the currently envisioned architecture for REVERIE. Currently in REVERIE, four types of traffic are defined: signaling data, avatar control traffic, real-time media (audio/video) for conversation, and cached/stored media such as videos or pictures. Active participants can stream their captured content after compression and networking decisions. All this is done in a fashion that adapts to the current network and REVERIE conditions. Signaling data includes data for starting/ending sessions and monitoring/reporting transmission statistics from clients. The avatar control traffic is generated from signaling related to the autonomous avatars in REVERIE, based on their movement, emotions, actions and graphics. The REVERIE architecture adopts the approach from COAST for publishing media through the publisher front-end, with optimized caching and naming schemes. A user that decides to upload their own content, or content referred to by a URL somewhere on the internet, currently does so via the REVERIE upload tool, which subsequently uses the publishing front-end to enable naming and caching functionalities. The scene composition for real-time interaction currently consists of both the video compositor and the audio compositor. To ensure synchronization, they are represented as a single block. The powerful composition server, including audio and video decomposition, executes complex media operations. While its exact functionalities are not yet clearly specified and described, this server is already anticipated/expected to be a bottleneck of the REVERIE application.
Therefore, a distributed implementation of this server in the network is a possibility at a later stage; however, the consequences for synchronization and real-time capabilities have to be clearly assessed. Another architectural decision was taken to limit the scope: security issues from the network architecture perspective are not taken into account in REVERIE, but are left for future research. Currently, REVERIE supports two different use cases. The first use case is more geared towards the use of autonomous avatars, multiple users and web interfaces. The second use case aims at fewer users, but with more capacity for 3D video stream traffic. The architecture for both use cases is the same, but the implementation of the components might be different.

Table 1: Component development in use case 1 and use case 2
Use Case | Composition Server | Avatar Control | Cached media | Signaling | Monitoring
1 | Light | Heavy | Heavy | Heavy | Light
2 | Heavy | None/Light | optional | Light | Heavy

Current architectural decisions for the REVERIE network architecture:
1. An overlay network architecture is defined for REVERIE following the recommendations of the Future Media Internet Architecture Think Tank (FMTIA). This constitutes an overlay on top of the current internet that can be separated into two media-aware clouds, one focusing on interactive communication and the other on content distribution.
2. The real-time aware cloud supports real-time immersive 3D communication, signaling and monitoring support for sessions, and real-time video and audio composition.
3. The media-distribution aware cloud supports caching and efficient naming schemes to enable efficient delivery of third-party or user-uploaded content for immersion in 3D scenes.
4. Support for different terminal capabilities, i.e. scalable terminals. Currently, three types of terminals are defined for participants, plus an optional viewer terminal.
5. The architecture features independent components for monitoring, signaling, media composition and avatar reasoning, which can be usefully applied in different application areas.
6. Six types of communication traffic are defined: audio, video, avatar control, session control/monitoring, streams of cached content, and upload data.
7. A powerful centralized server for composition of audio and video in the scene representation.
8. No implementation of security issues arising in the network.
9. The architecture supports both use cases; components can be implemented differently according to their relevance to each use case.

2.1.1. REVERIE future internet network overlay
Currently, a debate on how the internet should evolve to support novel multimedia applications is ongoing. One important topic is how the internet can efficiently support video content delivery. Better client/server software alone does not solve this, because application bandwidth requirements are the bottleneck. Such scenarios demand changes in the network infrastructure to achieve their aims (bandwidth allocation is the main network issue, but security, session control and intellectual property management also play a role in the network). Enabling real-time interactive applications, such as video conferencing and possibly 3D immersion as envisaged in REVERIE, may also need support from the underlying network. Massive distribution of video to multiple users has been a challenge for the internet; massive streaming of 3D videos will pose even larger challenges.
Clearly, REVERIE can provide great insights to aid the development of the future internet architecture to enable real-time 3D video. In fact, the network architecture developed in REVERIE can be interpreted as a future internet architecture for 3D tele-immersion based on the concepts from FMTIA. To illustrate the concepts described by FMTIA, we first discuss the architecture they propose. Figure 42 shows the high-level architecture developed by FMTIA. It supports overlay networks (the blue areas in the figure) to provide novel services and functionalities. An overlay network refers to a network deployed on top of an existing network. In REVERIE, an overlay for 3DTI has to be deployed that provides scalability, low latency and bandwidth adaptation.
Figure 42: FMTIA future internet overlay
These application-aware overlays consist of nodes (edge routers, home gateways, terminal devices and caches) and overlay links between them. Note that one link in the overlay network can span multiple links in the underlying network (the regular internet). Between these nodes a network can be dynamically constructed, usually based on the application needs and the conditions of the underlying network (as far as these are known). For achieving this aim, the topology of the constructed overlay is important (as in regular networking), and algorithms/heuristics to dynamically find topologies are also of interest. We give a list of relevant topologies/algorithms/heuristics below:
- Client-server: In REVERIE, all real-time data is relayed through the composition server. Software like Skype and QQ also uses such an approach. Its advantage is that firewall blocks are more easily avoided and there is centralized control. However, in the case of 3DTI the burden of video streaming and processing can become too large when there are multiple video streams.
- Full mesh: All peers are connected and send data directly to each other, which generally results in lower end-to-end latency. However, because all terminals send data to each other, bandwidth can become a bottleneck due to high node degrees (a high number of outgoing links).
- Hybrid scheme: Within a hybrid scheme there is more than one application-aware node that forwards data (i.e. more than in the client-server approach). This can alleviate the link bandwidth. The main difference is that the number of forwarding nodes is fixed. A CDN for video distribution with caches in the overlay network can be seen as an example of a hybrid scheme.
- Application-level multicast: Here the overlay network is dynamically constructed between a not necessarily limited number of application-aware nodes. Paths between terminals can be determined using different algorithms and heuristics such as Kruskal's minimum spanning tree, minimum latency tree, the delay variation multicast algorithm and others (a minimal example is sketched below).
For REVERIE, the currently defined architecture for real-time interaction between participants is a client-server model. While this approach is not really scalable to support heavy streams with many users, it has some practical advantages such as centralized functionality and avoiding firewall policies at end terminals. When the core functionality of the system performs well, peer-to-peer opportunities and more complex overlays can eventually be studied to improve scalability and communication latency.
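To make the application-level multicast idea above concrete, the following minimal sketch builds a minimum-latency spanning tree over a handful of overlay nodes with Kruskal's algorithm. The node names and link latencies are hypothetical placeholders; a real REVERIE overlay would feed in measured network conditions and would likely use a more refined heuristic.

```python
# Minimal sketch: build a multicast overlay tree with Kruskal's algorithm.
# Node names and link latencies are hypothetical placeholders.

def kruskal_mst(nodes, links):
    """links: list of (latency_ms, node_a, node_b); returns the tree edges."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    for latency, a, b in sorted(links):
        ra, rb = find(a), find(b)
        if ra != rb:                       # adding this link creates no cycle
            parent[ra] = rb
            tree.append((a, b, latency))
    return tree

# Hypothetical overlay nodes and measured pairwise latencies (ms).
nodes = ["terminal_A", "terminal_B", "terminal_C", "composition_server"]
links = [
    (12, "terminal_A", "composition_server"),
    (18, "terminal_B", "composition_server"),
    (25, "terminal_C", "composition_server"),
    (30, "terminal_A", "terminal_B"),
    (22, "terminal_B", "terminal_C"),
]

print(kruskal_mst(nodes, links))
```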
For distribution of more statically allocated content, the caching scheme in the content-aware cloud can decrease the overall load on the links.
2.2. Naming
In this section we describe the state of the art in naming for the current and next generation of networks, for the delivery of various types of multimedia content. Naming can appear a very mundane concept, but the concept of a name, the conventions around it, the way we employ it, the mechanics of how we assign, store and publish it and, more importantly, the semantics we attach to it can ultimately involve arcane technical, linguistic and philosophical issues. According to the American Heritage Dictionary, a name is "a word or words by which an entity is designated and distinguished from others". However, even in the physical world, where "entity" is more often than not used to denote concrete objects or things, the concept can in cases be blurry enough to at least inspire the famous "what's in a name?" line and, we suspect, some philosophical debate as well (for which we can provide no references). In computer science and engineering, the realm of possible entities is expanded to include concepts that can be quite abstract, such as a database record, a variable, a video, a segment of a video or even something that appears inside a video. Accordingly, the semantics attached to names assigned to such entities require more rigorous definitions. Skipping the first few decades of computer science, the first systematic approach to define a naming framework that extended past the confines of the computer system or network and was also externally visible to end users came with the advent of the World Wide Web (WWW). WWW standards define three kinds of names:
- Uniform Resource Names (URNs) uniquely identify entities, not necessarily web resources. A URN does not imply availability of the identified resource, i.e. a representation for it may not be obtainable. For example, the valid, registered URN "urn:ietf:rfc:2648" identifies the IETF (Internet Engineering Task Force) RFC (Request for Comments) 2648, "A URN Namespace for IETF Documents", but a user cannot paste it into their browser's address bar and fetch the document. URNs were introduced with RFC 1737 (1994), which does not formally define what a URN is but rather specifies a minimum set of requirements for this class of "Internet resource identifiers" and states that "URNs are used for identification".
- Uniform Resource Locators (URLs). According to RFC 1738 (1994) these are "compact string representations for a resource available via the Internet".
- Uniform Resource Identifiers (URIs). These are defined in RFC 3986 (2005) as a "compact sequence of characters that identifies an abstract or physical resource".
It is telling that even an engineering body such as the IETF has had to refine and further qualify the above concepts, over a period of more than ten years, through a series of updates and additional RFCs to fully clarify their role and their relationship to one another. At this time, the consensus is that URNs are used to identify resources, URLs are used to obtain them, and URIs are the parent concept, the base class if one is to use a programming analogy; i.e. both URLs and URNs are URIs. However, the URL and URN sets are "almost disjoint", i.e. there are schemes that combine aspects of both a URL and a URN. The triad of URIs, URLs and URNs is depicted in Figure 43.
Figure 43: Relationship between URIs, URNs and URLs
Setting aside the nuances surrounding URIs, URNs and URLs, there is even some room for debate on what a resource actually is and how it is defined. For instance, consider the URL www.bbc.co.uk/weather/Paris. Should one construe this URL to denote a resource whose representation will provide:
- the current weather in Paris, in human-readable form?
- the current weather in Paris, in machine-readable form (according to a grammar published elsewhere)?
- an index to historical data on Parisian weather (again, in human- or machine-readable form)?
- a five-day forecast for the weather in Paris?
It is almost impossible to devise a framework that would allow one to formally define what a particular URL is, unless one also has a world ontology (and such does not exist at the end of 2011). The IETF had, circa 1995, a working group on Uniform Resource Characteristics (URC) with the more modest goal of providing a metadata framework for Web resources, but the approach was ill-fated, although it influenced related technologies like the Dublin Core and the Resource Description Framework. We spent some time above discussing the naming schemes used in the context of the WWW for four reasons:
1. To highlight some of the dimensions, properties and challenges of a naming scheme.
2. Because the WWW standardization effort is a rigorous and yet open process that allows one to view the historical evolution of these concepts through the published documents (RFCs).
3. Because it is the system that most non-technical users are aware of (even though the undefined term "web address" is usually used).
4. Because web browsers are ubiquitous client applications and multimedia content is mostly consumed through them.
In general, the following dimensions or properties define a naming scheme or a class of naming schemes working together:
- semantics
- uniqueness
- persistency
- verifiability
- metadata versus opacity
We discuss these dimensions in the subsections that follow, briefly presenting the state of the art in each dimension where applicable. We conclude the section with some thoughts on the REVERIE approach to this matter.
Semantics
The meaning we attach to identifiers and names can vary considerably between schemes. We already discussed how URNs are an example of a class of schemes that identify resources, as opposed to URLs, which are a class of schemes that provide a method for retrieving a representation of these resources. This seems to indicate a dichotomy: "name of the object" versus "name of a method to get the object". Taking into account the full Web software stack and the domain name translation that inevitably takes place behind the scenes, there is really a trichotomy:
- the name of the object
- how to fetch it
- the location of the server where the object is hosted
URLs also denote a location, since they include a domain and a port component in their syntax:
scheme://domain:port/path?query_string#fragment_id
In today's web, there is a variety of methods by which the actual location is not really constrained by the domain component.
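As a small illustration of the URL components listed above, Python's standard urllib.parse module splits an identifier into scheme, authority, path, query and fragment; a URN parses as a URI with scheme "urn" and no network location. The query string and fragment in the example URL are added purely for illustration.

```python
from urllib.parse import urlsplit

# A URL: identifies a resource *and* tells us how/where to fetch it.
url = urlsplit("http://www.bbc.co.uk:80/weather/Paris?units=metric#today")
print(url.scheme, url.netloc, url.path, url.query, url.fragment)
# -> 'http', 'www.bbc.co.uk:80', '/weather/Paris', 'units=metric', 'today'

# A URN: identifies a resource but gives no way to retrieve it.
urn = urlsplit("urn:ietf:rfc:2648")
print(urn.scheme, urn.netloc, urn.path)
# -> scheme 'urn', an empty netloc, and path 'ietf:rfc:2648'
```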
For instance, CDNs use a variety of techniques to route requests for the same domain to different physical servers:
- server balancing
- DNS-based request routing
- dynamic metafile generation
- HTML rewriting
- BGP-based anycasting
The end goal of these techniques is to ensure optimal routing. The criteria can be diverse, but usually the predominant consideration is the geographical proximity of the server, i.e. the closest of a group of servers is chosen. (Geographical distance is obviously not the same as network distance, e.g. the number of hops or Layer 3 routers in the path, but empirical results have shown it to be a good enough approximation. The haversine formula is typically used to compute the distance from the geo-coordinates, if such are available, possibly obtained via a reverse GeoIP database.) Other possible criteria could be server load or the "cost" of certain network routes (in cases where another provider's infrastructure has to be used). The above methods vary greatly in the mechanics they employ, the infrastructure they assume, their degree of transparency to different layers of the software stack or network equipment, and the portion of the path that they can affect, but the end result is the same: they succeed in introducing a further degree of separation between a URL and the actual location. URLs in such a scheme then become, in a sense, more like URNs, as they no longer denote a location.
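As a toy illustration of the geographical-proximity criterion mentioned above, the sketch below implements the haversine formula and picks the closest of a few candidate edge servers. The coordinates and server labels are hypothetical; a production CDN would combine this with server load and route-cost information.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical client and candidate edge servers (lat, lon from a GeoIP lookup).
client = (48.86, 2.35)                                   # Paris
servers = {"ams": (52.37, 4.90), "lon": (51.51, -0.13), "fra": (50.11, 8.68)}

closest = min(servers, key=lambda s: haversine_km(*client, *servers[s]))
print(closest)
```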
However, even the above discussion makes sense only in the context of today's Internet, and more precisely only when one assumes the use of the Internet Protocol (IP) for the network layer (OSI Layer 3). In more than one sense these techniques are hacks that violate the end-to-end principle the Internet was built on. The various efforts of the last ten years towards Content Centric Networking (CCN) will (if fruitful) remove the need for such wizardry by redesigning the software stack. Topologically, the existing Internet can be viewed as a graph of physical equipment; without any loss of generality we consider only routers in this discussion. Though strictly not a tree in the graph-theory sense, the Internet can be approximated as a tree with the edge devices and user equipment (clients and servers) located at the leaves and the core routing equipment at the "branches" of the various "levels" (these terms appear in quotes as they are non-technical, without a definition in graph theory). IP routing tables then operate to ensure the transport of data from any leaf to any other leaf. Although this model is accurate, it is also very general and fails to take advantage of the well-known disparity in the relative populations of data sources versus data sinks, or of the fact that data, and especially multimedia content, is usually produced only once and can remain in demand for years, consumed by multiple data sinks. Research in CCNs was prompted by this observation. Accordingly, in CCNs the focus shifts from the leaves of the Internet tree to the data or content itself (hence the content-centric designation). This is reflected by the fact that names are now used, even in routing tables, to designate not the communication endpoints but rather the content itself. The TRIAD project at Stanford, circa 1999, was the first to propose avoiding Domain Name Server (DNS) lookups by using the name of an object to route towards a close replica of it. A few years later (2006), the DONA project at Berkeley built upon TRIAD by incorporating security (authenticity) and persistence in the architecture. In 2009, PARC announced their content-centric architecture within the Van Jacobson-led CCNx project; specifications for interoperability and an initial open-source GPL implementation were released later the same year. Naming is critical to all CCN projects, as identifiers for content are used directly in the routing tables of most of these architectures. This also forces naming to operate at a fairly low level, as the name of the content has to sufficiently describe the data; in other words, the name should uniquely identify a byte stream or a portion thereof. Since CCN architectures obviously rely on packet switching and have to support streaming modes of content delivery, naming operates at the packet level; i.e. names are assigned to packets of data, not to the content as a whole. This necessitates automatically naming such packets in a unique way, and the usual approach is to assign a hash, usually a cryptographic hash such as MD5 or SHA-256, to at least some of the name components. This obviously does not preclude the use of other components to store metadata such as the IPR owner, version, encoding format, etc. On the other hand, at the routing level the network can simply take the names as they are without trying to infer meaning; conventions and their enforcement can be left to higher layers. That is the approach CCNx takes, where the name is formally defined using BNF notation as:
Name ::= Component*
Component ::= BLOB
That is, the CCNx protocol does not specify any assignment of meaning to names, only their structure; the assignment of meaning (semantics) is left to applications, institutions or global conventions which will arise in due time. From the network's point of view, a CCNx name simply contains a series of Component elements, where each Component is simply a sequence of zero or more bytes without any restrictions as to what this byte sequence may be. In general, it is fair to say that CCNs have not yet entered the mainstream, nor do the dynamics (at least in terms of community size, publications or interest) appear to suggest that they will in the short or even medium term. The CCNx project led by Van Jacobson seems to have been the leading effort in this field for the last few years, but even there the current state is described as follows: "CCNx technology is still at a very early stage of development, with pure infrastructure and no applications, best suited to researchers and adventurous network engineers or software developers" (source: http://www.ccnx.org/about/, accessed November 2011). In our view, this is not to be attributed to any deficit in the theoretical foundations or the engineering behind the CCNx effort, but rather to the fact that existing solutions like CDNs are so widespread and seem to work just well enough, for the time being, even though undoubtedly lacking in grace. It may also be the case that the current Internet architecture and the often criticized "hourglass" IP stack represent such a massive investment in terms of capital and human talent (theoretical work, applications, protocols, deployed assets) that it is going to take a comparable commitment, or a clearly perceived technical roadblock, to prompt any serious redesign. Perhaps recognizing that, certain application-layer designs have also been proposed.
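As a toy illustration of the kind of hash-based, component-structured content names discussed above, the snippet below names a packet-sized chunk by a publisher prefix, a version, a segment number and a SHA-256 digest of the chunk's bytes. The component layout is a hypothetical convention of our own, not part of the CCNx specification; a receiver can use the digest component to verify the chunk it obtains, which is also the self-certifying idea used by the application-layer design discussed next.

```python
import hashlib

def name_chunk(publisher, version, seq, chunk):
    """Return a list of name components for one content chunk.
    The layout (publisher/version/segment/digest) is a hypothetical convention."""
    digest = hashlib.sha256(chunk).hexdigest()
    return [publisher.encode(), f"v{version}".encode(),
            f"seg{seq:06d}".encode(), digest.encode()]

data = b"example media payload " * 60      # stand-in for one packet-sized payload
name = name_chunk("reverie/media/demo", 1, 0, data)
print([c.decode() for c in name])

# A receiver can verify the chunk it fetched against the last name component.
assert hashlib.sha256(data).hexdigest() == name[-1].decode()
```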
The Juno middleware is one such content-centric solution, which again relies on "generating one or more hash values from the content's data" to implement a self-certifying content naming approach ("Juno: An Adaptive Delivery-Centric Middleware", available from http://www.dcs.kcl.ac.uk/staff/tysong/files/FMN12.pdf, accessed November 2011).
Uniqueness
In the previous subsection on semantics, where we surveyed some prevalent and emerging networking approaches, we mentioned that generating hashes (preferably cryptographic hashes such as those in the MD and SHA families) from the content data is the obvious way to generate unique identifiers. These hashes may then be further extended with additional name components, which may be administratively derived or assigned by the applications and which can be used to store metadata. Another way to guarantee uniqueness is to rely entirely on central or hierarchical registration authorities and (for the last part of a name) on administrative house-keeping within an organization or application. This is the approach URIs take, relying on the hierarchical DNS system. In these approaches there is no automatic validation built into the naming, which is why other methods have to be used for that purpose (relying on certificates or securing the HTTPS channel). Finally, there are also completely or partially random approaches to naming, which generate unique identifiers without any reference to the content by hashing together information such as a computer's MAC address, a timestamp and a 64- or 128-bit random number, making it sufficiently improbable that two identifiers generated in such a scheme will ever collide. Such identifiers are known as Universally Unique Identifiers (UUIDs) and are documented in RFC 4122 and in equivalent ITU-T and ISO/IEC recommendations. Microsoft's Component Object Model was the first to use them (called GUIDs, for "globally unique"). Open-source implementations to generate such identifiers are available for most programming languages and database frameworks.
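Python's standard uuid module implements the RFC 4122 identifiers mentioned above; a minimal sketch follows (the URL used in the name-based variant is hypothetical).

```python
import uuid

# Time + MAC based identifier (RFC 4122, version 1).
print(uuid.uuid1())

# Purely random identifier (RFC 4122, version 4), 122 random bits.
print(uuid.uuid4())

# Name-based identifier (version 5): deterministic for a given namespace + name,
# handy when the same resource must always map to the same identifier.
print(uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/reverie/asset/42"))
```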
Naming Persistency
Defined rather crudely and by way of example, naming persistency is the property of a naming system that allows one to write down on a piece of paper (or store in a file, if the name is not human-friendly) a name used to identify some data (or a service), disconnect from the system, come back after an arbitrary period of time (during which servers may have undergone reboots or even crashes) and use the name to retrieve the same or equivalent data, provided of course that the data itself has survived. The Web is an example of a system without naming persistency, as witnessed almost every day by broken links and 404 responses, yet some persistency is still assumed for practical purposes (which is why even this document provides URLs for references). Naming persistency is an especially challenging requirement in peer-to-peer networks due to the dynamics of the peer population and the ad hoc formation and disbanding of groups. In such a context (i.e. one with a non-stable population of nodes, especially in high-churn situations), distributed name registration and resolution are used to provide persistency. However, it is not always the case that the benefits of such a property outweigh the complexity it adds. The paper by Bari et al. (2011) explores some of these issues.
Verifiability
The subsection on semantics also discussed some of the verifiability concerns. We summarize them here and provide some additional pointers, since this property merits independent treatment. Verification in the case of immutable data is not really an issue, as any hash can be used to provide reasonable validation. Of course, that should preferably be a strong, collision-resistant cryptographic hash function. Most functions in the MD or SHA families fit the bill, although advances in cryptanalysis compel users to move to stronger hashes over time (e.g. MD5 was broken in 2008; note that "broken" does not mean that the hash no longer offers any protection, just that an attack with a complexity lower than previously imagined has been devised against it, and since the computational cost of most hash functions is negligible there is no reason not to move to stronger hashes over time, except for high software update and regression testing costs). SHA-1 and SHA-2 are the most commonly used cryptographic hash functions as of 2011, together with MD5 despite its being "broken". SHA-3 is expected to become a FIPS (Federal Information Processing Standard) around 2012. An interesting case arises when the verifiability of mutable data is concerned. For mutable data, two approaches can be taken:
- Take the view that each change actually generates a completely different object and that there is no need to maintain ancestry information as part of the name (such metadata can be maintained externally if necessary). In such an approach the issue becomes moot.
- Try to encode ancestry/versioning information in the name itself. In such an approach the name will need to have a fixed component which is supplemented by a changing set of version hashes. This, however, means that the totality of the name can no longer be verified, and so full verification has to involve other cryptographic techniques such as digital signatures, which require the overhead of a public key infrastructure.
Metadata versus Opacity
The discussion of metadata versus opacity relates to names since it affects the decision, when devising a naming scheme, of whether to include metadata as part of the name itself (e.g. as a component of the name, as in a file extension or in file names which embed a date) or whether to consider it extraneous and maintain it in a separate infrastructure (if at all). In an opaque arrangement, the software handling the name effectively treats it as a blob and does not try to infer any meaning or differentiate its processing by examining name components. Clearly, in cases where the names embed metadata it is still possible for lower levels to treat them opaquely and only expect higher application layers to use this metadata; but in practice the temptation may be too strong to ignore. Architectures where identifiers are really opaque (i.e. opaque by construction, as opposed to being treated as opaque) make such proper layering considerations or end-to-end principles easier to enforce (forfeiting the possibility of "clever" improvements in the network). CORBA's IORs and J2EE object references are two such cases in point.
2.3. Resource Management
This section discusses the state of the art in managing resources in 3DTI applications similar to the one developed in REVERIE. In REVERIE, we aim to develop more advanced resource management schemes. As a starting point, the resource management schemes previously applied in 3DTI applications in Illinois, Berkeley, Pennsylvania and Chapel Hill will be discussed (Nahrstedt et al. 2011a; Towles et al. 2003). Managing resources is one of the main challenges in REVERIE. 3DTI is a demanding application that consumes network and computing resources rapidly and requires low delays for interactive communications.
Proper resource usage will benefit the project in the following ways:
- High interactivity due to smaller end-to-end delays
- Better image/perceptual quality due to better bandwidth usage
- Fewer system breakdowns due to unavailable resources
Available tools from the literature and REVERIE partners
In 3D tele-immersion, large volumes of data are interchanged in real time. To achieve a high-quality real-time experience, low latency and high throughput are required. To achieve this, resource management (of network, hardware and software resources) in the system is of utmost importance. Poorly utilized links, losses and CPU blocking are factors that can heavily deteriorate the user experience. Nahrstedt et al. (2011a) have over six years of experience in deploying and testing 3D tele-immersive environments and have summarized the resource management issues. This work can be consulted as a reference for resource management in 3DTI and other complex interactive media environments. We summarize the results relevant to REVERIE below, to clearly underscore the design issues and challenges at hand. The main concerns are the build-up of latency through the system and blocking due to bandwidth limitations. From these techniques, we will choose suitable ones for managing processing power at the receivers and terminals, and for managing network bandwidth.
Bandwidth: Both CPU and network bandwidth have to be taken into account to avoid partial blocking of media streams and to maintain synchronous arrival. This resource is needed to achieve high throughput of events from the sender site to the receiver site. We will survey the following mechanisms to manage these resources:
Network bandwidth:
1. Diffserv: a technique increasingly used in the internet that allows differentiation between packets at the IP level. Higher-priority packets are served first. The Per Hop Behavior (PHB) can be defined per Diffserv class and defines queuing, scheduling, policing and traffic shaping. Current PHBs are: Default PHB, Class Selector PHB, Expedited Forwarding PHB and Assured Forwarding PHB. While the approach is scalable, it requires knowledge of the internet traffic and of the routers used in the network. Diffserv can be deployed by the telecom operator or Internet Service Provider (ISP).
2. Intserv: a flow-based Quality of Service (QoS) model. Receivers reserve bandwidth for the end-to-end path, typically using the Resource Reservation Protocol (RSVP), allowing routers in the network to maintain a soft-state path. RSVP/Intserv is generally considered less scalable.
3. MPLS: Multiprotocol Label Switching is often employed to provide QoS, mostly from the operator side. MPLS [RFC 3031] inserts a label between layer 2 and layer 3 and is sometimes called layer 2.5. The initial idea was that forwarding based on a label value would be faster than forwarding on an address as done in IP. Currently the three important applications are:
a) Enabling IP capabilities on non-IP forwarding devices, mostly ATM, but also Ethernet/PPP (also called generalized MPLS).
b) Explicit routing paths: in some types of networks certain paths are preferred, but the IP routing mechanism does not support explicitly choosing a path. This facility is mostly used by operators for traffic engineering, so that resources are properly used.
c) Layer 2/Layer 3 tunneling: by using MPLS labels from the head end, different protocols/packets can be encapsulated, and ATM or other types of layer 2 services can be emulated. This can be useful for providing legacy services on newer hardware. As IP can also be encapsulated between remote sites using an MPLS-based virtual connection, the construction of VPNs is a major application of MPLS.
MPLS is a mainstream technology mainly employed by telecom service operators and internet operators. Its applicability to the REVERIE use case of deploying tele-immersion on the internet is limited, because we cannot change the operators' networks. However, employing tele-immersion over a private VPN could be possible, though somewhat costly.
4. Bandwidth broker: a bandwidth broker manages resources by admission and allocation of resources based on the current network load and Service Level Agreements (SLAs). By continuously monitoring the RTT, loss rate and packet size from the monitoring output, it can adapt its parameters to decide whether or not to admit a stream. Bandwidth estimation is important in this scheme as it determines the adaptation steps of the algorithm; several techniques have been developed to do so, as referenced in Nahrstedt et al. (2011a). A bandwidth broker is usually deployed at the edge routers of the network.
5. Application-level overlay: as already discussed in the architecture section, an application-level overlay can be used to achieve better management of the underlying resources. For example, Zhang et al. (2005) presented an overlay for streaming video on the internet, which was previously not feasible due to bandwidth restrictions. Other overlays for streaming media were presented in Yi et al. (2004), while a general framework for deploying overlays in PlanetLab was presented in Jiang and Xu (2005). The main advantage of using overlays is that the underlying network infrastructure does not need to be changed, whereas the previous methods do require such changes. However, the independence of the overlay network from the underlying network/physical infrastructure can also result in mismatches and less efficient resource usage, as shown in Ripeanu (2001). In REVERIE we are most likely to use an overlay network or an (MPLS-based) VPN for bandwidth, as Diffserv, Intserv and a bandwidth broker would require network control not available to the partners.
CPU bandwidth: For real-time media streaming, it is important to control the CPU bandwidth (i.e. cycles per time unit) so as to avoid blocking of real-time media throughput. Blocking of real-time media can unnecessarily result in extra delays, jitter and skew both at the sender and receiver sites. Managing and allocating processor bandwidth to streams can be seen as a processor scheduling problem. A distinction can be made between best-effort schedulers and real-time (media) schedulers. Best-effort schedulers can allow prioritization of some tasks (media tasks) or allow bandwidth to be divided proportionally; however, they do not provide hard guarantees of when a task will be scheduled. In real-time schedulers, on the other hand, algorithms like earliest deadline first can provide hard deadlines on when tasks are scheduled, i.e. real-time capabilities.
Similar to the allocation of network bandwidth, admission algorithms have been developed for admitting tasks to be scheduled. In REVERIE we will investigate the applicability of currently available real-time and best-effort schedulers to our 3DTI application where possible. In REVERIE we are not planning to deploy CPU scheduling at the clients and servers. Instead, we will deploy monitoring tools to make sure that blocking is avoided both at the clients and at the composition server.
Delay (end-to-end delay): In general, as defined in the ITU standard G.114 (2003), delays above approximately 400 ms are unacceptable in two-way videoconferencing. This delay requirement can also be assumed for 3DTI. In the current (centralized) REVERIE architecture the end-to-end (e2e) delay for real-time data can be described as:
E2ED = D_sender + D_network + D_server + D_network + D_receiver
D_sender consists of the capturing and compression delay at the sender site, D_network represents the delays introduced in the network (once on the path to the server and once on the path to the receiver), and D_server represents the processing delays at the server (transcoding, the scene composition model). Delays at the receiver consist of decoding, buffering and rendering. Synchronization of multiple streams can also cause extra delay, as buffering is required to adapt to the skew between the streams. All these steps are described in Nahrstedt et al. (2011a); we briefly summarize them here.
Sender-side delay: Many delays are incurred in hardware/software during capturing and coding. 2D video encoding and the reconstruction of 3D scenes can be optimized by hardware/parallel implementations; references are given in the related section of Nahrstedt et al. (2011a). Generally, in interactive applications such as video conferencing, it is preferred to keep the encoding and decoding delays as small as possible. Most of the time, compression and decompression have the same order of complexity (delay).
1. REVERIE supports different types of senders (scalable terminals); each level will be tested for its (limited) processing delays.
2. Preferably, an API to the networking/monitoring components is available that reports delays, so as to enable Quality of Experience based networking.
Internet network delay: One of the aims of REVERIE is to show that 3DTI can run on the internet; therefore an approach with no leased or specialized networks is taken. The regular internet will introduce extra delays that have to be taken into account. We must distinguish between point-to-point and multi-point delay management. The first is tightly related to bandwidth management; with multiple sites, delay management becomes important (as delay accumulates over multiple hops). Currently in REVERIE we have two different approaches to handle the network delay in each use case:
1. In use case 1, there are many users. However, traffic is simplified in such a way that links are under-committed and internet delay only plays a minor role. If this is not sufficient, techniques from P2P/overlay networking or gaming can be investigated.
2. In use case 2, there are three users with heavy traffic. The links to and from the server will be monitored continuously for capacity/latency/loss, and the bit-rate will be adapted using the adaptive coding schemes that will be developed. We will create awareness of the network state at the two terminals and the server.
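As a tiny illustration of the kind of monitoring-driven adaptation described for use case 2, the sketch below adjusts a sender's target bit-rate from reported loss and RTT figures in an additive-increase/multiplicative-decrease fashion. All thresholds, step sizes and the starting rate are hypothetical; the actual REVERIE adaptation logic is still to be developed in the project.

```python
# Minimal sketch of sender-side rate adaptation driven by monitoring feedback
# (e.g. RTCP receiver reports). Thresholds and step sizes are hypothetical.

def adapt_bitrate(current_kbps, loss_fraction, rtt_ms,
                  min_kbps=500, max_kbps=20000):
    """Additive-increase / multiplicative-decrease style adaptation."""
    if loss_fraction > 0.02 or rtt_ms > 300:
        return max(min_kbps, int(current_kbps * 0.7))   # back off under congestion
    if loss_fraction < 0.005:
        return min(max_kbps, current_kbps + 250)        # probe for more bandwidth
    return current_kbps                                  # otherwise hold steady

rate = 8000  # kbps, hypothetical starting point for a heavy 3D video stream
for loss, rtt in [(0.0, 80), (0.0, 90), (0.03, 310), (0.01, 120), (0.0, 100)]:
    rate = adapt_bitrate(rate, loss, rtt)
    print(rate)
```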
Receiver-side delay: Loss concealment, receiver buffering and the decoding/rendering of 2D/3D video streams all introduce computational delays. In the REVERIE case we should make sure that the 2D/3D video rendering/decompression is fast enough given normally sized buffers and computationally simple concealment techniques. Also, adapting the rate based on monitored receiver-side delays can increase performance (via a measurement API interface).
1. REVERIE supports different types of receivers (scalable terminals); each level will be tested for its (limited) processing delays.
2. Preferably, an API to the networking/monitoring components is available that reports delays, so as to enable Quality of Experience based networking.
Synchronization of multiple streams: In previous work, 3DTI systems have consisted of multiple correlated video streams from different cameras/receivers. Synchronizing the multiple streams again at the receiver is a challenging task, and in REVERIE a similarly complex synchronization problem is encountered. In this section we briefly explain the two synchronization problems encountered in 3DTI as described in Ott and Mayer-Patel (2004) and Huang et al. (2011).
In Ott and Mayer-Patel (2004), a single 3DTI site, referred to as a cluster, consists of different IP cameras which send multiple related streams via a gateway towards another cluster, where the streams are rendered. On the path between the two clusters, the streams traverse similar links and routers. If links become over-committed, the streams will compete for bandwidth and parts of a stream may be dropped to avoid congestion. When the streams are processed by the receiver gateway and rendered at the receiving cluster, it turns out that the damaged stream causes problems when rendering the intact stream. It would have been much better if the streams had experienced similar loss/delay or bit-rate adaptation to avoid the congestion in the first place. As the streams come from different IP cameras (different senders), this type of coordination is not supported by current internet protocols. The authors therefore develop a protocol on top of the User Datagram Protocol (UDP) that allows monitoring of connections at the gateways of the clusters, to enable coordination between streams. The experimental results showed the improvements achieved. This specific type of effect can be expected in REVERIE, for example in use case 2 when multiple IP cameras synchronously capture the scene and transmit it to another site.
In Huang et al. (2011), a similar synchronization problem was tackled with multiple sites (N = 5-9). Each site produces multiple (N) synchronous 3D video streams that are sent to more than one receiver site. Each receiver (theoretically) needs all N streams to be able to render properly. This raises the following questions regarding the inter-stream synchronization of streams from the same sender:
1) Can the number of streams that each receiver needs be reduced to alleviate the network?
2) To which video stream should the other streams synchronize when rendered? (Which video stream is the master stream?)
3) Can we route the streams in such a way that they arrive at approximately the same time at their destination? This implies small inter-stream skew and improved inter-destination synchronization.
Regarding questions 1) and 2), the work of Huang et al. (2011) draws on a previous study (Yang et al. 2009).
Depending on the viewing direction of the user at the receiver site, the number of streams needed is reduced (only the ones relevant to the viewing direction are requested). Also, for synchronization, the stream closest to the direction of the user's view is chosen as the stream to which the other streams synchronize. The work in Huang et al. (2011) mainly focuses on question 3): an algorithm is developed that aims to minimize the overall end-to-end delay, with both the inter-stream skew and the bandwidth on the links as given constraints, for 5-9 nodes. For a deeper look we refer to the paper itself; for REVERIE it is important to take the problems presented here into account, as we will need to develop our own custom solutions suited to the use cases. In REVERIE, significant issues regarding inter-stream synchronization are expected to arise. According to previous research, solutions lie in adapted routing/overlay networking (Huang et al. 2011) and in adaptation of internet protocols (Ott and Mayer-Patel 2004). In REVERIE we will investigate both approaches. More specifically, in use case 1 we will test overlay schemes for multiple nodes, while in use case 2 we will investigate protocol adaptation for communication between clusters. The REVERIE consortium will report protocol adaptations useful for 3DTI to the IETF as possible new drafts.
2.4. Streaming
This task deals with the live transmission of media over the internet, often referred to as streaming. Protocols, codecs and technologies to support this have been developed in the past and are described below. Adoption of an appropriate streaming mechanism allows smooth and interoperable media flow within the REVERIE framework. The streaming mechanisms developed will set requirements on resource management and on the network architecture. Streaming does not only deal with the compression of media data, but also with the packetisation and transport of such media over the network in an optimal way, as is needed in REVERIE. Streaming mechanisms can be adapted to the network/architecture and its conditions, or the other way around.
Available tools from the literature and REVERIE partners
Signalling and synchronisation for bidirectional data exchange
Description: Overall system synchronisation; depending on the architectural choice, it may include the synchronisation of each user with the server and/or among the users themselves. It also includes the synchronisation of user-generated content with the cloud. Ultimately, it will ensure that each user is synchronised with the virtual environment.
Dependencies on other technology: Depends on the final implementation choices of the architecture, e.g. P2P, hierarchical data distribution, client/server architecture.
Data streams synchronisation
Description: Synchronisation of data streams at the receiving side (audio, separate 3D views, depth maps, 3D mesh models, avatar positions and any other type of data, which are potentially interdependent). The goal is to ensure a consistent reconstruction of the scene.
Dependencies on other technology: Chosen standards or design choices for audio, video, or 3D shape representation.
Real-time streaming
Description: Use of transmission protocols that ensure delay-constrained data delivery, as required by the system.
Common techniques, like the Real-Time Transport Protocol (RTP) and the Real-Time Transport Control Protocol (RTCP), can be used to regulate the data flow depending on the available resources and the configuration demanded by the user (within the degree of control allowed by the system).
Dependencies on other technology: Logical organisation of the components of the system (users and cloud), i.e. user and server control of the exchange of data.
In this section we present the state of the art in streaming real-time media on the internet. In REVERIE we will be streaming multiple heavy media streams over the internet, therefore selecting the right technology/protocol for streaming is an important issue. First we review the basic principles of video coding and the evolution of the related standards. Second, we summarize the internet protocols available for streaming. After this we look at specific issues and research related to streaming for 3DTI. Two differences between 3DTI streaming and, for example, a video conferencing stream are the need for inter-stream synchronization between multiple related streams and the fact that streams originate from different senders.
2.4.1. Internet Protocols
The selection of the appropriate protocol for encapsulating and transporting REVERIE media over the internet is an important research question investigated in REVERIE. Task 5.4 investigates different real-time internet protocols for this purpose. We list commonly used protocols for media streaming below. The common distinction we will follow is between acknowledged and unacknowledged protocols for transmission, and control protocols for connection setup. In the internet protocol stack, the Transmission Control Protocol (TCP) provides reliable, acknowledged data transport. Its main principle is that unacknowledged packets (losses on the physical layer or IP-level drops) cause it to halve its congestion window and decrease its effective send rate by half. TCP is the dominant protocol on the internet and so far its anti-congestion mechanism has served the internet well. In the internet protocol stack, UDP provides an unreliable datagram transport function. While it provides a checksum for error detection, the handling of errors and congestion is meant to be done by higher-level protocols, so-called application-level framing. The RTP/RTCP protocols provide this type of application-dependent framing on top of UDP. For many multimedia applications (codecs, video, audio), the Internet Engineering Task Force provides standardized packet formats for both the transmitted packets (RTP) and the feedback reports (RTCP), and many multimedia applications indeed use this format. One of its main advantages is that it is not bound by TCP's behavior of halving the send rate, enabling higher effective throughput. As TCP traffic is still dominant in the current internet, ISPs have not yet shown efforts to enforce TCP-friendly behavior.
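To give an idea of the application-level framing that RTP adds on top of UDP, the sketch below packs the fixed 12-byte RTP header (RFC 3550) in front of an opaque payload and sends it as a UDP datagram. The payload type (96, a dynamic type), SSRC value and destination address are hypothetical, and this is only the fixed header, not a full RTP stack.

```python
import socket
import struct

def rtp_packet(payload, seq, timestamp, ssrc, payload_type=96, marker=0):
    """Prepend the fixed 12-byte RTP header (RFC 3550) to a payload."""
    v_p_x_cc = 2 << 6                               # version 2, no padding/extension/CSRC
    m_pt = (marker << 7) | (payload_type & 0x7F)    # marker bit + payload type
    header = struct.pack("!BBHII", v_p_x_cc, m_pt, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ssrc = 0x12345678                                   # hypothetical stream identifier
for seq, frame in enumerate([b"frame0", b"frame1", b"frame2"]):
    pkt = rtp_packet(frame, seq, timestamp=seq * 3000, ssrc=ssrc)
    sock.sendto(pkt, ("127.0.0.1", 5004))           # hypothetical receiver address
```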
While the use of RTP/UDP seems preferable for media traffic, the internet initially evolved mainly as a medium for exchanging textual information, for which the combination of HTTP/TCP is common. As this type of traffic has become prevalent, tunneling media on top of it has also become popular. While from a networking point of view this is not optimal, web and mobile video transmitted over HTTP are popular. To support mobile and web video over HTTP in a standardized way, MPEG (the Moving Picture Experts Group) and 3GPP (the 3rd Generation Partnership Project) have released the MPEG-DASH (Dynamic Adaptive Streaming over HTTP) standard, which supports transmission of coded video in segments of 2-10 seconds over HTTP (Stockhammer, 2011). The receiving client can merge the segments together and display them smoothly. For setting up and controlling sessions, several control protocols have been standardized by the IETF that can also be deployed in REVERIE. For example, the Session Initiation Protocol (SIP) is often used to locate a sender and receiver and set up an RTP media stream connection. SIP can exchange Session Description Protocol (SDP) parameters that store specific media parameters (format, description, codec, bandwidth needed, etc.). The Real Time Streaming Protocol (RTSP) is an application-level protocol that gives the client direct control of the delivery of media data with real-time properties. RTSP provides an extensible framework to enable controlled, on-demand delivery of real-time data, such as audio and video, and supports options like pause, play, fast-forward, etc. In REVERIE we should try both RTP-based streaming and TCP/HTTP-based streaming according to the new MPEG-DASH specification. At first sight, RTP seems to provide the large throughput needed in use case 2 for full media features, while HTTP adaptive streaming could be interesting to test in use case 1. Comparisons of the performance of the two could also be interesting from a research perspective.
2.4.2. Basic Video Compression
Video compression is an extensive field of research marked by continuous industry standardization efforts, mainly driven by MPEG and the ITU-T Video Coding Experts Group (VCEG). The stakes are high: applications like digital TV, internet video, video conferencing systems and cinemas all use video compression and reach billions of people worldwide. In this section we review the principles of video compression as implemented in the related standards. In REVERIE, encoding/decoding mechanisms are needed to meet the bandwidth requirements. However, as encoding operations introduce computational and buffering latency, an efficient scheme needs to be selected to minimize this unwanted latency. To do this we need knowledge of 3D video coding, which turns out to be related to, or based on, basic 2D video coding (hybrid video coders). We will not go deeply into technical details, but instead focus on issues relevant to the REVERIE use cases and give brief summaries. We will first explain the four basic techniques/principles used in most video coders and then review the existing video standards.
The first technique that is fundamental to both image and video coding is intra-coding. In an image, pixels/areas close to each other often contain similar values. Exploiting this for compression is called intra-coding and is used in most current standards, often with transform coding. Transforms of blocks of 8x8 pixels (or 4x4 in H.264) concentrate most of the information in the first coefficients, which represent the lower frequencies. This allows more efficient compression by dropping the trailing coefficients that are close to zero. Also, in H.264, prediction between transformed blocks in the same image is applied, obtaining even higher intra-compression gains. This technique is applied in both image coding and video coding.
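The transform step of intra-coding can be illustrated in a few lines: an 8x8 block is transformed with a 2D DCT, the coefficients are coarsely quantized, and most of the energy ends up in the low-frequency corner so the trailing coefficients can be dropped. This is a didactic sketch, not the integer transform or quantization matrices of any particular standard; the block content and step size are illustrative.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

D = dct_matrix(8)
# A smooth 8x8 block (pixel ramp), centred around zero as most codecs do.
block = np.fromfunction(lambda y, x: 100.0 + 6.0 * x + 3.0 * y, (8, 8)) - 128.0

coeffs = D @ block @ D.T            # 2D DCT: energy gathers in the low-frequency corner
q = np.round(coeffs / 16.0)         # coarse uniform quantization (illustrative step size)
print(f"{np.count_nonzero(q)}/64 coefficients survive quantization")

recon = D.T @ (q * 16.0) @ D        # dequantize and inverse-transform
print(f"max reconstruction error: {np.abs(recon - block).max():.1f}")
```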
The second principle fundamental to video coding is inter-frame coding. Inter-frame coding takes advantage of the fact that sequential frames in a video are often very similar, so the next frame can be estimated from the previous one. By estimating the motion of blocks in the frame, the next frame can be predicted with reasonable accuracy. As a consequence, apart from the motion estimate, only the difference between the estimate and the real frame has to be transmitted; this difference tends to be small and can be compressed efficiently (fewer quantization steps are needed). While forward prediction was already enabled in the H.261 video standard for conversational video, the later MPEG-1, MPEG-2 and H.264/AVC standards also support bi-predictive coded frames that additionally use a future frame for prediction. This introduces extra coding delay; the Simple profile of MPEG-2 and the Baseline profile of H.264 are therefore recommended, as they disable the use of bi-predictive coding.
The third important principle, which applies to both image and video coding (and any other type of information source), is entropy coding. Without going into theoretical details, entropy coding aims to reduce the average bit-rate by assigning shorter code words to more likely symbols. In the English alphabet, symbols like 'e' and 'a' would be assigned shorter bit-words, while less likely symbols such as 'q' or 'z' are assigned longer bit-words. Known schemes to obtain these words are Huffman coding and arithmetic coding; the latter is often used in modern video standards. Entropy coding gives good results, especially if values tend to be clustered.
The fourth important basic principle found in most video coding standards is quantization and rate control. Quantization is a form of lossy coding and refers to representing a large (or infinite) set of values by a (much) smaller set. The simplest example is analog-to-digital conversion when reading values from a sensor: the continuous values of the sensor are converted to a digital representation. For compression/storage purposes, fewer bits are preferred. Many parameters in a video coding system can also be re-quantized and sent using fewer bits; most standards support many quantizers to adapt to the signal. Regarding the output of the encoder, an approximately constant bit-rate is usually wanted. As the bit-rate depends on the video content and changes over time, rate control is often achieved by selecting a different quantizer.
Now that the fundamental principles of video coding have been briefly explained, the evolution of the different standards is briefly discussed. H.261 (1993) was one of the first hybrid video coders and supported conversational video at 64 kbps. H.261 features all the above techniques except bi-predictive encoding. The MPEG-1 standard is very similar to H.261 but targeted video storage on discs and hard drives; MPEG-1 introduced bi-predictive coding and provided random access to media files. MPEG-2 was subsequently developed as a more application-independent standard that provides different levels and profiles for each application. MPEG-2 supports conversational video, broadcasting and stored video in its different profiles. It notably supports the interlacing common in TV signals and has become dominant in TV broadcasting. Much of the current digital TV infrastructure supports MPEG-2 transmission/broadcasting; therefore MPEG-2 is sometimes also referred to as a transmission medium.
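A toy illustration of the block-matching motion estimation that underlies inter-frame coding in the standards above: for one block of the current frame, an exhaustive search over a small window in the previous frame finds the displacement that minimizes the sum of absolute differences (SAD), so that only a motion vector and a small residual need to be coded. Real encoders use far more sophisticated, hierarchical searches; the frame content here is synthetic.

```python
import numpy as np

def best_motion_vector(prev, cur, y, x, bs=8, radius=4):
    """Exhaustive block-matching search around (y, x), minimizing SAD."""
    block = cur[y:y + bs, x:x + bs]
    best, best_sad = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= prev.shape[0] - bs and 0 <= xx <= prev.shape[1] - bs:
                sad = np.abs(block - prev[yy:yy + bs, xx:xx + bs]).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
    return best, best_sad

# Synthetic example: the "current" frame is the previous frame shifted by (2, 3).
prev = np.random.rand(64, 64)
cur = np.roll(prev, shift=(2, 3), axis=(0, 1))
mv, sad = best_motion_vector(prev, cur, y=24, x=24)
print(mv, sad)   # expected motion vector (-2, -3) with a near-zero residual
```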
H.263 and H.263+ can be seen as the successors of H.261 and achieved significantly better compression than that standard for conversational video. In 2003 H.264/AVC was released; this standard achieves much better compression rates than previous standards (up to 50% where latency is tolerated). Apart from that, H.264/AVC supports the different types of applications (conversational video, stored files (the MPEG-4 format) and streaming) and supports the different transport layers in a clean way by using Network Abstraction Layer Units (NALUs). The profile recommended for conversational video in H.264 is the Baseline profile, which does not use B-frames/slices (bi-directional prediction). For an overview of the different technical features of the standard, the video coding layer and the network abstraction, we refer to Wiegand et al. (2003). Due to its complexity, the standard took some time to become fully supported by fast software and hardware. Fast encoding seemed to be the most critical issue, especially for conversational (real-time) video applications that require low encoding latency. Currently, the open-source x264 encoder provides fast encoding (Merritt and Vanam, 2004). Moreover, very low latency (<20 ms) can be achieved for conversational video according to Garrett-Glaser (2011), based on an Intel i7 1.6 GHz computer. The H.264 standard fully specifies the properties of the decoder, which is able to decode any input stream fulfilling the constraints of H.264. A description of a reference decoder is given in Merritt and Vanam (2004); decoders that conform to the reference should produce similar output for a given H.264 stream. Practically speaking, the open-source library libavcodec supports fast H.264 decoding, in the order of <10 ms for the Baseline profile. Wenger (2003) presents a paper on employing H.264 over IP, with an emphasis on wireline networks with relatively low error rates, where most of the losses in transmission are caused by congestion. Wenger (2003) argues that adding redundancy (forward error correction/detection bits) to combat losses increases traffic and is counterproductive, as it will lead to more congestion and more dropped packets. Therefore the design of H.264 provides different ways to combat the effect of these losses without introducing redundancy. We outline them below as they can be of use in the REVERIE streaming case:
1. Intra-placement: Due to intra/inter prediction, errors can drift in time; this effect is combated by inserting IDR pictures that invalidate the previous reference memory. The effect is stronger than with a normal intra frame, as in H.264 pictures can use frames before the last intra frame for prediction.
2. Picture segmentation into independently decodable slices: The picture is partitioned into slices of a number of macroblocks. The main motivation is that one slice can fit into one Maximum Transmission Unit (MTU) (IP packet); in this case losses will not affect the entire picture.
3. Data partitioning: Different symbols of a slice are given different priorities. The most important information in H.264 (headers, quantization and motion parameters) has the highest priority. Second is the intra-partition, which carries intra-coded content only. The lowest priority is given to the inter-partition, which contains inter-predicted content.
4. Parameter sets: In H.264 this infrequently changing parameter information (video format, entropy coding mechanism, etc.) can be sent out of band, outside the real-time communication channel.
This avoids the severe errors introduced when a parameter set is not available, which would make the stream temporarily undecodable.
5. Flexible macroblock ordering: In this technique, subsequent macroblocks are part of different slices that are transmitted in different MTUs (packets). This way some of the inter-prediction is broken, but if one of the slice packets is lost, concealment is possible using the other slice.
In his experiments, Wenger (2003) tests the mechanisms for error rates of 0%, 3%, 5%, 10% and 20% on two video test sets. The results show that the mechanisms described above improve the quality of the received signal considerably. Without these types of control, the signal becomes useless at error rates of 3%, while with error resilience, error rates up to 20% still seem to give reasonable output signals. In REVERIE we should choose the right mechanisms to support error resilience. A new video coding standard, HEVC/H.265, is planned for January 2013 by ITU/MPEG. While new compression gains are expected, for REVERIE it is crucial to rely on robust existing hardware/software solutions (preferably open source) to implement our rendering and coding platforms. For this reason HEVC/H.265 is considered out of scope. H.264 is the international video standard that achieves the best compression and is designed for network-friendly use. Open-source implementations of both encoders and decoders are available that can run in real time on normal processors, such as the Intel i7. To guarantee low decoding and coding latency, the right profile needs to be selected (Baseline) and encoder tuning needs to be performed. Various options for achieving error resilience are available that give reasonable quality at loss rates up to approximately 20%. H.264 is recommended for video streaming in REVERIE.
2.4.3. 3D Stereo Video
Generally speaking, there are two types of 3D video. The first is stereo video, which uses two images to create depth perception. The second refers to the ability to choose the viewpoint in the scene and is called free viewpoint video. Both technologies can also be combined (stereo + free viewpoint) to enable both effects. With 3D stereo video, each eye views its particular image and the 3D appearance is constructed in the brain by merging the images from the two eyes. So the challenge of a 3D stereo video display is to make sure that each eye sees its own image. Traditionally, special glasses with red/green filters are used to separate the two images. Moreover, novel auto-stereoscopic displays using lenticular screens can enable 3D stereo video viewing without glasses. In the case of normal stereo video, a viewer who moves while watching does not get a different view of the scene. A simple way to provide this functionality is to store video plus depth and to calculate the two views by interpolation. So 3D stereo video is stored either as a combination of left and right video using conventional video formats (MPEG-2/H.264) or as a video plus a depth map (which can be a monochromatic image of the depth values). Compression of conventional stereo video is generally performed by compressing the first video and applying inter-view prediction to the second video. The difference between the videos generated by two different cameras is approximately fixed by their properties and geometrical setup.
The displacement between the two views can be represented by a dense motion vector map called a disparity map. Subsequently, only the difference between the estimate and the real signal has to be transmitted in addition. This principle was standardized in MPEG-2, and similar representations are available in H.263, MPEG-4 and H.264. The general approach is illustrated in Figure 44: the left view is independently coded while the right view uses inter-view prediction. Note: generally, subsequent frames are more similar than stereo frames, and most of the additional compression is achieved when coding I-frames. Figure 44: Inter-view coding for conventional stereo video (Smolic, 2011) An alternative to transmitting both the left and right frames is to transmit one central view together with a depth map indicating the depth of the different objects in this image. The stereo pair can then be generated at the receiver side. In the ATTEST project (Fehn et al. 2002), the compression efficiency of depth data in combination with state-of-the-art codecs was tested. The results showed that depth data can be compressed to about 10-20% of the original colour data. Subsequently, a backwards compatible bit-stream format for distributing video and depth was developed and added to MPEG-2 and H.264. The main disadvantage of video plus depth is the generation of the depth map; however, in REVERIE we can use sensors like Kinect to generate depth maps. In REVERIE we recommend video plus depth for 3D video. Video plus depth is more efficient in terms of compression than conventional stereo and allows some change in viewing direction. The difficulty of obtaining depth maps should be tackled by the partners working on the capturing part (WP4). 2.4.4. Free viewpoint / Multi-view video This section is largely based on the more extensive overview of the topic presented in Smolic et al. (2008a). In multi-view coding, the user can choose from which angle/viewpoint they want to view the scene. This technology has links to computer graphics, where multiple viewpoints of 3D models/scenes are often supported. The way the scene is represented is important, as it determines the capturing setup needed, the storage format and the rendering methods available. According to Smolic et al. (2008a), 3D scenes can be represented as a trade-off between two extremes: image based and geometry based. Smolic et al. (2008a) include a figure categorizing the existing applications on a scale from image based to geometry based. Geometry based generally refers to the rendering of a 3D model from a particular viewpoint, including texture and other enhancements. Mesh representations are an example of a 3D model that is often employed. Image based generally refers to interpolating multiple camera views to obtain the desired view; this method has no knowledge of the 3D geometry of objects. In this section we briefly look at image-based representations and the compression aspects relevant to REVERIE. In response to a call for proposals to support multi-view in H.264, a design based on inter-view prediction by the Heinrich Hertz Institute in Berlin was selected. Results of testing using PSNR (Peak Signal-to-Noise Ratio) showed gains of up to 2 dB compared to coding all videos independently (Merkle et al. 2006). The design was added as an amendment to the H.264 standard. Figure 45 shows some of the prediction relations: by assuming one stream is known, other correlated streams can achieve some extra compression through inter-view prediction.
Figure 45: Inter-view prediction in Multi-view Video Coding (MVC) (Smolic, 2011) 2.4.5. 3D Mesh Compression In Smolic et al. (2008c) an overview is given of the compression of static and dynamic meshes; most of the content in this section is based on that book section. A mesh M is a data structure M(C,G) that consists of a connectivity part C and a geometry part G and can represent any 3D shape in a scene. The data structure basically represents 2D surfaces in 3D space spanned between edges and vertices. Mapped to 2D space, these surfaces are faces in a graph structure of edges and vertices. Often these faces are restricted to triangles. The vertices, edges, faces and an incidence relationship represent the connectivity of the mesh. The geometry of the mesh contains the mapping of the vertices, edges and faces/triangles back to 3D. It is clear that a scene based on a mesh data structure is very different from a scene consisting of interpolated multi-view images. Connectivity C(V,E,F,R): a graph structure with only triangle faces. Geometry G (mapping): the mapping of vertices from C to points in 3D geometric space, and possibly line segments. Vertices can also carry colour or texture that can be rendered on the face surface. For static meshes, Smolic et al. (2008c) present two types of coders: Single-rate coders: The connectivity graph is encoded and decoded as one stream, as a sequential strip of triangles, using either a spanning tree or region-growing methods. The geometry (mapping of vertices in space) uses previously decoded vertices and the connectivity graph to predict the coordinates of the next vertices and thus obtain good compression rates (a toy sketch of this kind of geometry prediction is given below). Progressive coders: This approach can be compared to scalable video coding. By collapsing edges or removing vertices, a simpler mesh is represented first as a base layer, and refinement is then added in each step. To decide which edges/vertices to collapse, different constraint metrics are taken into account, such as volume preservation or boundary preservation. Often, when the decoder knows how the encoder works, the mesh can be reconstructed completely in progressive steps. Other lossy progressive techniques include multi-resolution approaches; such techniques send a coarser (lower resolution) mesh first that can be upgraded in later steps but do not reconstruct the original mesh exactly. In the case of dynamic meshes, it is not necessary to send a new full mesh at every time instant. With most objects the connectivity of the mesh will remain the same; only the 3D coordinates of the vertices will change. The geometry can thus be modelled by a static and a dynamic part: the static part represents the mesh structure and the dynamic part the motion parameters of the individual meshes. Another way to save even more is to use key meshes (represented by the original mesh and displacements) together with interpolated meshes. Some known algorithms/standards for 3D dynamic mesh compression are MPEG-4 AFX, Dynapack, D3DMC and RD3DMC. They all contain prediction and sometimes clustering techniques (to detect whether parts of the mesh can be modelled as a rigid body). For REVERIE, depending on the chosen scene representation, mesh-based models can be very efficient from a compression point of view. However, the interpolated/predicted movement might sometimes look unnatural.
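To make the single-rate geometry coding described above concrete, the following toy sketch (not any of the cited codecs) quantises the vertex positions of a triangle strip and predicts each new vertex with the parallelogram rule, so that only small residuals need to be stored; the strip traversal, the quantisation step size and the absence of entropy coding are simplifying assumptions.

```python
"""Toy sketch of single-rate mesh geometry coding with parallelogram
prediction, in the spirit of the coders surveyed by Smolic et al. (2008c).
Illustrative only: real coders traverse the decoded connectivity graph
and entropy-code the residuals; here a simple triangle strip and plain
residual lists stand in for both."""

import numpy as np

QUANT = 1024  # quantisation steps per unit length (assumed step size)

def quantise(vertices):
    """Map float coordinates to an integer grid (the lossy step)."""
    return np.round(np.asarray(vertices, dtype=float) * QUANT).astype(np.int64)

def encode_strip(qverts):
    """Encode a quantised triangle strip: the first three vertices are sent
    as-is; every further vertex v[i] is predicted from the preceding
    triangle by the parallelogram rule p = v[i-1] + v[i-2] - v[i-3],
    and only the residual v[i] - p is stored."""
    residuals = [qverts[i].copy() for i in range(min(3, len(qverts)))]
    for i in range(3, len(qverts)):
        pred = qverts[i - 1] + qverts[i - 2] - qverts[i - 3]
        residuals.append(qverts[i] - pred)  # small residuals compress well
    return residuals

def decode_strip(residuals):
    """Invert encode_strip: rebuild vertices by adding each residual to the
    parallelogram prediction."""
    verts = [residuals[i].copy() for i in range(min(3, len(residuals)))]
    for i in range(3, len(residuals)):
        pred = verts[i - 1] + verts[i - 2] - verts[i - 3]
        verts.append(pred + residuals[i])
    return np.stack(verts)

if __name__ == "__main__":
    strip = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 1.0, 0.0),
             (1.5, 1.0, 0.1), (1.0, 2.0, 0.1)]
    q = quantise(strip)
    # Lossless reconstruction after the quantisation step.
    assert np.array_equal(decode_strip(encode_strip(q)), q)
```

In a full coder the residuals would subsequently be entropy coded and the traversal order would be derived from the decoded connectivity rather than from a pre-arranged strip.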
Depending on the achievements of the capturing work in WP4 (can we capture a human as a mesh to start with?), we can use a mesh-based representation in use case 1 or use case 2. Also, there is strong rendering and decompression support from graphics cards, possibly allowing even the rendering of the human face. Moreover, regarding WP5, as most compression techniques have been “optimized” for the graphics pipeline, interesting research could arise from studying their properties when transmitted over the network. 2.4.6. Additional requirements for 3DTI streaming In addition to the basic requirements for general streaming, there are some additional challenges for 3DTI streaming: Handling multiple streams from the same site: o Streaming based on user view (Yang et al. 2009) o Complexity of the 3D compression algorithm o Intra-synchronization among streams from one site (Ott and Mayer-Patel, 2004; Huang et al. 2011) o Inter-synchronization among streams from multiple sites (Ott and Mayer-Patel, 2004) Bandwidth competition between related streams from different senders: Related streams from different cameras will compete for bandwidth in the network. This can lead to retransmissions and extra latency; this effect is described in Ott and Mayer-Patel (2004). In their approach, gateway routers maintain connection state, and additional routers and packets carry an additional header to allow the gateway to do this. As a result the application can adapt to the state of the overall connection. Inter- and intra-synchronization of multiple streams: In Huang et al. (2011) an entire architecture is proposed for a 3DTI system that keeps the skew between streams below a certain value. It handles inter-destination synchronization and the synchronization between different senders. Also, by using an overlay, network latency and bandwidth are kept within constraints. Previous research thus reveals some new issues in media streaming for 3DTI, and in REVERIE we are likely to see new and more complex versions of these problems by streaming more heterogeneous data, such as a composition model, synchronization of different media, point clouds, 2D movies and pictures, silhouettes/skeletons from Kinect data, sensor data and 3D movies. Therefore in REVERIE, more complicated synchronization and user issues will arise that will have to be dealt with. These issues may lead to novel protocols to support the future internet. 2.5. Signaling for 3DTI real-time transmission Signaling mechanisms allow for the setup and tear-down of sessions, similar to traditional phone calling mechanisms. In the telecom (ITU) and internet (IETF) worlds, many standards have been developed for universal session handling. Well known are SIP and H.323 for telephony and video conferencing applications. As the end goal in REVERIE is to enable immersive communication between (groups of) participants, session initiation is of utmost importance. It should be easy to set up reliable sessions with other participants, the friends, teachers or co-workers we want to share an experience with; a minimal sketch of such a session lifecycle is given after the tool description below. Available tools from the literature and REVERIE partners Ambulant SMIL player Description: CWI developed and maintains the open source SMIL player AMBULANT. When SMIL is employed, this software can be used/enhanced to provide specific signaling functionality.
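The following minimal sketch illustrates the kind of session lifecycle (join, acknowledge, leave) such signaling has to support. The message names JOIN/ACK/LEAVE and the Session object are hypothetical assumptions made for this illustration; they correspond neither to SIP nor to any agreed REVERIE design.

```python
"""Hypothetical sketch of session setup/teardown signaling for a shared
REVERIE session. Message names (JOIN, ACK, LEAVE) and the in-memory
Session object are illustrative assumptions, not SIP or an agreed
REVERIE protocol."""

from dataclasses import dataclass, field

@dataclass
class Session:
    session_id: str
    participants: set = field(default_factory=set)

    def handle(self, message: dict) -> dict:
        """Process one signaling message and return the reply."""
        kind, who = message["type"], message["from"]
        if kind == "JOIN":
            self.participants.add(who)
            # The reply lists current members so the newcomer can set up media streams.
            return {"type": "ACK", "members": sorted(self.participants)}
        if kind == "LEAVE":
            self.participants.discard(who)
            return {"type": "ACK", "members": sorted(self.participants)}
        return {"type": "ERROR", "reason": f"unknown message {kind!r}"}

if __name__ == "__main__":
    s = Session("reverie-classroom-1")
    print(s.handle({"type": "JOIN", "from": "alice"}))
    print(s.handle({"type": "JOIN", "from": "bob"}))
    print(s.handle({"type": "LEAVE", "from": "alice"}))
```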
2.5.1. Internet Protocols Signalling functions in previous distributed multimedia applications are often based on IETF-standardized protocols such as RTP/RTCP, SIP, RTSP and SDP. This is necessary for compatibility with internet infrastructure such as the IP Multimedia Subsystem (IMS). The IMS is a telecommunication architecture that integrates the internet with fixed and mobile networks. It offers, for example, interoperability between PSTN phones, mobile phones (GSM and up) and the internet. Even though most operators do not fully implement the architecture, they use some of its components to achieve interoperability. While currently the IMS supports traditional applications such as telephony (VoIP) and television (IPTV), it is designed for complex multimedia applications such as video conferencing and perhaps tele-immersion. IMS uses internet protocols for signalling, such as SIP and SDP. In the next section, we will discuss the MPEG-4 signalling framework, which was developed for advanced multimedia applications with multiple media streams/objects and 3D capabilities. From their six years of experience with 3DTI systems, Nahrstedt et al. (2011b) conclude that these protocols do not take the requirements of 3DTI into account very well, and they present their experiences/solutions. They developed solutions for five signalling issues: session initiation, multi-stream topology, lightweight session monitoring, session adaptation and session teardown. We will summarize each of these issues, except topology. Session initiation: a) Registration of various devices and resources (in REVERIE this can be 3D/2D renderers, 3DTVs and so on). b) Construction of an initial content dissemination topology based on the overlay network formed between service gateways (similar to path selection for two-way communications). In REVERIE we will be mostly concerned with a), as we will have scalable terminals that allow for different devices. SIP-like messages such as camera_join, display_join and gateway_join will need to be compatible with non-network devices connected directly to the gateway (USB, FireWire, or PCI). Session monitoring: Nahrstedt et al. (2011b) developed Q-Tree as a solution for detecting component failures and monitoring the status of metadata in the network. It builds a management overlay for signalling traffic based upon low Round Trip Times (RTT) (minimum latency). Designing the overlay is already complex for a modest number of nodes (approximately 10), as the construction of a degree-bounded minimum spanning tree is an NP-hard problem (k-MST). However, they develop a suitable heuristic for finding an appropriate overlay. After the overlay is constructed, high-level semantic queries can be compiled by the query engine and sent into the overlay. The novelty compared to other systems is the support for range queries on multiple attributes (e.g. bandwidth is between 10 and 20, CPU utilization is greater than 60%, etc.; a toy illustration is sketched below). When a node issues a query, the query is routed to the hierarchically distributed nodes that have this range assigned and contain the requested metadata item. This metadata item is stored in the content store of the node and can be either local metadata or metadata from a remote node. To track metadata changes, the hierarchically organized nodes are periodically updated. In the case of highly variable metadata such as loss, jitter, bandwidth or CPU utilization, a high frequency of updates would be required.
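To make the multi-attribute range query idea concrete, the sketch below simply filters a flat table of node metadata. The attribute names and values are invented, and the flat evaluation (no hierarchical overlay routing, no periodic updates) is a simplifying assumption rather than a reproduction of the Q-Tree design.

```python
"""Toy illustration of a multi-attribute range query over node metadata
(bandwidth between 10 and 20 Mbit/s, CPU utilisation above 60%). A flat
dictionary stands in for Q-Tree's hierarchical overlay; attribute names
and values are made up for the example."""

node_metadata = {
    "gateway-A": {"bandwidth_mbit": 12.0, "cpu_percent": 75.0, "loss_percent": 0.5},
    "gateway-B": {"bandwidth_mbit": 35.0, "cpu_percent": 40.0, "loss_percent": 0.1},
    "renderer-C": {"bandwidth_mbit": 18.5, "cpu_percent": 62.0, "loss_percent": 2.0},
}

def range_query(metadata, constraints):
    """Return the nodes whose metadata satisfies every (attribute, lo, hi)
    constraint; lo or hi may be None for a one-sided range."""
    matches = []
    for node, attrs in metadata.items():
        ok = True
        for attr, lo, hi in constraints:
            value = attrs.get(attr)
            if value is None or (lo is not None and value < lo) or \
               (hi is not None and value > hi):
                ok = False
                break
        if ok:
            matches.append(node)
    return matches

if __name__ == "__main__":
    query = [("bandwidth_mbit", 10, 20), ("cpu_percent", 60, None)]
    print(range_query(node_metadata, query))  # ['gateway-A', 'renderer-C']
```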
For such highly variable metadata, the authors indicate that a simple multicast seems better suited to handling range queries (ask each node by multicast; the right one will reply with its metadata). Moreover, the approach assumes uniformly distributed metadata, or that the distribution of the metadata is known, in order to implement the hierarchical query overlay; in many practical cases this assumption does not hold. Session adaptation: Approaches that adapt to the user view and interest are presented in Yang et al. (2009) and Wu et al. (2009). Session teardown: As one terminal consists of multiple devices, the devices should also be able to leave the session gracefully, i.e. a camera_leave message should be specified. Gateways or terminals should be able to leave using a gateway_leave message. For REVERIE it seems we can use less sophisticated approaches, as we simplified the traffic load in use case 1 and reduced the number of nodes in use case 2 to three. Therefore a more reduced scheme is useful. Also, in the current state of the art, the viewpoint of the users matters and determines which streams in the network are given priority and which are not. View control should therefore be incorporated in the signalling mechanism. 2.5.2. MPEG-4 transmission framework A framework that supports signalling and transmission is MPEG-4. MPEG-4 is an extensive multimedia framework that consists of various tools for multimedia applications. MPEG-4 part-1 (systems) provides a way to link media objects and achieve synchronization. MPEG-4 part-6, on the other hand, provides a framework for monitoring and signalling MPEG-4 sessions: the Delivery Multimedia Integration Framework (DMIF). DMIF offers its services to the application through the DMIF Application Interface (DAI) and specifically provides a DMIF network interface for communications over IP. DMIF has primitives for starting sessions and services, monitoring sessions, and performing user commands. In principle these services are suitable for use in an IP-based 3DTI system. However, most of its functionality over IP is implemented by making use of internet signalling protocols such as RTSP, SIP, SDP and RTP/RTCP, which can be used directly instead. Nevertheless, as other parts of MPEG-4 support 3D scene description with spatial composition (MPEG-4 part-11 BIFS), 3D and 2D mesh coding schemes (part-16), high-level video compression (part-10 MPEG-4 AVC/H.264), and audio and visual coding standards (part-2 and part-3), an MPEG-4 interface could still prove useful to REVERIE. 2.5.3. SMIL multimedia presentation SMIL is a multimedia presentation language that can be used to author and describe multimedia presentations, including the synchronization of different objects and the maintenance of state. In November 2008, version 3.0 of this XML-based description language was released by the World Wide Web Consortium (W3C). It can, simplistically, be seen as the multimedia equivalent of HTML: it allows spatial composition of media objects, as HTML allows spatial composition of text, links and images. However, as multimedia is more time dependent than text, SMIL also offers many timing and synchronization features; a small illustration of this timing composition is given below. A reference to its specification of elements and attributes is given in Bulterman and Ruthledge (2008). Playback of SMIL documents requires a specific player that can be embedded in the web browser. CWI developed and actively maintains the open source SMIL player AMBULANT, which supports most SMIL functionality and runs on most platforms/web browsers.
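As a small illustration of SMIL's timing composition (hand-written, not produced by AMBULANT or any REVERIE component), the following sketch assembles a SMIL fragment in which an audio track plays in parallel with a sequence of two images; the media file names and durations are invented for the example.

```python
"""Build a minimal SMIL document: an audio clip in parallel with a
sequence of two images. Only the <par>, <seq>, <audio> and <img> elements
of the SMIL timing/media modules are used; file names and durations are
invented."""

import xml.etree.ElementTree as ET

def build_smil():
    # Default namespace as used by SMIL 3.0 (for illustration only).
    smil = ET.Element("smil", xmlns="http://www.w3.org/ns/SMIL")
    body = ET.SubElement(smil, "body")
    par = ET.SubElement(body, "par")                 # children start together
    ET.SubElement(par, "audio", src="narration.mp3")
    seq = ET.SubElement(par, "seq")                  # children play one after another
    ET.SubElement(seq, "img", src="slide1.png", dur="5s")
    ET.SubElement(seq, "img", src="slide2.png", dur="5s")
    return ET.tostring(smil, encoding="unicode")

if __name__ == "__main__":
    print(build_smil())
```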
Therefore in REVERIE, specifically for use case 1, ambulant can be a web based renderer that can compose and synchronize various media formats based on SMIL. While SMIL is not widely deployed, its features seem to be getting adopted in newer web standards, HTML 5 and MPEG-DASH. 2.6. MPEG-V Framework The MPEG-V (ISO/IEC 23005) media context and control is an official standard that intends to bridge the differences in existing and emerging virtual worlds while integrating existing and emerging (media) technologies like instant messaging, video, 3D, VR, AI, chat, and voice. The standard was defined as an integral part and deliverable of the ITEA2 Metaverse1 project. Work is already advancing on a second version of the standard to extend its application domains. There is a lot of interest, for instance in biosensors; measuring vital body parameters and using them as inputs for either games or lifestyle-related applications. Next to that, Gas and Dust, Gaze Tracking, Smart Cameras, Attributed Coordinate, Multi-Pointing, Wind and Path Finding sensors are under consideration. The official standard consists of the following parts. 2.6.1. Architecture The system architecture of the MPEG-V framework is depicted in the next figure. For more information, please refer to the MPEG-V documentation. 2.6.2. Control information This part specifies syntax and semantics required to provide interoperability in controlling devices in real as well as virtual worlds. The adaptation engine (RV or VR engine), which is not within the scope of standardization, takes five inputs (1) Sensory Effects (SE), (2) User’s Sensory Effect Preferences (USEP), (3) Sensory Devices Capabilities (SDC), (4) sensor capability, and (5) Sensed Information (SI) and outputs sensory devices commands and/or SI to control the devices in real worlds or a virtual worlds' object. The scope of this part covers the interfaces between the adaptation engine and the capability descriptions of actuators/sensors in the real world and the user’s sensory preference information, which characterizes devices and users, so that appropriate information to control devices (actuators and sensors) can be generated. In other words, user’s sensory preferences, sensory device capabilities, and sensor capabilities are within the scope of this part. For more information, please refer to the MPEG-V documentation. 72 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Figure 41: System Architecture of the MPEG-V Framework 2.6.3. Sensory information Sensory information which is part of standardization area B specifies syntax and semantics of description schemes and descriptors that represent sensory information. Also haptic, tactile, and emotion information fit in this part. The concept of receiving sensory effects in addition to audio/visual content is depicted in the next figure. Figure 42: Concept of MPEG-V Sensory Effect Description Language The Sensory Effect Description Language (SEDL) is an XML Schema-based language which enables one to describe so-called sensory effects such as light, wind, fog, vibration, etc. that trigger human senses. The actual sensory effects are not part of SEDL but defined within the Sensory Effect Vocabulary (SEV) for extensibility and flexibility allowing each application domain to define its own sensory effects. A description conforming to SEDL is referred to as Sensory Effect Metadata (SEM) and may be associated to any kind of multimedia content (e.g., movies, music, Web sites, games). 
73 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools The SEM is used to steer sensory devices like fans, vibration chairs, lamps, etc. via an appropriate mediation device in order to increase the experience of the user. That is, in addition to the audiovisual content of, e.g. a movie, the user will also perceive other effects such as the ones described above, giving her/him the sensation of being part of the particular media which shall result in a worthwhile, informative user experience. For more information, please refer to the MPEG-V documentation. 2.6.4. Virtual world object characteristics This 4th part specifies syntax and semantics of description schemes and descriptors used to characterize a virtual world object related metadata, making it possible to migrate a virtual world object (or only its characteristics) from one virtual world to another and/or control a virtual world object in a virtual world by real word devices. For more information, please refer to the MPEG-V documentation. 2.6.5. Data formats for interaction devices This part specifies the syntax and semantics of the data formats for interaction devices, i.e., Device Commands and SI, required for providing interoperability in controlling interaction devices and in sensing information from interaction devices in real and virtual worlds. For more information, please refer to the MPEG-V documentation. 2.6.6. Common types and tools Part 6 specifies the syntax and semantics of the data types and tools common to the tools defined in other parts of MPEG-V. To be specific, data types which are used as basic building blocks in more than one tool of MPEG-V, for example; color-related basic types and time stamp types which can be used in device commands and sensed information to specify timing. For more information, please refer to the MPEG-V documentation. 2.6.7. Conformance and reference software For more information, please refer to the MPEG-V documentation. Usefulness to the project Scene composition and visualization (WP7) will depend heavily on acquired sensory information (WP4) and user interaction and autonomy (WP6). The MPEG-V standard on the other hand defines a standard interface and architecture for various kinds of sensory information (maybe of interest to WP4), data formats for interaction devices (maybe of interest to WP6), and virtual world object characteristics (may be of interest to WP7). In defining the interface between WP4, WP6 and WP7, the use of the MPEG-V (ISO/IEC 23005) standard should be taken into account. Furthermore, a comparison between the standardized MPEG-V architecture and the proposed FP7 REVERIE architecture may be required. In the event that the MPEG-V standard would be of importance or interest to the partnership, the standard could be adopted and updates or improvements to the standard may be submitted to the ISO/IEC subcommittee of the related joint technical committee of the related workgroup. 74 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools 3. Related Tools to WP6: Interaction and autonomy 3.1. 3D Avatar Authoring Tools 3D virtual characters are basically created in the same way as the virtual environments they are intended to populate are brought to life. The same techniques are used in order to hand-model (or sometimes capture and reconstruct) a polygon mesh representing the static appearance of the character. The latter may be enhanced for increased detail, by applying advanced texture mapping techniques (such as normal maps). 
This static appearance is usually complemented by a rig, a set of control structures used to control the mesh vertices and apply animation to the character. A rig’s most powerful advantage is that it reduces the overall complexity of the character mesh (consisting of hundreds of vertices) into a limited set of degrees of freedom, representing the realistic movement of the human body. Skeletal rigs are the most common type of body animation rigs used by modellers and animators, as they provide an abstraction to the human skeleton. The mesh control structures are represented as bones, connected to one another using rotational joints, ultimately forming a hierarchy. Animation is applied by rotating specific joints, applying rotational effects to any possible children joints further down the hierarchy, therefore roughly simulating actual skeletal movement. Animation of the bones is applied to the vertices of the mesh using a technique called smooth skinning. This process consists of applying several weighted influences on each vertex, signifying which, and how much bones contribute to its displacement from the static appearance. Facial animation is usually handled by separate facial rigging techniques, which include the traditional bone structure (with the exception being that facial bones are mainly interconnected by translational joints, and are not an abstraction to any actual skeletal structure of the face), and the more popular Morph Targets or Blend Shapes (Parke, 1972). This latter technique consists of actually displacing vertices on the facial mesh to new locations, forming several expressions. These actions are stored as shape keys and are blended together to produce new facial expressions. A third method that has been proposed for extreme accuracy but not yet used in practice due to complexity, is the physically realistic modelling of the human facial muscles (Terzopoulos and Waters, 1990). Figure 43: Polygon Mesh and Rig combined to produce the skinned model in a new pose As the creation of 3D virtual characters is at the forefront of many multi-billion dollar industries including film and gaming, many commercial, as well as open source tools have been developed in order to speedily produce realistic humanoid meshes and rigs for integration onto graphics engines. Most of these tools, as well as popular 3D modelling software come with pre-implemented tools for exporting meshes and rigs to a usable file format, storing information on the mesh vertices, normals and texture coordinates, as well as the rig's joint hierarchy and rotational constraints. Popular 75 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools export formats include the Wavefront OBJ format (which only supports mesh export), the COLLADA (.dae), the 3D Studio (.3ds) and the DirectX (.x) formats for both mesh and rig export. Graphics engines developers have also been known to implement their own formats, best suiting the needs of their engines. Usefulness to the project As REVERIE aims to provide users with the tools to create their own virtual avatars, the use of 3D avatar authoring tools is mandatory. Users should be able to customize their character to their liking and should not be bothered with the polygon mesh modelling, rigging and skinning processes, which should be handled automatically by the authoring tool. Available tools from the literature and REVERIE partners MakeHuman Description: Open Source http://www.makehuman.org/ tool for creating 3D human characters. 
PeoplePutty Description: People Putty is a commercial tool that allows users to create interactive 3D characters and dress them up. These characters can then be used to act as tour guides for personal web pages. http://www.haptek.com/ Dependencies on other technology: Haptek Player Digimi Avatar Studio Description: Digimi tools allow users to create and personalize a realistic 3D avatar from a single face image. The avatars which are created can be deployed in virtual worlds, crossplatform games, social networks, mobile applications and animation tools. It is an easy-to-use platform for generating personalized Avatars delivered in 3D Flash format, which is compatible with all web rich-media, social applications and games. http://www.digimi.com/newsite/presite/home.jsp ICT Virtual Human Toolkit Description: A collection of modules, tools and libraries that allows users, authors and developers to create their own virtual humans. http://vhtoolkit.ict.usc.edu/index.php/Main_Page Dependencies on other technology: AcquireSpeech; Watson; NPCEditor; Non-verbal Behaviour Generator (NVBG); SmartBody. Evolver Avatar Engine Description: A free avatar creation engine, which allows users to quickly build an avatar and make it available on dozens of online destinations such as movies in social media, virtual worlds or massively multiplayer online games. Autodesk MotionBuilder Description: Autodesk MotionBuilder is a real-time 3D character animation software, which is particularly useful for motion-capture data. http://usa.autodesk.com/adsk/servlet/pc/index?id=13581855&siteID=123112 Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max 76 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Image Metrics' PortableYou Description: The PortableYou platform is a suite of web services that enables a developer to create applications with advanced avatar creation, customization, and projection functionality. With a PortableYou enabled application, end-users can instantly generate customizable 3D avatars from photos of their face and carry these avatars across other enabled third-party applications. http://www.image-metrics.com/Portable-You/PortableYou Digital Art Zone's DAZ Studio 4 Pro Description: DAZ Studio is a feature rich 3D figure customization, posing, and animation tool that enables the creation of stunning digital illustrations and animations. DAZ Studio is the perfect tool to design unique digital art and animations using virtual people, animals, props, vehicles, accessories, environments and more. Simply select your subject and/or setting, arrange accessories, setup lighting, and begin creating beautiful artwork. http://www.daz3d.com/i/products/daz_studio 3.2. Animation Engine Building an Embodied Conversational Agent (ECA) system needs the involvement of many research disciplines. Issues like speech recognition, motion capture, dialog management, or animation rendering require different skills from their designers. Soon it became obvious that there was the need to share expertise and to exchange the components of an ECA system. SAIBA (Vilhjálmsson et al. 2007) is an international research initiative whose main aim is to define a standard framework for the generation of virtual agent behaviour. It defines a number of levels of abstraction, from the computation of the agent’s communicative intention, to behaviour planning and realization, as shown in Figure 49. 
Figure 44: SAIBA architecture The Intent Planner module decides the agent’s current goals, emotional state and beliefs, and encodes them into the Function Markup Language (FML) (Heylen et al. 2008). To convey the agent’s communicative intentions, the Behaviour Planner module schedules a number of communicative signals with the Behaviour Markup Language (BML). It specifies the verbal and non-verbal behaviours of ECAs [Vilhjálmsson et al. 2007]. Finally the task of the third element of the SAIBA framework, Behaviour Realizer, is to realize the behaviours scheduled by the Behaviour Planner. It receives input in the BML format and it generates the animation. A feedback system is needed in order to inform the modules of the SAIBA about the current state of the generated animation. This information is used, for example, by Intent Planner to re-plan the agent’s intentions when an interruption occurs. 77 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools There exist several implementations of the SAIBA standard. SmartBody [Thiébaux et al. 2008] is an example of the Behaviour Realizer. It takes as input BML code (including speech timing data and the world status updates), and it composes multiple behaviours and generates the character animation synchronized with audio. For this purpose it uses an extended version of BML, allowing one to define interruptions and predefined animations. SmartBody is based on the notion of animation controllers. The controllers are organized in a hierarchical structure. Ordinary controllers manage the separate channels, e.g. pose or gaze. Then the meta-controllers manipulate the behaviours of subordinate controllers allowing the synchronization of the different modalities to generate consistent output from the BML code. SmartBody can be used with the NVBG that corresponds to the Behaviour Planner in the SAIBA framework. It is a rule-based module that generates BML annotations for non-verbal behaviours from the communicative intent and speech text. SmartBody can be used with different characters, skeletons and different rendering engines. Heloir and Kipp (2009) extend the SAIBA architecture by a new intermediate layer called the animation layer. Their EMBR agent is a real-time character animation engine developed by the Embodied Agents Research Group that offers a high degree of animation control through the EMBRScript language. This language permits control over skeletal animations, morph target animations, shader effects (e.g. blushing) and other autonomous behaviours. Any animation in EMBRScript is defined as a set of key poses. Each key pose describes the state of the character at a specific point in time. Thus the animation layer gives access to animation parameters related to the motion generation procedures. It also gives to the ECA developer, the possibility to better control the process of the animation generation without constraining him to enter into the implementation details. Elckerlyc (van Welbergen et al. 2010) is a modular and extensible Behaviour Realizer following the SAIBA framework. It takes as input a specification of verbal and non-verbal behaviours encoded with extended BML and can eventually give feedback concerning the execution of a particular behaviour. Elckerlyc is able to re-schedule behaviours that are already queued with behaviours coming from a new BML block in real-time, while maintaining the synchronization of multimodal behaviours. 
It receives and processes a sequence of BML blocks continuously allowing the agent to respond to the unpredictability of the environment or of the conversational partner. Elckerlyc is also able to combine different approaches to animation generation to make agent motion more humanlike. It uses both procedural animation and physical simulation to calculate temporal and spatial information of motion. While the physical simulation controller provides physical realism of motion, procedural animation allows for the precise realization of the specific gestures. BMLRealizer (Arnason and Porsteinsson, 2008) created in the CADIA lab is another implementation of the Behaviour Realizer layer of the SAIBA framework. It is an open source animation toolkit for visualizing virtual characters in a 3D environment that is partially based on the SmartBody framework. As input it also uses BML; the output is generated with the use of the Panda3D rendering engine. RealActor (Cerekovic et al. 2009) is another BML Realizer developed recently. It is able to generate the animation containing verbal content that is complemented by a rich set of non-verbal behaviours. It uses the algorithm, based on neural networks, to estimate the duration of words. Consequently it can generate the correct lip movement without explicit information about the phonemes (or visemes). RealActor was integrated in various open-source 3D engines (e.g. Ogre, HORDE3D). The Greta architecture is SAIBA compliant (Niewiadomski et al., 2011). It allows for creating cross-media and multi-embodiment agents (see Figure 50). It proposes a hierarchical organization for the SAIBA Behaviour Realizer. It also introduces different levels of customization for the agent. Different instantiations of the agent can share Behaviour Planning and Realization; only the animation computation and rendering display may need to be tailored. That is, the Greta system is 78 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools able to display the same communicative intention with different media (AR, VR, web, etc.), representations (2D, 3D) and/or embodiments (robots, virtual and web Flash-based agents). Figure 45: Greta architecture Usefulness to the project The SAIBA framework can be very useful to the REVERIE project. Its modularity and decomposition into several steps allow us to integrate various modules at different levels of the SAIBA framework. Virtual agents can be driven by setting their emotions and communicative intentions using EmotionML and FML-like languages (e.g. APML-FML). Identically, when necessary, they can be driven by specifying their behaviours using BML. Integration within larger frameworks, such as SEMAINE, can be ensured by using either FML or BML languages. Several virtual agent platforms such as Elckerlyc, Greta or SmartBody, etc. follow this standard. Available tools from the literature and REVERIE partners EMBR Description: Free, real-time animation engine for embodied agents that offers a high degree of animation control via the EMBRScript language. Animations are described either by prerecorded animations or by sequences of key poses influencing parts of a character. A key pose may specify a sub-skeleton configuration, a shader parameter value, a set of kinematic constraints, or a combination of morph targets. http://embots.dfki.de/EMBR/ FACEWARE Description: FACEWARE is a commercial performance-driven animation technology for the film and games industry. 
Developed over 10 years of animation production experience, FACEWARE utilizes a marker-less video analysis technology and artist-driven performance transfer toolset to deliver ultra-high fidelity, highly efficient, truly believable facial animation in a fraction of the time of more traditional methods. This software has been used internally by the Image Metrics 79 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools production team to produce thousands of minutes of facial animation as a service and has been an integral tool on many award winning facial animation projects. http://www.image-metrics.com/Faceware-Software/Overview Interactive Face Animation – Comprehensive Environment (iFACE) Description: iFace is a free Face Multimedia Object framework – a framework for all the functionality and data related to facial actions. http://img.csit.carleton.ca/iface/ Dependencies on other technology: DirectX v9.0c; .NET framework. SmartBody Description: SmartBody is a modular, controller-based character animation system. SmartBody uses BML to describe a character’s performance, and composes multiple behaviours into a single coherent animation, synchronizing skeletal animation, spoken audio, and other control channels. http://sourceforge.net/projects/smartbody/ FaceFX Description: FaceFX is a cutting edge solution for creating realistic facial animation from audio files. http://facefx.com Dependencies on other technology: Used in game engine pipelines (possibly in Unity, etc.) faceshift Description: faceshift is a new technology for real-time markerless facial performance capture and animation. The software automatically produces highly detailed facial animations based on FACS expressions from depth cameras such as Microsoft’s Kinect. faceshift works seamlessly for fast facial expressions, head motions, and difficult environments. http://www.faceshift.com/faceshift.html CharToon Description: Tool for authoring and real-time rendering of 2D (cartoon-like) graphics models. It has a 3-tier architecture (Graphics, Control, Choreography) and is designed for conversational agents that feature Lip-sync, Emotions, Gestures and can accept GESTYLE markup for nonverbal communication. http://www.cwi.nl/projects/FASE/CharToon Dependencies on other technology: SVG Elckerlyc Description: Elckerlyc is a BML compliant Behaviour Realizer for generating multimodal verbal and non-verbal behaviour for virtual humans. It follows the SAIBA framework. It supports animation of real-time continuous interaction. http://elckerlyc.ewi.utwente.nl/ MPEG4 H-Anim Description: MPEG-4 (ISO-IEC standard) adopts a VRML based standard for virtual human representation (H-Anim), and provides an efficient way to animate virtual human bodies. MPEG-4 standardizes the definition of the shape and surface of a model and anatomic deformations. By transfer of animation parameters it is an efficient and flexible way to animate virtual humans. 80 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools http://h-anim.org/ 3.3. Autonomous Agents An intelligent avatar can appear in many different forms: as a graphical representation of a human in a virtual reality system; a Non-Player Character (NPC) in 3D computer games; an online sales-assistant; an online nurse encouraging health diets and exercise, etc. (Rich and Sidner, 2009; Larson, 2010). For avatars to engage with humans credibly in an emotional and expressive manner, four key issues need to be taken into account (Rich and Sidner, 2009): 1. 
Engagement: whether you have interaction between two autonomous avatars; or an autonomous avatar and an avatar with a human user; or two avatars with human users, it is very important that the initiation of contact, ongoing contact and the termination of contact appears lifelike, smooth and familiar (Sidner et al., 2005). 2. Emotion: Using appropriate gestures, stances, facial expressions, voice intonations, etc., an avatar needs to express emotional information in a way that human users can understand. On the other hand, intelligent and autonomous avatars need to be able to recognize and understand these same emotional characteristics of the human user in order to behave and respond correctly (Gratch et al., 2009). 3. Collaboration: For the coordination of activities to occur, a high level of communication is required between the avatar and human user. To appear credible, collaboration also relies on engagement and emotion (Grosz and Kraus, 1996). 4. Social Relationship: It is becoming more and more common that developers are creating avatars which have long-term relationships with their human users. These avatars can work with and help human users with long-term weight-loss, healthy-eating diets, education and learning, etc. To have a successful social relationship between intelligent avatars and human users, the other three factors of engagement, emotion and collaboration are necessary. Virtual agents have been endowed with human-like conversational and emotional capabilities. To some extent, they can communicate emotions through facial expressions and body movements. Several computational models have been proposed to allow the agents to display a large palette of emotions, going beyond the six prototypical expressions of emotions (Arya et al. 2009). Emotions such as relief, embarrassment, anxiety and regret, can be shown as sequences of multimodal signals (Niewiadomski et al. 2011). Being a social interactant, agents can control the communication of their emotional states. Agents can display complex emotions such as the superposition of emotions, the masking of one emotional state by another one by combining signals of the different emotional states on their face (Niewiadomski and Pelachaud 2010; Bui 2004; Mao et al. 2008). Models of display rules (Prendinger and Ishizuka 2005) ensure the agents decide when to show an emotion and to whom. Several perceptual studies have shown that human users perceive when an agent lies through asymmetric facial expressions (Rehm and André 2005). They can also distinguish the display of a polite smile vs. an embarrassed or a happy one by the agent (Ochs et al. 2012). In an interaction, agents are both speaker and listener. Models of turn-taking ensure a smooth interaction schema. Agents are active listeners. They display backchannels to tell how they view what their interactants are saying (often with very limited natural language understanding); and how engaged in the interaction they are (Sidner et al. 2004). Imitating specific interactants’ behaviour allows the agent to maintain engagement as well as to build rapport (Bevacqua et al. to appear; Huang et al. 2010). Social capabilities such as politeness and empathy have also been considered to some extent. Agents can adapt their facial movements (Niewiadomski and Pelachaud 2010) and their gestures (Rehm and André 2005) depending on the social relationship with their interactants. 
These models work in very specific controlled contexts. By simulating the emotions their interactants can potentially feel, agents can display empathy toward them (Ochs et al. 2008). However, such models suffer from strong limitations, such as responding to users' anger by showing anger. Studies have shown that users' stress and frustration decrease when agents show empathy (Prendinger and Ishizuka 2005; Beale and Creed 2009). Empathic agents are preferred; they have been found to be more agreeable and caring. Human users show more satisfaction and engagement with them (Brave et al. 2005) and their task performance increases as well (Partala and Surakka 2004). Agents showing appropriate emotions achieve higher perceived believability, and are perceived to be warmer and more competent (Demeure et al. to appear). To simulate human intelligence and perform credibly in a virtual online world, autonomous avatars must interact in a way that is familiar to human users, and the processes and techniques reported in the state of the art for enabling this are extremely varied: Search and optimization: This technique involves intelligently searching through every possible solution. However, this can become too slow, as the number of possible solutions can grow exponentially depending on the problem being solved. In this case heuristics are used, which can be described as intelligent guesses on choosing the correct path to take, also known as "pruning". Optimization is a form of search where the initial step includes a guess at the correct path and the search continues from that point. In the literature, many researchers have used search and optimization techniques for the development of autonomous agents, such as Shao and Terzopoulos, 2005; Chung et al. 2009; and Codognet, 2011. Logic: When logic is used as a problem-solving technique in AI, it can take the form of a set of statements or facts which can be true or false, as in propositional logic; or a set of descriptors which outline objects and their properties and relationships, as in first-order logic; or even a set of statements which can have a truth value between 0 and 1, as in fuzzy logic. There are many types of logic programming and problem-solving techniques which have been used in AI and with intelligent avatars, such as in Pokorny and Ramakrishnan, 2005; Kulakov et al. 2009; and Drescher and Thielscher, 2011. Probability: Probability theory is used in AI when there is not a complete set of information about the world or the events that will occur. One of the most popular probabilistic methods is Bayesian networks, but there are many others, such as HMMs, Kalman filters, decision theory, etc. Researchers who have used such techniques for autonomous agents include Alighanbari and How, 2006; Moe et al., 2008; and Arinbjarnar and Kudenko, 2010. Classifiers and statistical learning methods: Classification involves examining a new piece of data, input, stimulus, action, etc. and matching it to a previously seen or known class. Once this recognition occurs, a decision can be made about what action to take based on previous experience. The learning can be done using many different techniques, such as neural networks, support vector machines, nearest-neighbour algorithms, decision trees, etc. However, for every new problem a suitable classifier has to be chosen, as no individual classification method suits all problems (a toy sketch of the nearest-neighbour case is given below).
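A minimal sketch of the nearest-neighbour case mentioned above: an avatar picks a reaction by matching a new observation (here a hand-crafted two-dimensional feature vector) against labelled examples. The features, labels and reactions are invented for the illustration and do not correspond to any REVERIE module.

```python
"""Toy nearest-neighbour classifier: choose an avatar reaction from a new
observation by finding the closest labelled example. The feature vectors
(voice energy, smile intensity) and labels are invented for illustration."""

import math

# (voice_energy, smile_intensity) -> perceived user state
training_examples = [
    ((0.9, 0.1), "agitated"),
    ((0.2, 0.8), "content"),
    ((0.1, 0.1), "disengaged"),
    ((0.8, 0.9), "excited"),
]

reactions = {
    "agitated": "adopt a calm posture and lower the speech rate",
    "content": "smile and continue the current topic",
    "disengaged": "ask a question to re-engage the user",
    "excited": "mirror the enthusiasm with expressive gestures",
}

def classify(sample):
    """Return the label of the training example nearest to `sample`."""
    return min(training_examples,
               key=lambda ex: math.dist(sample, ex[0]))[1]

if __name__ == "__main__":
    observed = (0.85, 0.2)                # loud voice, little smiling
    state = classify(observed)
    print(state, "->", reactions[state])  # agitated -> adopt a calm posture ...
```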
Examples of classification for intelligent avatars include Jebara and Pentland, 2002; Jadbabaie et al. 2003; Brooks et al, 2007; Oros et al. 2008; Parker and Probst, 2010; and Boella et al. 2011. Usefulness to the project Autonomous agents are a critical part of the REVERIE project; they are required to engage in understandable social interactions with other users in an emotional and expressive manner. They must respond in real time and portray behaviour and responses to the virtual environment and other users in a credible manner. They must learn to adapt their responses and behaviours to differing environments, such as a learning environment where interaction is required, and a narrative space (storytelling) where less interaction will occur. The REVERIE avatars may also be controlled by their human users and when this occurs the avatars must learn their behaviours 82 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools through a process of repetition and reward. “A device can either behave intelligently as a result of automated or human-controlled directions, or a device literally can be intelligent - that it requires no external influence to direct its actions” (Larson, 2010) Several of the presented tools (e.g. the SEMAINE platform) allow human users to interact with virtual agents in real-time. This interaction relies on the analysis of non-verbal cues extracted from users and from the model of social behaviours of the agent. The flexibility of their architecture (cf. SEMAINE) offers us to concentrate on the module we aim to extend (e.g. the attentive model or the emotional one) while relying on the other parts of the architecture. Connection to the existing modules is done using standard languages (EmotionML, FML, BML) and specific ones (e.g. SEMAINEML). Available tools from the literature and REVERIE partners Greta Description: Greta is a free, real-time 3D embodied conversational agent with a 3D model of a woman compliant with the MPEG-4 animation standard. She is able to communicate using a rich palette of verbal and non-verbal behaviours. Greta can talk and simultaneously show facial expressions, gestures, gaze, and head movements. http://perso.telecom-paristech.fr/~pelachau/Greta/ NPCEditor Description: A package for creating dialogue responses to inputs for one or more characters. It contains a text classifier based on cross-language relevance models that selects a character's response based on the user's text input, as well as an authoring interface to input and relate questions and answers, and a simple dialogue manager to control aspects of output behaviour. http://vhtoolkit.ict.usc.edu/index.php/NPCEditor Non-verbal Behaviour Generator (NVBG) Description: The NVBG is a tool that automates the selection and timing of non-verbal behaviour for ECA (aka Virtual Humans). It uses a rule-based approach that generates behaviours given information about the agent's cognitive processes but also by inferring communicative functions from a surface text analysis. The rules within NVBG were crafted using psychological research on non-verbal behaviours as well as a study of human non-verbal behaviours, to specify which non-verbal behaviours should be generated at each given context. In general, it realizes a robust process that does not make any strong assumptions about the markup of communicative intent in the surface text. 
In the absence of such a markup, NVBG can extract information from the lexical, syntactic, and semantic structure of the surface text that can support the generation of believable non-verbal behaviours. http://vhtoolkit.ict.usc.edu/index.php/NVBG Semaine Description: Semaine is a modular, real-time architecture of Human-Agent interaction. Its technologies embed a visual and acoustic analysis, a dialog manager and a visual and acoustic synthesis. The system can detect the emotional states of the user through analyzing facial expression, head movement and voice quality. Four virtual agents with specific personality traits including different facial models, voice quality and behaviour sets have been defined. The Semaine project is an EU-FP7 1st call STREP project and aims to build a Sensitive Artificial Listener (SAL). SAL is a multimodal dialogue system which can: 1. Interact with humans with a 83 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools virtual character; 2. Sustain an interaction with a user for some time; 3. React appropriately to the user's non-verbal behaviour. Semaine is Open Source software. http://semaine.sourceforge.net/ Dependencies on other technology: ActiveMQ, JAVA, openSMILE 3.4. Audio and speech tools Audio and speech based interaction between man and machine has been an active research area in the field of HCI, of which speech (Rabiner, 1993) and speaker recognition (Beigi, 2011) have been at the forefront for many years. Research in automatic speech recognition has had to deal with robust representation, and the recognition process of the many parameters characterizing a highly variable signal such as speech. Speaker recognition, on the other hand, emphasizes on recognizing persons from the physical characteristics of the sound of their voice, as well as their manner of speaking, such as accent, pronunciation pattern, rhythm, etc. In both cases of speaker enrolment and speaker verification, a person’s speech undergoes a feature extraction process that transforms the raw signal into feature vectors. In the enrolment case, a speaker model is trained using these feature vectors; while in the recognition case, the extracted feature vectors of the unidentified speech are compared to the speaker models in the database, outputting a similarity score. Through advancements in recent years, several methods for representation and feature extraction have been proposed, with the use of Mel-Frequency Cepstral Coefficients (MFCC) (Darch et al. 2008; Davis and Mermelstein, 1980) being more prominently featured among other methods, including perceptual linear prediction coefficients (Hermansky, 1990), normalization via cepstral mean subtraction (Furui, 2001), relative spectral filtering (Hermansky and Morgan, 1994) and Vocal Tract Length Normalization (VTLN) (Eide and Gish, 1996). The predominant framework for speech recognition algorithms uses stochastic processing with HMMS (Gales and Young, 2007), while adaptation to variable conditions (such as different speaker, vocabulary, environment, etc.) has been addressed via maximum a posteriori probability estimation (Gauvain and Lee, 1994), maximum likelihood linear regression (Kim et al. 2010) and eigenvoices (Kuhn et al. 1998). Recent advances in audio and speech interaction however, have turned towards the analysis and recognition of human emotion in audio signals (Oudeyer, 2003; Chen, 2000) and the detection of human produced sound cues (such as laughs, cries, sighs, etc.) (Schröder et al. 
2006) to complement voice pitch and speech data in order to tackle emotion integration in intelligent HCI. Research on audio affect recognition is largely influenced by basic emotion theory, and most existing efforts aim to recognise a subset of basic emotions from the speech signal. Similar to speech recognition methods, most existing speech affect recognition approaches use acoustic features such as MFCCs, with many studies showing that pitch and energy contribute the most to affect recognition. Most methods in the related literature are able to discriminate between positive and negative affective states (Batliner et al. 2003; Kwon et al. 2003; Zhang et al. 2004; Steidl et al. 2005). Several recent efforts have been made towards automatic recognition of non-linguistic vocalizations such as laughter (Truong and van Leeuwen, 2007), coughs (Matos et al. 2006) and cries (Pal et al. 2006), which help improve the accuracy of affective state recognition. Others have tried to interpret speech signals in terms of application-specific affective states, such as deception (Hirschberg et al. 2005; Graciarena et al. 2006), certainty (Liscombe et al. 2005), stress (Kwon et al. 2003) and frustration (Ang et al. 2002). Other approaches in the field of audio interaction have investigated the concept of musical generation and interaction (Lyons et al. 2003). Usefulness to the project Audio and speech tools will provide REVERIE agents with the means to socially interact with users in a natural way. More specifically, REVERIE avatars should be able to comprehend the message the user is communicating, as well as understand the emotional condition of the user, and adapt their own behaviour accordingly (e.g., detecting a laughing sound will indicate a pleasant occasion). Available tools from the literature and REVERIE partners AcquireSpeech Description: AcquireSpeech is a tool that connects the sound input on your computer to a speech recognition server, while providing real-time monitoring, transcripts, and recording, as well as allowing for direct text input and playback of recorded speech samples. It has been designed with a focus on configurability and usability, allowing for different speech recognition systems and usage scenarios. http://vhtoolkit.ict.usc.edu/index.php/AcquireSpeech Dependencies on other technology: PocketSphinx openSMILE Description: The openSMILE feature extraction tool enables you to extract large audio feature spaces in real time. It combines features from Music Information Retrieval and Speech Processing. Speech & Music Interpretation by Large-space Extraction (SMILE) is written in C++ and is available as both a standalone command-line executable and a dynamic library. The main features of openSMILE are its capability for on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple configuration file. New components can be added to openSMILE via an easy binary plugin interface and a comprehensive API. http://sourceforge.net/projects/opensmile/ openEAR Description: openEAR is the Munich Open-Source Emotion and Affect Recognition Toolkit developed at the Technische Universität München (TUM). It provides efficient (audio) feature extraction algorithms implemented in C++, classifiers, and pre-trained models.
http://sourceforge.net/projects/openart/ YAAFE Description: YAAFE means “yet another audio features extractor“; a software designed for efficient computation of many audio features simultaneously. Audio features are usually based on same intermediate representations (FFT, CQT, envelope, etc.), YAAFE automatically organizes computation flow so that these intermediate representations are computed only once. Computations are performed block per block, so YAAFE can analyze arbitrarily long audio files. The YAAFE framework and most of its core feature library are released in source code under the GNU Lesser General Public License (LGPL) and is available online (http://www.tsi.telecom-paristech.fr/aao/en/software-and-database/). Other extraction software also exists. jAudio (McEnnis et al. 2005) is a java-based audio feature extractor library, whose results are written in XML format. Maaate is a C++ toolkit that has been developed to analyze audio in the compressed frequency domain, http://maaate.sourceforge.net/. FEAPI (Lerch et al. 2005) is a plugin API similar to VAMP. MPEG7 also provides Matlab and C code for feature extraction. http://yaafe.sourceforge.net/ DESAM Toolbox Description: The DESAM Toolbox, which draws its name from the collaborative project “Décomposition en Eléments Sonores et Applications Musicales” funded by the French ANR, is a set of Matlab functions dedicated to the estimation of widely used spectral models from, 85 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools potentially musical, audio signals. Although those models can be used in music information retrieval tasks, the core functions of the toolbox do not focus on any application. It is rather aimed at providing a range of state-of-the-art signal processing tools that decompose music files according to different signal models, giving rise to different “mid-level” representations. This toolbox is therefore aimed at the research community interested in the modeling of musical audio signals. The Matlab code is distributed under the GPL and is available online: http://www.tsi.telecom-paristech.fr/aao/en/software-and-database/ 3.5. Emotional Behaviours Computational models of emotional expressions are gaining a growing interest. The models of expressive behaviours are crucial for the believability of virtual characters (Aylett, 2004). Agents portraying emotional behaviours and different emotional strategies such as empathy are perceived as more trustworthy and friendly; users enjoy interacting with them more (Brave et al., 2005; Partala and Surakka, 2004; Prendinger et al., 2005, Ochs et al., 2008). Early computational models followed mainly the discrete emotion approaches that provide concrete predictions on several emotional expressions (Ruttkay et al, 2003). The idea of universality of the most common expressions of emotions was particularly sought after to enable the generation of “well recognizable” facial displays. However easy to categorize in terms of evoked emotions, the expressions based on discrete theory are still oversimplified. One method to enrich the emotional behaviour of a virtual character, while relying on discrete facial expressions, is to introduce blends. In the works of Bui (2004), Niewiadomski and Pelachaud (2007a, 2007b) and Mao et al. (2008), these blend expressions are modeled using fuzzy methods. Several models of emotional behaviour link separate facial actions with some emotional dimensions like valence. 
Interestingly most of them use the PAD model, which is a 3D model defining emotions in terms of pleasure (P), arousal (A) and dominance (D) (Mehrabian, 1980). Among others, Zhang et al. (2007) proposed an approach for the synthesis of facial expressions from PAD values. Another facial expression model based on the Russell and Mehrabian 3D model was proposed by Boukricha et al. (2009). A facial expressions control space is thus constructed with multivariate regressions, which enables the authors to associate a facial expression to each point in the space. A similar method was applied previously by Grammer and Oberzaucher (2006), whose work relies only on the two dimensions of pleasure and arousal. Their model can be used for the creation of facial expressions relying on the action units defined in the FACS (Ekman et al. 2002) and situated in the 2D space. Arya et al. (2009) propose a perceptually valid model for emotion blends. Fuzzy values in the 3D space are used to activate the agent's face. Recently, Stoiber et al. (2009) proposed an interface for the generation of facial expressions of a virtual character. The interface allows one to generate facial expressions of the character using the 2D custom control space. The underlying graphics model is based on the analysis of the deformation of a real human face. Some researchers were inspired by the Componential Process Model (CPM) (Scherer, 2001), which states that different cognitive evaluations of the environment lead to specific facial behaviours. Paleari and Lisetti (2006) and Malatesta et al. (2009) focus on the temporal relations between different facial actions predicted by the Sequential Evaluation Checks (SECs) of the CPM model. Lance and Marsella (2007, 2008) propose a model of gaze shifts towards an arbitrary target in emotional displays. The model presented by Niewiadomski et al. (2011) generates emotional expressions that may be composed of non-verbal behaviours displayed over different modalities, of a sequence of signals or of expressions within one modality that can change dynamically. Signal descriptions are gathered into two sets: the behaviour set and constraint set. Each emotional state has its own behaviour set, which contains signals that might be used by the virtual agent to display that emotion. 86 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Usefulness to the project Computational models of emotional expression provide virtual agents with a large repertoire of multimodal behaviours. Agents can express their emotional states. They can communicate their attitude to others agents or humans. They seem also much more lively and believable. The inclusion of emotional behaviour is a very important task within the REVERIE project. This task will develop a computational model to enable the virtual human or avatar to display a large range of emotional responses, as a reaction to external stimuli. The work will guide the autonomous agents towards a natural and credible pattern of contextual responses, such as a startle reflex, as well as emotion in context such as displaying sadness and joy in response to visual and aural prompts. Focus will typically be on multimodal emotional behaviour including facial expression, body movement and behaviour expressivity. 
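To make the dimensional approach concrete, the following minimal sketch maps a point in PAD space to the nearest of a few discrete emotion prototypes. The prototype coordinates are illustrative placeholders and are not taken from (Mehrabian, 1980) or from any of the models cited above.

```python
# Minimal sketch: mapping a pleasure-arousal-dominance (PAD) point to the
# closest discrete emotion prototype. The prototype coordinates below are
# illustrative assumptions, not values from the cited literature.
import numpy as np

PROTOTYPES = {            # (pleasure, arousal, dominance) in [-1, 1]
    "joy":     ( 0.8,  0.5,  0.4),
    "anger":   (-0.6,  0.6,  0.3),
    "sadness": (-0.6, -0.4, -0.3),
    "fear":    (-0.6,  0.6, -0.4),
    "relief":  ( 0.4, -0.3,  0.2),
}

def closest_emotion(pleasure, arousal, dominance):
    """Return the prototype label nearest to the given PAD point."""
    point = np.array([pleasure, arousal, dominance])
    return min(PROTOTYPES,
               key=lambda name: np.linalg.norm(point - np.array(PROTOTYPES[name])))

print(closest_emotion(0.7, 0.4, 0.3))   # -> "joy"
```

A behaviour planner could also use such a mapping in the opposite direction, selecting a behaviour set (as in Niewiadomski et al. 2011) according to the region of PAD space the agent currently occupies.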
Available tools from the literature and REVERIE partners EmotionML Description: Emotion Markup Language (EmotionML) 1.0, is a markup language designed to be usable in a broad variety of technological contexts while reflecting concepts from the affective sciences. EmotionML allows a technological component to represent and process data, and enables interoperability between different technological components processing the data. It provides a manual annotation of material involving emotionality, automatic recognition of emotions from sensors and generation of emotion-related system responses. The latter may involve reasoning about the emotional implications of events, emotional prosody in synthetic speech, facial expressions and gestures of embodied agents or robots, the choice of music and colours of lighting in a room, etc. http://www.w3.org/TR/emotionml/ SAIBA framework Description: Developed to ease the integration of autonomous agent technologies. Three main processes have been highlighted: Intent Planner, Behaviour Planner and Behaviour Realizer. Two representation languages have been designed to link these processes. FML (Heylen et al, 2008) represents high level information of what the agent aims to achieve: its intention, goals and plans. BML (Vilhjalmsson et al, 2007) describes non-verbal communicative behaviours at a symbolic level. http://wiki.mindmakers.org/projects:saiba:main/ EMA (Gratch & Marsella, 2004) Description: Modelling of the emotion regulation process and of the effects of emotions on the mental and affective state of the agent; a model of the adaptation process identifies the behaviour that a virtual agent should adopt to cope with high intensity emotions. There are different coping strategies. FatiMa (Dias et al, 2011) Description: This is an open-source generic model of emotion. It relies on appraisal theory. It can output discrete emotions from the OCC model as well as continuous ones as in the PAD representation. http://sourceforge.net/projects/fatima-modular/ ALMA (Gebhard, 2005) Description: This model adopts the discrete and continuous representation of emotions: it uses the 24 kinds of emotions from the OCC model. Every emotion has an associated value from the PAD representation. ALMA is freely available. 87 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools http://www.dfki.de/~gebhard/alma/index.html OSSE (Ochs et al, 2009) Description: This model embeds the discrete representation of emotions from the OCC model (joy, hope, disappointment, sadness, fear, relief, pride, admiration, shame and anger). It uses a continuous representation of their intensity. Events are described by triples <agent, action, patient>. Emotions triggered by this event are calculated for each agent. This computation depends on the values of these three elements and the preferences of the agents, OSSE is freely available. https://webia.lip6.fr/svn/OSSE 3.6. Virtual Worlds As for the real world, the bare essence of a virtual world consists of the availability of (1) a virtual landscape or terrain and (2) virtual characters or avatars. Evidently, these bare essences are hardly sufficient to create a plausible immersive virtual environment for obvious reasons that often find their equivalent in the real world. A short list of the most common denominators is provided below. 
Terrain: a complex scene may consist of hills, rocks, water, other
Terrain editor: to allow the user to [graphically] design the terrain
Avatar: usually created when a new user visits the virtual world for the first time
Avatar editor: to customize clothes, skin, hair, facial and body characteristics, other
Avatar animation: walking, flying, dancing, path finding, crowd control, other
Sky: a skybox is often used to encapsulate the complete virtual world
Sky [timeline] editor: to allow the user to [graphically] design the skybox
Scene: the terrain is packed with objects like houses, structures, vegetation, other
Scene/Object editor: to allow the user to [graphically] design or import new objects
Scripting: to automate actions or reactions, to animate objects
Script editor: to allow the user to [graphically] design or import new animations
Physics: basic physics engines will keep avatars from falling through the ground or from running through objects, but many other dedicated engines exist, for example
Water physics: to simulate the behaviour of water
Particle physics: to rupture glass or vegetation, to simulate explosions, other
Lighting: to cast shadows and reflections from one object or avatar to the next using spot lights, environment lights, other
Transparency, mirroring: such features may require special engines
Sound: to support music, talking (man-to-machine)
Video: to support in-world movie playback, multiscopic video, other
Security: authentication, authorization, privacy, ownership
Economy: currency, convertibility to real currency, fair trade, other
Then there is the financial and programming side of virtual worlds, which introduces a number of other factors to take into account.
Shaders [editor]: for HW accelerated graphics using OpenGL, DirectX, other
Networking: to communicate information between the different elements inside the world and in-between different worlds
Licensing: commercial, free, GPL, LGPL, BSD, MIT, CCL, other
Source code availability: open, closed
Platform: Windows, Linux, iOS, Android, OSX, Solaris, Playstation, Xbox, Wii, other
Programming language: C, C++, C#, Java, Flash, Delphi, Python, Ruby, JavaScript, other
Interoperability: import and export capabilities to/from other worlds
Architecture: client-server, peer-to-peer, cloud, other
Deployment: scalability, distributability, reliability, extensibility, maintainability, other
2D: text, fonts, cursors, panels, menus, other
Support: documentation, large and active development community, other
A reasonably good but still basic overview of existing game engines is already available on this page: http://content.gpwiki.org/index.php/Game_Engines. This list of game engines does not, however, include the large number of other virtual worlds that are mostly already online. These kinds of virtual worlds either use one of the game engines listed there or provide their own proprietary or open source implementation. They do not target the fast interaction that is of prime importance in gaming (as in first person shooter games), but rather the social interaction where people can meet up, talk, travel, dance, create, self-enhance, give presentations, make money, do business, earn a reputation, or do other things together. Evidently, the requirements for these kinds of virtual worlds are inherently different.
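As a minimal illustration of how the requirement dimensions listed above could be used to shortlist candidate worlds or engines for a given use case, the sketch below applies weighted scoring; every candidate name, score and weight is a hypothetical placeholder, not an assessment of a real engine.

```python
# Hypothetical weighted scoring of virtual world candidates against the
# requirement dimensions listed above (licensing, scripting, physics, ...).
# All names, scores and weights are illustrative placeholders.
def score(candidate, weights):
    """Weighted sum of per-requirement scores on a 0-5 scale."""
    return sum(weight * candidate.get(req, 0) for req, weight in weights.items())

# A social world weights physics low; a gaming world weights it high.
weights_social = {"licensing": 3, "scripting": 2, "physics": 1, "networking": 3}
weights_gaming = {"licensing": 1, "scripting": 2, "physics": 4, "networking": 2}

candidates = {
    "world_A": {"licensing": 5, "scripting": 3, "physics": 1, "networking": 4},
    "world_B": {"licensing": 2, "scripting": 4, "physics": 5, "networking": 3},
}

for name, features in candidates.items():
    print(name, score(features, weights_social), score(features, weights_gaming))
```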
Particle physics may for instance be a make-or-break requirement for a virtual world dedicated to gaming but may be completely irrelevant for a virtual world dedicated to socializing. The virtual worlds for socializing can arguably be further divided into Metaverses and Mirror Worlds. Well-known Metaverses are for example The Sims, IMVU, Second Life, Blue Mars, Kaneva, HiPiHi, and Active Worlds. A more comprehensive list of Metaverses is available on this page: http://arianeb.com/more3Dworlds.htm. Mirror Worlds, on the other hand, try to simulate the real world. Well-known Mirror Worlds are for example Google [Maps] StreetView and Microsoft [Bing] StreetSide, but there are also less well-known ones such as MapJack, EveryScape and Earthmine. A more comprehensive list of street views is available on this page: http://en.wikipedia.org/wiki/Competition_of_Google_Street_View.
Usefulness to the project
In line with the introduction on the referenced page for game engines, when picking a virtual world, attention should be paid to whether or not it satisfies the needs of the use case. Rather than listing all pros and cons of every virtual world, it is therefore more practical to list the financial, licensing, deployment, technical and other requirements of the partners and the use cases. This list can then be used to start the search for the best suited virtual worlds. In the best case, a single world can be selected for all partners and both use cases. For example, the intelligent agent platform Greta of ParisTech uses OGRE and OpenGL. If the consortium is willing to deeply integrate this technology into a virtual world, the choice of engines is already significantly narrowed to e.g. Axiom, Diamonin, YAKE, OGE, RealmForge, and YVision.
Available tools from the literature and REVERIE partners
Open Source Virtual World Server OpenSimulator
Description: OpenSimulator is an open source multi-platform, multi-user 3D application server. It can be used to create a virtual environment (or world) which can be accessed through a variety of clients, on multiple protocols. OpenSimulator allows virtual world developers to customize their worlds using the technologies they feel work best; the framework has been designed to be easily extensible. OpenSimulator is written in C#, running both on Windows over the .NET Framework and on Unix-like machines over the Mono framework. The source code is released under a BSD License, a commercially friendly license for embedding OpenSimulator in products. OpenSimulator can be regarded as the open source counterpart of Linden Labs' proprietary Second Life server. OpenSimulator was used as the virtual world environment in the ITEA2 Metaverse1 project to implement the PresenceScape conceptual demonstrator.
Second Life also supports virtual currency for in-world commercial activities. A number of external tools can be used to create basic avatar animations that can be imported into the world. Second Life was used as virtual world environment in the ITEA2 Metaverse1 project to implement the Mixed Reality conceptual demonstrator. http://www.secondlife.com Open Source Virtual World Client Hippo Description: The Hippo OpenSimulator client is a modified Second Life client, targeted at OpenSimulator users. The client is written in C++, running on Linux. It allows its users to navigate through and interact with objects and avatars in virtual environments via a Graphical User Interface (GUI). The Hippo OpenSimulator Viewer works seamlessly together with Linden Labs Virtual World Second Life. Hippo was used in the ITEA2 Metaverse1 project as a GUI and to implement stereoscopic (3D) video streaming in the Mixed Reality conceptual demonstrator. http://mjm-labs.com/viewer http://sourceforge.net/projects/opensim-viewer Dependencies on other technologies: Linux OS (or Windows OS using Cygwin) Open Source Virtual World client Metabolt Description: The Metabolt client allows its user to navigate through and interact with objects and avatars in virtual environments via a command line interface. The Metabolt Client works seamlessly together with OpenSimulator and Linden Labs Virtual World Second Life. Metabolt was used in the ITEA2 Metaverse1 project to implement the autonomous agents in the PresenceScape conceptual demonstrator. http://www.metabolt.net Linden Labs Virtual World Client Second Life Description: The Linden Labs Virtual World Client Second Life allows its users to navigate through and interact with objects and avatars in virtual environments via a GUI. The client works seamlessly together with OpenSimulator and Linden Labs Virtual World Second Life. The Second Life client was used in the ITEA2 Metaverse1 project as a GUI for the PresenceScape conceptual demonstrator for virtual camera control. http://www.secondlife.com 3.7. User-system interaction User-system interaction is a topic that is much older than the computer itself. Indeed, depending on the meaning of ‘system’, we can say that an analogue watch also has a user-system interface, more specifically the wristband of the watch, the look-and-feel of the watch, the hands of the watch revealing the current time and the dial buttons to mechanically charge the internal spring or change the hand positions. We can easily go further back in time to the Belgian “Pot van Olen” of the 16 th 90 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools century, the crossbow or catapult of the middle ages or the first Oldowan chopping tool of the Lower Palaeolithic period. For obvious reasons however, we will narrow the scope of the usersystem interaction to computer or computer-related systems. Within this narrowed scope, many books have been written that can arguably be categorized into two main areas. The first one is more user-system related where we can look at the plurality of options with which the user can interact with the system. These options can vary from the usual suspects like keyboard or mouse to the most sophisticated ones like direct brain-computer interfaces. The second area is more interaction related where a given user-system can be applied in different ways to interact with the system. 
The various Windows interface designs have for example evolved from the simple, not very user-friendly design of Windows 1.0 back in 1985 to the more user-friendly Metro design of Windows 8. Since it is at present not clear which interaction-related area or areas are of interest to the project, we will further narrow the scope of the user-system interaction to the user-system related area. The availability of a plurality of devices and related software that can be used by a user to interact with a system has grown significantly since the introduction of the screen as an output device, and the keyboard and mouse as input means. Apple's iPhone for instance kept the screen as an output device but replaced the keyboard and mouse input with just a few buttons, a camera, a microphone, a touch screen, motion sensors, and a GPS, all of which can be used by the user to interact with the system. The output means of the iPhone have been further complemented with a speaker and a vibration motor. Experiments have been done with mice carrying a small device directly connected to the mouse's brain on the one side and wirelessly connected to a water tap on the other side. Eventually, the mice learned to open the tap with their thoughts only. In the following part, a limited list of examples of available types of user-system devices is given. Arguably, the first wave of input devices for the general public consisted of electrical contact-based devices like the keyboard, the mouse, the trackball, the joystick, the gamepad, and the steering wheel. The second wave can be characterized as consisting of more sophisticated devices like GPS, RFIDs, touch pads, touch screens, motion sensors, the user's voice derived from the microphone and the user's pose and gestures derived from the camera. Devices like the iPhone, Wii and Kinect clearly belong to this category, as well as game cards and access badges based on RFID or similar technologies. Probably in the same category is face or facial feature detection, a field of technology that is already incorporated in many domestic photo or film cameras today. Variations on these devices exist, e.g. projections of input capabilities onto ordinary objects, after which video analysis is used to determine the intent of the user. Probably belonging in the same category are eye trackers in combination with face trackers that are able to deduce exactly where the user is looking. Devices in the third wave use even more sophisticated technologies that are sometimes already in use in specialized areas like the army or healthcare, and these are slowly finding their way to the general public. Examples of such devices are [wearable] brain caps using contactless electroencephalogram measurements for the gaming industry, attention trackers that can be used in education, direct nerve interfaces for control over prosthetic limbs, tongue interfaces for people with severe motor disabilities, and all kinds of bio and chemical sensors for lie detectors. Gradual technological advances in the hardware or software technologies used by these user-system interaction devices, as well as the combination of two or more such devices, constantly allow us to improve the control of the user over the system. For instance, advanced facial analysis, optionally combined with more advanced voice analysis, allows us to better understand the emotion, intent or interest of the interacting user.
3.7.1.
User-System Interaction for Virtual Characters Virtual characters are distinct by whether they are intended to be controlled by a human user (often referred to as an avatar), or being autonomous. Avatars can be controlled through invoking simple discrete sets of commands, like “walk forward” or “jump”, through any interface device (such as a keyboard or a mouse). Such control is typical in any modern video game. More sophisticated forms of input are made possible by introducing more diverse input devices, such as analog joysticks, microphones, cameras and other tracking devices. In the latter cases, the avatar is obliged to modify the rig according to the observed movement. Autonomous characters on the other hand rely on sophisticated AI techniques in order to assume control over their behaviour. Nevertheless, autonomous virtual humans (or agents) are usually required to interact with users in a natural and believable way. Communication is handled by a number of uni-modal or multimodal channels, specific to the input device used to communicate signals. Since Audio-based HCI was addressed in Section 3.4, the remainder of this Section will focus on Visual and Sensor based HCI techniques, as well as multimodal approaches, which fuse elements of the above methods for better results. Visual-based HCI is probably the most wide-spread area in the area of man-machine interaction, in which researchers have tried to address the different aspects of human response that can be visually recognized as signals. Such research topics include Facial Expression Analysis (de la Torre and Cohn, 2011), which aims at recognizing emotion information through facial expression display, Gesture recognition (Just and Marcel, 2009; Kirishima et al. 2005) has also provided auxiliary information about the user's emotional state, as well as complementing Body Movement Tracking (Gavrila, 1999; Aggarwal and Cai, 1999) for direct interaction in the context of control, described in the previous paragraph. Gaze tracking and estimation (Sibert and Jacob, 2000) is another indirect form of interaction, suited for recognizing the user’s focus of attention, as well as providing low level direct input (by using an eye-controlled mouse pointer, for example). A third modality for HCI is provided via physical sensors to communicate data between user and machine. Such hardware devices include many sophisticated technologies, such as motion tracking, haptic, pressure and taste/smell sensors. Motion tracking sensors usually consist of wearable clothes and joint sensors which allow computers to track human skeleton joint movements and reproduce the effect on virtual characters. Haptic and pressure sensors are more common in the robotics and virtual reality areas (Robles-De-La-Torre, 2006; Hayward et al. 2004; Iwata, 2003), in which machines are being made aware of contact. Smell and taste sensors also exist, although their applicability has been limited (Legin et al. 2005). Sensors may also however concern simpler and more common devices such as pen-based sensors, keyboard/mouse devices and joysticks. Penbased sensors are of specific interest to mobile devices and are more commonly related to handwriting and pen gesture recognition (Oviatt et al. 2000), while keyboards, mice and joysticks have been around for decades. Multi-modal HCI systems refer to the combination of the aforementioned uni-modal user inputs in order for one modality to assist the other in its shortcomings. 
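A minimal sketch of decision-level (late) fusion, one common way to combine such modalities, is given below: per-modality class posteriors over a shared label set are averaged with reliability weights. The labels, posterior values and weights are illustrative assumptions only, not part of any system described in this section.

```python
# Minimal sketch of decision-level (late) multimodal fusion: per-modality
# posteriors over the same label set are combined with reliability weights.
# Labels, posteriors and weights below are illustrative assumptions.
import numpy as np

LABELS = ["neutral", "happy", "angry"]

def fuse(posteriors, weights):
    """Weighted average of per-modality posterior vectors, renormalized."""
    fused = sum(w * np.asarray(p) for p, w in zip(posteriors, weights))
    return fused / fused.sum()

audio_posterior = [0.2, 0.6, 0.2]   # e.g. from a speech affect classifier
video_posterior = [0.1, 0.3, 0.6]   # e.g. from a facial expression classifier

fused = fuse([audio_posterior, video_posterior], weights=[0.4, 0.6])
print(LABELS[int(np.argmax(fused))], fused)
```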
Known multi-modal methods used in the literature fuse the visual and audio channels to improve recognition rates. For example, lip movement tracking has been shown to assist speech recognition, which in turn has been shown to assist command acquisition in gesture recognition. Applications of these multi-modal systems include smart video conferencing, intelligent homes, driver monitoring, intelligent games, ecommerce and aiding tools for disabled people. Usefulness to the project Understanding the user’s emotion, intent or interest is crucial for the development of an interactive cognitive automated system. In an e-learning setting, the user’s level of attention is most probably a valuable characteristic that may be used to maximize the focus of the user on the subject at hand. For autonomous agents to respond in a cognitive way to user interaction through voice, 92 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools gesture, facial feedback or other means, the user’s intent is evidently a crucial parameter. Not only the intent of a first user may be valuable input but rather the combined input of all users in a certain environment is probably crucial for the autonomous agent in order to be able to understand and adequately respond to group interactions. Group interaction may also be important for an intelligent camera system charged with real-time directing of the recordings of the group. Social translucency may also be improved by incorporating the obtained user information somehow in the rendering chain. Last but not least, the obtained information can be combined with a variety of stored and real-time contextual information. Simple examples of the above could be (1) the handshake of the autonomous agent with the user’s avatar on first encounter and (2) the [mutual] acknowledgment of a second user’s avatar approaching the autonomous agent and the first user’s avatar. REVERIE should provide multiple means of control interaction, with respect to user hardware setup. Avatars should be controllable through minimal input devices, such as the keyboard or mouse, as well as tracking devices such as Microsoft's Kinect sensor. Furthermore, natural social interaction between users and autonomous agents should be supported for increased realism and believability. In the list below, a list of useful user tracking tools and devices is provided. Available tools from the literature and REVERIE partners Watson Description: Watson is a real-time visual feedback recognition library for interactive interfaces that can recognize head gaze, head gestures, eye gaze and eye gestures using the images of a monocular or stereo camera. http://vhtoolkit.ict.usc.edu/index.php/Watson Open Source Natural Interaction Framework OpenNI Description: The OpenNI organization is an industry-led, not-for-profit organization formed to certify and promote the compatibility and interoperability of Natural Interaction (NI) devices, applications and middleware. As a first step towards this goal, the organization has made available an open source framework, the OpenNI framework, which provides an Application Programming Interface (API) for writing applications utilizing NI. This API covers communication with both low level devices (e.g. vision and audio sensors), as well as high-level middleware solutions (e.g. for visual tracking using computer vision). 
http://www.openni.org Dependencies on other technologies: PrimeSense NiTE middleware Open Source NI Middleware PrimeSense NiTE Description: The NI Middleware from PrimeSense is a lot like your brain; what yours does for you, NiTE does for computers and digital devices. It allows them to perceive the world in 3D so they can comprehend, translate and respond to your movements, without any wearable equipment or controls. Including computer vision algorithms, NITE identifies users and tracks their movements, and provides the framework API for implementing NI UI controls based on gestures. Hand Control allows you to control digital devices with your bare hands and as long as you're in control, NITE intelligently ignores what others are doing. Full Body Control lets you have a totally immersive, full-body video game experience; the kind that gets you moving. Being social, NITE middleware supports multiple users, and is designed for all types of action. http://www.primesense.com/Nite Dependencies on other technologies: PrimeSensor Module, Asus Xtion Firmware Seeing Machines Face Tracker Product 93 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Description: faceLAB provides full face and eye tracking capabilities. Its Automatic Initialization feature provides one-click subject calibration, generating data on (1) Eye movement; (2) Head position and rotation; (3) Eyelid aperture; (4) Lip and Eyebrow movement, and (5) Pupil size. faceAPI provides a suite of image-processing modules created specifically for tracking and understanding faces and facial features. These tracking modules are combined into a complete API toolkit that delivers a rich stream of information that can be incorporated into products or services. Seeing Machines faceAPI provides a comprehensive, integrated solution for developing products that leverage real-time face tracking. All image-processing for face tracking is handled internally, removing the need for any computer vision experience. http://www.seeingmachines.com iMotions Attention Tool Eye Tracker Software Description: iMotions Attention Tool is a robust eye tracking software platform for scientific and market research. iMotions technology is proven and has several patents pending on emotion measurements and reading pattern recognition. It allows merging eye tracking data from a diversity of models of eye trackers from Tobii, EyeTech and SensoMotoric Instruments. http://www.imotionsglobal.com Tobii Technology Eye Tracker Products Description: Tobii is the world’s leading vendor of eye tracking and eye control: a technology that makes it possible for computers to know exactly where users are looking. The Tobii eye trackers estimate the point of gaze with extreme accuracy using image sensor technology that finds the user’s eyes and calculates the point of gaze with mathematical algorithms. Tobii has a wide range of eye trackers (X1 Light, T60, T120, X60, and X120) for an equally wide range of applications (assistive technologies, human research, marketing, gaming, other). http://www.tobii.com SensoMotoric Instruments (SMI)’s Eye Tracker Products Description: SensoMotoric Instruments (SMI) is a world leader in dedicated computer vision applications, developing and marketing eye & gaze tracking systems and OEM solutions for a wide range of applications. Founded in 1991 as a spin-off from academic research, SMI was the first company to offer a commercial, vision-based 3D eye tracking solution. 
SMI products combine a maximum of performance and usability with the highest possible quality, resulting in high-value solutions for their customers. Their major fields of expertise are (1) Eye & gaze tracking systems in research and industry, (2) High speed image processing, and (3) Eye tracking and registration solutions in ophthalmology. SMI has a wide range of eye trackers (RED, RED250, RED500, IVIEW X, other) for an equally wide range of applications (assistive technologies, human research, marketing, gaming, other). http://www.smivision.com EyeTech Digital Systems Eye Tracker Products Description: EyeTech Digital Systems designs and develops eye tracking hardware and software since 1996. They provide both off-the-shelf eye tracking systems and a host of customized solutions. Their Quick Glance software enables cursor control using eye tracking and is 32/64bit compatible. It includes software for direct eye-tracking gaze data and third-party benefits such as heat maps, areas of interest, 3D heat maps, landscapes, and focus maps on static content, along with gaze plots on video content. http://www.eyetechds.com MiraMetrix Eye Tracker Product 94 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Description: The MiraMetrix S2 Eye Tracker is an easy to calibrate, tripod mounted, portable eye tracker that comes with a software API and viewer application. The API uses standard TCP/IP for communication and provides XML data output. The viewer application collects eye gaze and other data in real time and provides important analysis tools. It records a video of onscreen activity as subjects are interacting and having their eye movements tracked. A second monitor can be used to show what people are doing on-the-fly. http://mirametrix.com Alea Technologies Eye Tracker Product Description: The IntelliGaze IG-30 system is a commercial European eye tracking system that has been designed from the ground up with an Augmentative and Alternative Communication (AAC) application in mind. The open software architecture of the IntelliGaze system allows easy integration with most specialized communication packages as well as standard Windows applications. http://www.alea-technologies.de Open Source openEyes Eye Tracker Software Description: openEyes is an open-source open-hardware toolkit for low-cost real-time eye tracking. The openEyes toolkit includes algorithms to measure eye movements from digital videos, techniques to calibrate the eye-tracking systems, and example software to facilitate real-time eye-tracking application development. They make use of the Starburst algorithm, which is Matlab software that can be used to measure the user's point of gaze in video recorded from eye trackers that use dark-pupil IR or visible spectrum illumination. They provide the cvEyeTracker which is a real-time eye-tracking application using the Starburst algorithm written in C for use with inexpensive, off-the-shelf hardware. http://thirtysixthspan.com/openEyes Open Source Opengazer Eye Tracker Software Description: Opengazer is an open-source gaze tracker for ordinary webcams that estimates the direction of a user’s gaze. This information can then be passed to other applications. For example, used in conjunction with Dasher, Opengazer allows you to write with your eyes. Opengazer aims to be a low-cost software alternative to commercial hardware-based eye trackers. The latest version of Opengazer is very sensitive to head-motion variations. 
To rectify this problem the open source community is currently focusing on head tracking algorithms to correct head pose variations before inferring the gaze positions. A subproject of Opengazer involves the automatic detection of facial gestures to drive a switch-based program. Three gestures have been trained to generate three possible switch events: a left smile, right smile, and upwards eyebrow movement. All the software is written in C++ and Python. The Opengazer project is supported by Samsung and the Gatsby Foundation and by the European Commission in the context of the AEGIS project (Accessibility Everywhere: Groundwork, Infrastructure, Standards). http://www.inference.phy.cam.ac.uk/opengazer Open Source TrackEye Eye Tracker Software Description: TrackEye is a real-time tracking application of human eyes for Human Computer Interaction (HCI) using a webcam. The application features the following capabilities: (1) realtime face tracking with scale and rotation invariance, (2) tracking the eye areas individually, (3) tracking eye features, (4) eye gaze direction finding, and (5) remote controlling using eye movements. www.codeproject.com/KB/cpp/TrackEye.aspx Dependencies on other technologies: OpenCV Library v3.1 95 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Freer Logic (and Unique Logic and Technology) Attention Tracker Products Description: Freer Logic’s patent pending technology BodyWave, in the form of a sports armband, reads and reacts to brainwaves through the extremities of the body. BodyWave reads brain activity through the human body via a uniquely innovative arm band that houses brainwave sensors that attach to the arm or wrist. BodyWave monitors the brains physiological signals through the body. Dry sensors acquire brain signals and transfer them wirelessly via Bluetooth or WiFi to a mobile device or PC. When BodyWave is used with Freer Logic’s 3D computer simulations, it can teach stress control, increase attention, and facilitate peak mental performance. Freer Logic partners with Unique Logic and Technology provide a multitude of Play Attention applications that can be used for feedback technology, attention training, memory training, cognitive skill training, social skills training, motor skills training, behaviour shaping, and more. http://www.freerlogic.com, http://www.playattention.com 4. Related Tools to WP7: Composition and visualisation The aim of WP7 is to render avatars (use case 1) as well as visually highly realistic 3D representations of humans (use case 2) into a common virtual room. The latter case poses a major challenge. Besides refining already known methods, e.g. image based (single/multi-view with/without depth maps) and geometry based (polygon meshes with/without textures), a major part of this work package is to investigate new approaches. HHI intends to research a hybrid representation of humans for telecommunication scenarios (use case 2 in REVERIE). Image based rendering techniques will be combined with 3D geometry processing thereby mixing the benefits of both approaches. The goal is to achieve high quality human representations at low bit rates and low computational complexity. Such techniques allow real-time video conferencing in 3D with stereoscopic replay, different viewpoints etc. 4.1. Rendering of human characters Rendering of human characters started in the 1970's with the advent of video games. At that time, only very abstract representations, which hardly resembled their real counterparts, were used. 
After the first decade, the 2D figures in computer games became more human-like, with simple characteristic features like facial features, clothes and recognizable extremities. Another decade brought the transition from 2D to 3D and further increased the graphics resolution, which made facial expressions readable. With the new millennium, the richness in detail increased further, to a degree that realistically appearing renderings were achieved, at least when single frames are inspected. Today's renderings of human beings are quite close to photo-realistic appearance.
Figure 46: Depicts (from top to bottom): 1978's Basketball by Atari, 1985's Super Mario Brothers by Nintendo, 1996's Quake by id Software, 2004's Half Life 2 by Valve Software.
The major challenge nowadays lies in the reproduction of biodynamics to achieve natural movement as well as psychologically plausible behaviour. Although modern animations of human characters can be quite close to photo realism, there is the effect called "uncanny valley": human replicas that look and act almost, but not perfectly, like human beings cause a response of revulsion among human observers.
Figure 47: Depicts the game L.A. Noire, which features Depth Analysis's newly developed technology for the film and video game industries called MotionScan, which utilizes 32 cameras to record an actor's every wince, swallow, and blink, which is then transferred to in-game animation.
Inspired by the realistic rendering capabilities of computer animated human characters in video games, a trend is to also use such techniques in the film industry. The first photorealistic computer animated feature film was Final Fantasy: The Spirits Within, released in July 2001. On the one hand, the actors are perfectly designed and look very natural and realistic. On the other, the movements and dynamics are clearly recognizable as computer generated. One of the latest films featuring the most advanced human rendering techniques is The Adventures of Tintin, which was released in October 2011.
Figure 48: Final Fantasy: The Spirits Within (left) and The Adventures of Tintin (right).
From a scientific point of view, realistic rendering of human characters is no longer a pure graphics rendering problem, as today's graphics cards and libraries are capable of rendering very sophisticated 3D models. The realistic appearance depends on the model representing the human character. Since models of realistically appearing human characters are highly complex to achieve, it is no longer common to model humans by hand, but instead to generate them from 3D scans of real persons. For that reason, 3D reconstruction (Section 1.5), Motion Capture (Section 1.2) and Performance Capture (Section 1.3) are sufficient prerequisites to model and animate realistically appearing models of human characters. For the purpose of photo-realistic rendering, markerless methods are preferable, because the captured texture can be used directly for rendering (which might not be needed in every case, e.g. when retargeting the motion commands). That is why this section focuses on a few works which are particularly interesting from the rendering perspective. Due to the enormous complexity of the human body, human motion, and human biomechanics, realistic simulation of human beings remains largely an open problem in the scientific community. It is one of the "holy grails" of computer animation research.
For the moment it looks like 3D computer animation can be divided into two main directions: non-photorealistic and photorealistic rendering, the latter of which can be further subdivided into real and stylized photorealism. In order to achieve realistic rendering of human characters, the following aspects have to be addressed, besides rendering of a suitable geometry and texture:
Light: reflection (direct and subsurface), refraction and shadows, e.g. d'Eon and Irving (2011) and Alexander et al. (2010).
Surface structure: hair (e.g. Koster et al. (2004)), wrinkles, pimples, etc.
Dynamics: changes of color and shape because of movement.
Naturalness: e.g. breathing, pupils, eyelid movement.
Psychology: plausible behaviour.
For simple humanoid models it might be sufficient to design the model directly out of colored geometric primitives. But in order to create realistic models, it is common to base the work on 3D scans of humans. In his PhD thesis (Anguelov, 2005) the author presents a (now well known) chain of algorithms (also published as separate conference papers) to:
Recover an articulated skeleton model as well as a non-rigid deformation model from 3D range scans of a person in different poses (Anguelov et al. 2004a; Anguelov et al. 2004b) in order to interpolate between these poses.
Calculate a direct surface deformation model to adapt surface regions of a mean 3D template model depending on neighboring joint states, and additionally a PCA-based model to inter- and extrapolate between scans of differently shaped people in order to reflect the variability among human shapes (gender, height, weight, ...) (Anguelov et al. 2005).
Figure 49: Shapes of different people in different poses, synthesized from the SCAPE (Shape Completion and Animation of PEople) framework (Anguelov, 2005).
In (Hasler et al. 2009) the authors present a unified model that describes both human pose and body shape, which allows the accurate modeling of muscle deformations not only as a function of pose but also dependent on the physique of the subject. Coupled with the model's ability to generate arbitrary human body shapes, it greatly simplifies the generation of highly realistic character animations. A learning-based approach is trained on 550 full-body 3D laser scans taken of 114 different subjects15. Scan registration is performed using a non-rigid deformation technique. Then, a rotation-invariant encoding of the acquired exemplars permits the computation of a statistical model that simultaneously encodes pose and body shape. Finally, morphing or generating meshes according to several constraints simultaneously can be achieved by training semantically meaningful regressors.
15 Source code as well as data of 3D scans, registrations and the extracted PCA model are available at http://www.mpi-inf.mpg.de/resources/scandb/
Figure 50: The registration pipeline of (Hasler et al. 2009) (from left to right): the template model, the result after pose fitting, the result after non-rigid registration, the transfer of captured surface details, and the original scan annotated with manually selected landmarks.
Figure 51: Derived from a dataset of prototypical 3D scans of faces, the morphable face model contributes to two main steps in face manipulation: (1) deriving a 3D face model from a novel image, and (2) modifying shape and texture in a natural way.
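The statistical body and face models above share a simple core: a new shape is expressed as a mean shape plus a linear combination of principal components learned from registered scans. The sketch below illustrates only that core with random placeholder data (assuming NumPy); it is not the SCAPE, Hasler et al. or morphable face model implementation.

```python
# Minimal sketch of the linear (PCA-based) statistical shape model underlying
# SCAPE-style body models and morphable face models:
#   shape = mean + sum_i coeff_i * component_i
# The data here are random placeholders, not a trained model.
import numpy as np

rng = np.random.default_rng(0)
n_vertices, n_components = 5000, 10

mean_shape = rng.standard_normal((n_vertices, 3))                 # mean mesh (x, y, z per vertex)
components = rng.standard_normal((n_components, n_vertices, 3))   # stand-in for a PCA basis

def synthesize(coeffs):
    """Generate a new mesh from one coefficient per principal component."""
    coeffs = np.asarray(coeffs).reshape(n_components, 1, 1)
    return mean_shape + (coeffs * components).sum(axis=0)

new_shape = synthesize(0.5 * rng.standard_normal(n_components))
print(new_shape.shape)   # (5000, 3)
```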
Figure 52: Matching a morphable model to a single image (1) of a face results in a 3D shape (2) and a texture map estimate. The texture estimate can be improved by additional texture extraction (4). The 3D model is rendered back into the image after changing facial attributes, such as gaining (3) and losing weight (5), frowning (6), or being forced to smile (7). 99 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Since the human visual system is highly optimized to read and interpret facial expressions of people, lots of research is focused on realistic face expression and head pose rendering. A very successful approach is presented in (Blanz and Vetter, 1999). Starting from an example set of 3D face models, a morphable face model is derived by transforming the shape and texture of the examples into a vector space representation. New faces and expressions are modeled by forming linear combinations of the prototypes. Shape and texture constraints derived from the statistics of example faces are used to guide manual modeling or automated matching algorithms. This approach allows 3D face reconstructions from single images and their applications for photorealistic image manipulations. The authors demonstrate face manipulations according to complex parameters such as gender, fullness of a face or its distinctiveness (illustrated in Figures 56 and 57). A different approach is to mix geometry-based and image-based approaches. An augmented reality example worth mentioning is (Eisert and Hilsmann, 2011). The authors present a virtualmirror system for virtual try-on scenarios. A highly realistic visualization is realized by modifying only the relevant regions of the camera images while preserving the rest. The selected piece of cloth is tracked in consecutive frames and exchanged in the output images with a virtual piece of cloth which is adapted to the detected current lighting and deformation conditions. Figure 53: Hierarchical image-based cloth representation (left) and sample input camera images (upper row right), retextured results with modified color and logo (lower row right). In the case of free-view-point video, hybrid representations (image- and geometry-based) have been proven successful, providing user selected views onto an actor with highly realistic visualizations. In (Carranza et al. 2003) such a method is presented. Here, the actor’s silhouettes are extracted from synchronized video frames via background segmentation and then used to determine a sequence of poses for a 3D human body model. By employing multi-view texturing during rendering, time-dependent changes in the body surface are reproduced in high detail. The motion capture subsystem runs offline, is non-intrusive, yields robust motion parameter estimates, and can cope with a broad range of motion. The rendering subsystem runs at real-time frame rates using ubiquitous graphics hardware, yielding a highly naturalistic impression of the actor. The actor can be placed in virtual environments to create composite dynamic scenes. Free-viewpoint video allows the creation of camera fly-throughs or viewing the action interactively from arbitrary perspectives. Figure 54: Novel views onto an actor, generated with (Carranza et al. 2003) from multi-view footage via image-based texturing. 100 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Also noteworthy are the two scientific publications mentioned below in the tools-and-literature section, (Xu et al., 2011) and (Hilsmann and Eisert, 2012). 
Both approaches have in common, that they support the synthesis of motion sequences of an actor after having captured a multi-view multi-pose database of him/her. This approach constitutes the basis of HHIs hybrid rendering research for REVERIE. Usefulness to the project Rendering of human characters is obviously a main interest of the REVERIE project, because its goal is the development of a communication platform where its participants are projected into a common virtual room. For use case 1, simply rendered human representations are sufficient. The goal of use case 2 is to provide photo realistic 3D representations of the users (limited to three for the project). Available tools from the literature and REVERIE partners Video-based Characters - Creating New Human Performances from a Multi-view Video Database (Xu et al., 2011) Description: This method synthesizes plausible video sequences of humans according to userdefined body motions and viewpoints. A small database of multi-view video sequences is captured of an actor performing various basic motions. This database needs to be captured only once and serves as the input to the synthesis algorithm. A marker-less model-based performance capture approach is applied to the entire database to obtain pose and geometry of the actor in each database frame. To create novel video sequences of the actor from the database, a user animates a 3D human skeleton with novel motion and viewpoints. The technique then synthesizes a realistic video sequence of the actor performing the specified motion based only on the initial database. The first key component of this approach is a new efficient retrieval strategy to find appropriate spatio-temporally coherent database frames from which to synthesize target video frames. The second key component is a warping-based texture synthesis approach that uses the retrieved most-similar database frames to synthesize spatio-temporally coherent target video frames. For instance, this enables us to easily create video sequences of actors performing dangerous stunts without them being placed in harms way. It is shown through a variety of videos and a user study that realistic videos of people can be synthesized, even if the target motions and camera views are different from the database content. Figure 55: Animation of actor created from a multi-view video database. The motion was designed by an animator and the camera was tracked from the background with a commercial camera tracker. 101 Image-based Animation of Clothes (Hilsmann and Eisert, 2012) Description: A pose-dependent image-based rendering approach for the visualization of clothes with very high rendering quality. The image-based representation combines body-posedependent geometry and appearance. The geometric model accounts for low-resolution shape adaptation, e.g. animation and/or view interpolation, while small details (e.g. fine wrinkles), as FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools well as complex shading/reflection properties are accounted for through numerous images captured in an offline process. The images contain information on shading, texture distortion and silhouette at fine wrinkles. The image-based representations are estimated in advance from real samples of clothes captured in an offline process, thus shifting computational complexity into the training phase. For rendering, pose dependent geometry and appearance are interpolated and merged from the stored representations. 
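The pose-dependent interpolation step common to (Xu et al., 2011) and (Hilsmann and Eisert, 2012) can be caricatured as follows: given appearance samples captured at known poses, a target pose is rendered by blending the nearest samples with weights that decrease with pose distance. The sketch below illustrates only that retrieval-and-blend step with hypothetical data; the published methods additionally warp the retrieved frames rather than simply averaging them.

```python
# Generic sketch of pose-dependent appearance interpolation: retrieve the
# database samples whose capture poses are closest to the target pose and
# blend them with inverse-distance weights. Poses and images are placeholders.
import numpy as np

rng = np.random.default_rng(1)
db_poses = rng.uniform(size=(20, 15))            # 20 samples, 15-D pose descriptors
db_images = rng.uniform(size=(20, 64, 64, 3))    # matching appearance samples

def interpolate(target_pose, k=3, eps=1e-6):
    """Inverse-distance-weighted blend of the k nearest database samples."""
    dists = np.linalg.norm(db_poses - target_pose, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()
    return np.tensordot(weights, db_images[nearest], axes=1)

frame = interpolate(rng.uniform(size=15))
print(frame.shape)   # (64, 64, 3)
```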
Figure 56: Details of an arm bending sequence interpolating between the left and right most images. The second and third images are synthetically generated in-between poses. Note how the wrinkling behaviour is perceptually correct. Analyzing Facial Expressions for Virtual Conferencing (Eisert and Girod, 1998) Description: A method for the estimation of 3D motion from 2D image sequences showing head and shoulder scenes typical for video telephone and teleconferencing applications. A 3D model specifies the color and shape of the person in the video. Additionally, the model constrains the motion and deformation in the face to a set of facial expressions which are represented by the facial animation parameters defined by the MPEG-4 standard. Using this model, a description of both global and local 3D head motion as a function of the unknown facial parameters is obtained. Combining the 3D information with the optical flow constraint leads to a robust and linear algorithm that estimates the facial animation parameters from two successive frames with low computational complexity. To overcome the restriction of small object motion, which is common to optical flow based approaches, a multi-resolution framework is used. Experimental results on synthetic and real data confirm the applicability of the presented technique and show that image sequences of head and shoulder scenes can be encoded at bit-rates below 0.6 kbit/s. Figure 57: Left: original frames; 2nd column: wireframe of the animated head model; 3rd column: textured model rendered with estimated expression parameters; Right: facial expression applied to a different model. 102 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Geometry-Assisted Image-based Rendering for Facial Analysis and Synthesis (Eisert and Rurainsky, 2006) Description: An image-based method for the tracking and rendering of faces. The algorithm is used in an immersive video conferencing system where multiple participants are placed in a common virtual room. This requires viewpoint modification of dynamic objects. Since hair and uncovered areas are difficult to model by pure 3D geometry-based warping, image-based rendering techniques are added to the system. By interpolating novel views from a 3D image volume, natural looking results can be achieved. The image-based component is embedded into a geometry-based approach in order to limit the number of images that have to be stored initially for interpolation. Also, temporally changing facial features are warped using the approximate geometry information. Both geometry and image cube data are jointly exploited in facial expression analysis and synthesis. Figure 58: Different head positions - all generated from a monocular video sequence. Hair is correctly reproduced if the head is turned. 4.2. Scene recomposition with source separation The aim of this task is to provide 3D audio rendering and compositing tools. The main goal of this task will be to enable the simulation of real or enhanced virtual acoustic environments for multiple sound sources and possibly multiple user locations in the virtual scene. For this task, two main relevant technologies/components are of primary interest and include 1) the scene recomposition or remixing from existing real acoustic scenes, which would imply the 3D audio rendering of imperfect sources obtained from source separation; 2) Flexible 3D audio rendering components of natural sources. 
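As background for the separation methods surveyed in the remainder of this section (e.g. REPET), the sketch below applies a generic soft time-frequency mask to a mixture, given rough magnitude estimates of the background and foreground; it assumes SciPy is available and is not an implementation of REPET or of any other specific method cited here.

```python
# Generic soft time-frequency masking, the basic operation behind many of the
# separation methods discussed in this section. Background/foreground
# magnitude estimates are placeholders; this is not the REPET algorithm.
import numpy as np
from scipy.signal import stft, istft

def soft_mask_separate(mixture, bg_mag, fg_mag, fs=16000, nperseg=1024):
    """Split a mixture into background/foreground given rough magnitude estimates."""
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)
    mask = bg_mag / (bg_mag + fg_mag + 1e-12)        # soft mask in [0, 1]
    _, background = istft(mask * Z, fs=fs, nperseg=nperseg)
    _, foreground = istft((1.0 - mask) * Z, fs=fs, nperseg=nperseg)
    return background, foreground

rng = np.random.default_rng(2)
fs = 16000
x = rng.standard_normal(fs)                          # 1 s of placeholder audio
_, _, Z = stft(x, fs=fs, nperseg=1024)
mag = np.abs(Z)                                      # crude stand-in estimates
background, foreground = soft_mask_separate(x, 0.5 * mag, 0.5 * mag, fs=fs)
```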
A specific challenge is to bridge the gap between complex but high-quality acoustic room simulation methods, based on the use of pre-stored impulse responses of pre-defined rooms/halls, and the less accurate but efficient statistical or parametric room simulation methods.

An extensive literature exists on source separation for scene recomposition. The following four studies represent state-of-the-art work in this domain, and research on audio rendering and composition in REVERIE will build on these prior works:

Figure 59: Beat spectrogram

The separation of the lead vocals from the background accompaniment in audio recordings is a challenging task. Recently, an efficient method called REPET (REpeating Pattern Extraction Technique) has been proposed to extract the repeating background from the non-repeating foreground (Liutkus et al., 2012). While effective on individual sections of an audio document, REPET does not allow for variations in the background (e.g. verse vs. chorus) and is thus limited to short excerpts only. This limitation was overcome and REPET was generalized to permit the processing of complete musical tracks. The proposed algorithm tracks the period of the repeating structure and computes local estimates of the background pattern. For this it uses the beat spectrogram, a 2D representation of the sound that reveals the rhythmic variations over time. Separation is performed by soft time-frequency masking, based on the deviation between the current observation and the estimated background pattern. Evaluation on a dataset of 14 complete tracks shows that this method can perform at least as well as a recent competitive music/voice separation method, while being computationally efficient.

A two-stage blind source separation algorithm for robot audition was developed by Maazaoui et al. (2012). The first stage consists of a fixed beamforming pre-processing to reduce the reverberation and the environmental noise. The manifold of the sensor array was difficult to model due to the presence of the head of the robot, so pre-measured Head-Related Transfer Functions (HRTFs) were used to estimate the beamforming filters. Using the HRTFs to estimate the beamformers allows the effect of the head on the manifold of the microphone array to be captured. The second stage is a blind source separation algorithm based on a sparsity criterion, namely the minimization of the l1 norm of the sources. Different configurations of the algorithm are presented and promising results are shown; in particular, the fixed beamforming pre-processing improves the separation results.

When designing an audio processing system, the target tasks often influence the choice of a data representation or transformation. Low-level time-frequency representations such as the Short-Time Fourier Transform (STFT) are popular because they offer meaningful insight into sound properties at a low computational cost. Conversely, when higher-level semantics such as pitch, timbre or phoneme are sought, representations usually tend to enhance their discriminative characteristics at the expense of their invertibility; they become so-called mid-level representations. In Durrieu et al. (2011), a source/filter signal model which provides a mid-level representation is proposed. This representation makes the pitch content of the signal as well as some timbre information available, hence keeping as much information from the raw data as possible. This model is successfully used within a main melody extraction system and a lead instrument/accompaniment separation system. Both frameworks obtained top results at several international evaluation campaigns.
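A common element of the separation systems above is soft time-frequency masking of an STFT. The following Python sketch illustrates the principle; the running median used to estimate the repeating background is only a crude stand-in for REPET's period tracking via the beat spectrogram, and all parameter values are illustrative.

    import numpy as np
    from scipy.signal import stft, istft

    def soft_mask_separation(x, fs, nperseg=2048):
        # x: mono mixture signal, fs: sampling rate.
        # Returns (background, foreground) estimates via soft TF masking.
        f, t, X = stft(x, fs=fs, nperseg=nperseg)
        mag = np.abs(X)
        bg_mag = np.median(mag, axis=1, keepdims=True)   # crude repeating-background estimate
        bg_mag = np.minimum(bg_mag, mag)                 # background cannot exceed the mixture
        mask = bg_mag / (mag + 1e-12)                    # soft mask in [0, 1]
        _, background = istft(mask * X, fs=fs, nperseg=nperseg)
        _, foreground = istft((1.0 - mask) * X, fs=fs, nperseg=nperseg)
        return background, foreground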
Ozerov et al. (2011) considered the Informed Source Separation (ISS) problem where, given the sources and the mixtures, any kind of side-information can be computed during a so-called encoding stage. This side-information is then used to assist source separation, given the mixtures only, at the so-called decoding stage. State-of-the-art ISS approaches do not really consider ISS as a coding problem and rely on purely source-separation-inspired strategies, leading to performances that can at best reach those of oracle estimators. On the other hand, classical multichannel source coding strategies are not optimal either, since they do not benefit from the availability of the mixture. The authors introduce a general probabilistic framework called Coding-based ISS (CISS) that consists of quantizing the sources using a posterior source distribution of the kind usually used in probabilistic model-based source separation. CISS benefits from both source coding, due to the source quantization, and source separation, due to the use of the posterior distribution that depends on the mixture. Their experiments show that CISS based on a particular model considerably outperforms both the conventional ISS approach and the source coding approach based on the same model.

Usefulness to the project
Simulation of real or enhanced virtual acoustic environments constitutes an important objective of the project. Compatibility with the characteristics of the room acoustics plays a fundamental role in ensuring a good interaction between users and the REVERIE virtual world and hence a plausible virtual immersion. Scene recomposition is one of the two main approaches dedicated to this task. The following section gives an overview of the second approach, namely 3D audio rendering.

Available tools from the literature and REVERIE partners
The tools from the literature presented in the next section, developed by IT/TPT, are all available for the project.

4.3. 3D audio rendering of natural sources
There is an extensive literature in this area; a subset of particularly interesting studies, covering both statistically and physically based approaches, has been selected and included below.

The plane wave decomposition is an efficient analysis tool for multidimensional fields, particularly well suited to the description of sound fields, whether continuous or discrete, obtained by a microphone array. A beamforming algorithm to estimate the plane wave decomposition of the initial sound field was developed by Guillaume and Grenier (2007). The algorithm aims to derive a spatial filter which preserves only the sound field component coming from a single direction and rejects the others. The originality of the approach is that the criterion uses a continuous instead of a discrete set of incidence directions to derive the tap vector. A spatial filter bank is then used to perform a global analysis of sound fields. The efficiency of the approach and its robustness to sensor noise and position errors are demonstrated through simulations.
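For orientation, the sketch below implements a classical delay-and-sum beamformer steered towards a single plane-wave direction; it is a simple baseline for the kind of spatial filtering discussed above, not the optimized continuous-direction criterion of Guillaume and Grenier (2007). The array geometry, the signal layout and the sign convention of the steering direction are assumptions of this sketch.

    import numpy as np

    def delay_and_sum(frames, mic_positions, propagation_dir, fs, c=343.0):
        # frames          : (M, N) array, one row of N samples per microphone
        # mic_positions   : (M, 3) microphone coordinates in metres
        # propagation_dir : (3,) unit vector along which the plane wave travels
        # Returns the (N,) beamformed signal; delays are compensated as phase
        # shifts in the frequency domain, so fractional delays are handled.
        M, N = frames.shape
        delays = mic_positions @ propagation_dir / c              # per-mic delay (s)
        freqs = np.fft.rfftfreq(N, d=1.0 / fs)
        spectra = np.fft.rfft(frames, axis=1)
        steering = np.exp(2j * np.pi * np.outer(delays, freqs))   # undo the delays
        return np.fft.irfft((spectra * steering).mean(axis=0), n=N)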
Head-Related Impulse Response (HRIR) measurement systems are quite complex and require long acquisition times for an accurate sampling of the full 3D space. Therefore HRIR customization has become an important research topic. In HRIR customization, some parameters (generally anthropometric measurements) are obtained from new listeners and ad-hoc HRIRs can be retrieved from them. Another way to obtain new listeners' parameters is to measure a subset of the full 3D space of HRIRs and extrapolate them in order to obtain a full 3D database. Such a partial acquisition system should be rapid and accurate. Fontana et al. (2006) present a system which allows for rapid acquisition and equalization of HRIRs for a subset of the 3D grid. A technique to carry out HRIR customization based on the measured HRIRs is described.

Grenier and Guillaume (2006) described array processing to improve the quality of sound field analysis, which aims to extract spatial properties of a sound field. In this domain, spatial aliasing inevitably occurs due to the finite number of microphones used in the array. It is linked to the Fourier transform of the discrete analysis window, which consists of a mainlobe, fixing the resolution achievable by the spatial analysis, and sidelobes, which degrade the quality of the spatial analysis by introducing artifacts not present in the original sound field. A method to design an optimal analysis window with respect to a particular wave vector is presented, aiming to achieve the best possible localization in the wave vector domain. The efficiency of the approach is then demonstrated for several geometrical configurations of the microphone array, over the whole bandwidth of sound fields.

Moglie and Primiani (2011) describe the development and testing of a Finite-Difference Time-Domain (FDTD) code to simulate a whole Reverberation Chamber (RC). In order to reduce computational load, some approximations were introduced. The results were validated against experimental results measured in an RC; simulated and measured results were compared using the same statistical software. In addition, the computations easily provide results that cannot be obtained by measurement, such as those regarding the field distribution inside the cavity. The developed FDTD code is able to simulate the statistical properties of an RC as a function of its dimensions and stirrer geometry. Many numerical techniques have been proposed to simulate RCs, and every method requires very large computing resources if a full 3D simulation is performed. The developed FDTD code is able to simulate different geometries and movements of the stirrer(s), allowing the designer to obtain the best configuration using the simulator and saving time on experimental tests. Simulations complement experimental measurements when long measurement times or destructive tests would be required.
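To give a flavour of this family of time-domain simulations, the following minimal 2D FDTD sketch advances the scalar wave equation with the standard leapfrog update. It is a generic illustration only (grid size, source and boundary handling are arbitrary choices) and not the code of Moglie and Primiani (2011).

    import numpy as np

    def fdtd_2d(nx=200, ny=200, steps=400, c=343.0, dx=0.05, f_src=500.0):
        # Leapfrog update of the 2D scalar wave equation on a regular grid.
        # np.roll gives periodic boundaries, kept here only for brevity; a room
        # simulation would need proper reflecting/absorbing boundary conditions.
        dt = dx / (c * np.sqrt(2.0))          # CFL-stable time step in 2D
        coef = (c * dt / dx) ** 2
        p_prev = np.zeros((nx, ny))
        p = np.zeros((nx, ny))
        for n in range(steps):
            lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
                   np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)
            p_next = 2.0 * p - p_prev + coef * lap
            p_next[nx // 2, ny // 2] += np.sin(2 * np.pi * f_src * n * dt)  # point source
            p_prev, p = p, p_next
        return p                               # snapshot of the pressure field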
Sound rendering is analogous to graphics rendering when creating virtual auditory environments. In graphics, we can create images by calculating the distribution of light within a modelled environment; illumination methods such as ray tracing and radiosity are based on the physics of light propagation and reflection. Similarly, sound rendering is based on the physical laws of sound propagation and reflection. Lokki et al. (2002) clarify real-time sound rendering techniques by comparing them to visual image rendering. They also describe how to perform sound rendering based on knowledge of the sound source and listener locations, the radiation characteristics of the sound sources, the geometry of the 3D models and material absorption data, i.e. data congruent with that used for graphics rendering. In several instances, the authors use the Digital Interactive Virtual Acoustics auralization system, developed at the Helsinki University of Technology, as a practical example to illustrate a concept. In the context of sound rendering, the term auralization ('making audible') is the counterpart of visualization. Applications of sound rendering range from film effects, computer games and other multimedia content to enhancing experiences in Virtual Reality (VR).

Raghuvanshi et al. (2010) describe a method for real-time sound propagation that captures all wave effects, including diffraction and reverberation, for multiple moving sources and a moving listener in a complex, static 3D scene. It performs an offline wave-based numerical simulation over the scene and extracts perceptually salient information. To obtain a compact representation, the scene's acoustic response is broken into two phases, Early Reflections (ER) and Late Reverberation (LR), based on a threshold on the temporal density of arriving sound peaks. The LR representation is computed and stored once per room in the scene, while the ER accounts for more detailed spatial variation by recording multiple simulations over a uniform grid of source locations. ER data is then compactly stored at each source/receiver point pair as a set of peak delays/amplitudes and a residual frequency response sampled in octave bands. An efficient, real-time technique that uses this precomputed representation to perform binaural sound rendering based on frequency-domain convolutions is described. They also introduce a new technique to perform artifact-free spatial interpolation of the ER data. The system demonstrates realistic, wave-based acoustic effects in real time, including diffraction low-passing behind obstructions, hollow reverberation in empty rooms, sound diffusion in fully furnished rooms, and realistic LR.

Mehra et al. (2012) describe an efficient algorithm for a time-domain solution of the acoustic wave equation for the purpose of room acoustics. It is based on an adaptive rectangular decomposition of the scene and uses analytical solutions within the partitions, relying on a spatially invariant speed of sound. The technique is suitable for auralizations and sound field visualizations, even on coarse meshes approaching the Nyquist limit. It is demonstrated that by carefully mapping all components of the algorithm to the parallel processing capabilities of Graphics Processing Units (GPUs), a significant performance improvement is gained over the corresponding Central Processing Unit (CPU)-based solver while maintaining numerical accuracy. A substantial performance gain over a high-order finite-difference time-domain method is also observed. Using this technique, a 1 s simulation on a scene with an air volume of 7,500 m³ can be performed up to 1,650 Hz within 18 minutes, compared to around 5 hours for the corresponding CPU-based solver and up to three weeks for a high-order finite-difference time-domain solver on a desktop computer. To the best of the authors' knowledge, this is the fastest time-domain solver for modelling the room acoustics of large, complex-shaped 3D scenes that generates accurate results for both auralization and visualization.
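At the core of such auralization pipelines is the convolution of a dry source signal with (binaural) impulse responses. The sketch below shows this step using FFT-based convolution; the impulse responses, the normalisation and the function name are illustrative assumptions, not part of the cited systems.

    import numpy as np
    from scipy.signal import fftconvolve

    def auralize(dry_signal, ir_left, ir_right):
        # Convolve an anechoic source with left/right impulse responses
        # (e.g. a room response combined with HRIRs) to obtain a binaural
        # signal; fftconvolve performs the convolution in the frequency
        # domain, the usual choice for long room impulse responses.
        left = fftconvolve(dry_signal, ir_left)
        right = fftconvolve(dry_signal, ir_right)
        out = np.stack([left, right], axis=1)
        return out / (np.max(np.abs(out)) + 1e-12)   # normalise to unit peak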
Usefulness to the project
Like scene recomposition, 3D audio rendering is of real importance in the overall process of realistic rendering and 3D immersion.

Available tools from the literature and REVERIE partners
Romeo-HRTF
Description: Romeo-HRTF is an HRTF database. Here the usual concept of binaural HRTF is extended to the context of the audition of a humanoid robot whose head is equipped with an array of microphones. A first version of the HRTF database, based upon a dummy that simulates the robot (called Theo), is proposed. A second version of the database is based upon the head and torso of the prototype robot Romeo. The corresponding HRIRs are also provided. The heads of Theo and Romeo are equipped with 16 microphones, and the databases are recorded for 72 azimuth angles and 7 elevation angles. The typical use of this database is for the development of algorithms for robot audition.
http://www.tsi.telecom-paristech.fr/aao/en/2011/03/31/romeo-hrtf-a-multimicrophone-headrelated-transfer-function-database/

4.4. Composition and synchronization
Composition and synchronization deals with the live management of the data streams that have to be passed to the rendering facilities of the REVERIE platform.

A consistent final view of the user on the virtual room of REVERIE is the result of merging different data streams of object representations. REVERIE is characterized by its requirement to be able to model a scene made up of objects with vastly different representations. Examples of representations suited to REVERIE are image- or video-based representations that can be combined with depth maps, point-cloud representations, and polygon meshes that can be textured and optionally may have depth maps. In the virtual REVERIE world we may encounter avatars, highly realistic representations of humans, and objects that represent the walls, doors and windows of the virtual room and the home furnishings and props. All objects that act in the virtual world have to be positioned, and field-of-view-related object data of the appropriate level of detail have to be collected and transmitted to the appropriate renderers.

The representation of human characters can be animated, controlled by a limited set of parameters. This can be done, for instance, by deformation of the geometrical model through a rig applied to the polygonal mesh, or by control of the joints of a skeleton. As a result of this parameterization, the amount of data needed to advance to the next frame of an animation is reduced considerably. Examples of such parameterizations are the Facial Animation Parameters (FAP) defined by the MPEG-4 standard and H-Anim, a VRML-based standard for virtual human representation. More formats are mentioned in Section 3.1. An even higher level of control, affecting behaviour via the emotional state of the character, is possible using the markup languages mentioned in Section 3.2. The importance of these techniques lies in the reduction in bit rate needed for scene updates, which is essential for the efficiency and real-time characteristics required by the REVERIE framework.
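A minimal sketch of how such a parameterization keeps the per-frame data small is linear blend skinning, where the mesh is sent once and only a handful of bone transforms has to be streamed each frame. The data layout and names below are assumptions for illustration; REVERIE's actual animation formats (FAP, H-Anim, etc.) differ.

    import numpy as np

    def linear_blend_skinning(rest_vertices, weights, bone_transforms):
        # rest_vertices   : (V, 3) vertex positions of the rest-pose mesh (sent once)
        # weights         : (V, B) skinning weights, each row summing to one
        # bone_transforms : (B, 4, 4) homogeneous bone transforms for this frame,
        #                   the only data that needs to be streamed per frame
        V = rest_vertices.shape[0]
        homo = np.hstack([rest_vertices, np.ones((V, 1))])            # (V, 4)
        per_bone = np.einsum('bij,vj->bvi', bone_transforms, homo)    # (B, V, 4)
        deformed = np.einsum('vb,bvi->vi', weights, per_bone)         # blend per vertex
        return deformed[:, :3]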
Each of the types of data that specify the REVERIE virtual world, from the low-level geometry-based representations up to the level of parameters of the virtual human representation, will have a type-specific renderer associated with it. These renderers will process incoming data streams in parallel to produce the image data from which the final image will be composited. To make sure that the renderers produce output synchronously, the incoming network streams need synchronization control. Synchronization control needs to happen at the receiver side, and in some cases at the server side, because packets on the Internet can arrive late and out of order. In multimedia applications, proper temporal synchronization of media streams is essential for the user experience, and it is a well-developed research field. Task T5.6 provides network support for synchronization by adding proper timestamps and sequence numbers and by providing an architecture for exchanging control messages. The aim of T7.5 is to use this information to achieve synchronized rendering at the clients. First, we define the types of synchronization operations/goals that need to be performed at the renderer:

Intra-stream synchronization: proper re-ordering within the stream of a single media object. Lost, delayed and out-of-order packets need to be dealt with to maintain the original timing structure of the single media stream.

Inter-stream synchronization: maintaining the temporal relations between different streams, usually audio and video. In general, one media stream is chosen as the master and the other streams are forced to synchronize to its timeline. In the common case of A/V transmission the audio is often selected as the master, and when video frames are lost or damaged they are concealed by skipping, pausing or duplicating to maintain temporal synchronization. A comprehensive survey covering both the requirements for inter- and intra-media synchronization and the common synchronization approaches is presented by Blakowski and Steinmetz (1996). In video conferencing and more interactive distributed media applications, inter-destination synchronization also becomes relevant.

Inter-destination media synchronization (IDMS): aims for equal output timing of the same stream (presentation) at different receivers. It is beneficial for achieving fairness in competitive scenarios, such as a video-conferencing quiz. It can also be useful when users are watching the same video content while talking over an audio connection, so that the video content remains synchronized. The use of IDMS for a shared media application is described by Boronat, Mekuria, Montagud and Cesar (2012).

Inter-sender media synchronization: streams, or groups of streams, from different senders should also be rendered synchronously relative to the times at which they were generated at the different senders. This is needed to achieve fairness in some scenarios; for example, in a quiz where two users have to answer to a third participant, it would be unfair if this third participant rendered the streams from the first user earlier. Inter-sender synchronization therefore needs to be taken into account in REVERIE.
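Intra-stream synchronization is typically realised at the receiver with a jitter buffer that re-orders packets by sequence number and releases them according to their timestamps. The Python sketch below is a toy version of such a buffer; the class name, fields and the fixed buffering delay are assumptions for illustration.

    import heapq

    class JitterBuffer:
        # Re-orders packets by sequence number and releases them once their
        # presentation time (timestamp + buffering delay) has been reached.
        def __init__(self, delay=0.1):
            self.delay = delay        # buffering delay in seconds
            self.heap = []            # entries: (sequence_number, timestamp, payload)

        def push(self, seq, timestamp, payload):
            heapq.heappush(self.heap, (seq, timestamp, payload))

        def pop_ready(self, now, last_seq):
            # Return the packets that are due for play-out at wall-clock time `now`.
            ready = []
            while self.heap and self.heap[0][1] + self.delay <= now:
                seq, ts, payload = heapq.heappop(self.heap)
                if seq <= last_seq:
                    continue          # duplicate or too-late packet: discard
                ready.append((seq, ts, payload))
                last_seq = seq
            return ready, last_seq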
Synchronization and 3D Tele-Immersion in the REVERIE Context
Huang et al. (2011) summarized the different types of synchronization relevant for 3D tele-immersion. Inter-stream synchronization, which often refers to synchronization between audio and video streams, can in tele-immersion refer to multiple different video/audio streams (a bundle of streams). In that study, network-based support for synchronization was provided, whereas this task focuses on the client/receiver. The transmission properties of bundles of streams were investigated by Agarwal et al. (2010), which showed that when the number of streams in a bundle increases, the overall inter-stream skew (the difference between the earliest and latest arrival) increases. As these streams are similar and represent different angles of the same scene, it is important that they are synchronized. The basic buffering technique requires large buffering times, which is undesirable given the interactivity requirements of 3D tele-immersion. For these reasons, inter-stream synchronization is a major challenge in 3D tele-immersion, and in the REVERIE framework we have to develop algorithms to overcome this challenge and render multiple streams synchronously at the client without introducing large delays.

These types of synchronization, required in most multimedia applications, are relevant to REVERIE. T5.6 sets the first step and provides some network/sender-based support for synchronization. However, to achieve these goals at the receiver, the renderer needs to apply control techniques to the media streams. Such renderer adaptations are similar for the different types of media synchronization. A (not mutually exclusive) classification of control techniques, based on the work of Ishibashi and Tasaka (2000), is given below.

A basic technique of client-side synchronization is buffering: based on timestamps and sequence numbers, the effects of jitter and delay can be smoothed out. Preventive control techniques are applied when asynchrony (e.g. due to buffer underflow) is about to occur, but before it actually occurs: the play-out rate can be decreased, preventive pauses can be introduced, and the size of the buffers can be increased. Reactive control techniques are employed to recover after asynchrony has occurred: skips or pauses can be used to recover, or the play-out rate can be increased or decreased. Another technique is the use of a virtual time base that can be expanded or contracted according to the synchronization need. A further option is master/slave switching, i.e. switching the master role to the stream lagging behind may improve overall synchronization. Another simple and frequently used technique is discarding late events (dropping media with too much delay). Common control techniques can be used for both preventive and reactive control, such as play-out rate adaptation and data interpolation (estimation of missing frames from correctly received frames).
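To make these control categories concrete, the toy controller below maps buffer occupancy and the measured skew to the master stream onto one of the actions discussed above. All thresholds are illustrative assumptions; the 80 ms bound merely echoes the audio/video skew limit reported by Blakowski and Steinmetz (1996).

    def playout_control(buffer_ms, skew_ms,
                        low_water=40.0, high_water=200.0, max_skew=80.0):
        # buffer_ms: current buffer occupancy of the stream in milliseconds
        # skew_ms  : measured skew w.r.t. the master stream (positive = lagging)
        # Returns the name of the control action to apply.
        if buffer_ms < low_water:
            return "slow_down_playout"     # preventive: avoid buffer underflow
        if buffer_ms > high_water:
            return "speed_up_playout"      # preventive: avoid excessive delay
        if skew_ms > max_skew:
            return "skip_frames"           # reactive: catch up with the master
        if skew_ms < -max_skew:
            return "pause_or_duplicate"    # reactive: wait for the master
        return "normal_playout"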
Table 2: Client-based synchronization control techniques
Basic control: buffering techniques.
Preventive control: preventive skips of MDUs (eliminations or discarding) and/or preventive pauses of MDUs (repetitions, insertions or stops); change of the buffering waiting time of the MDUs.
Reactive control: reactive skips (eliminations or discarding) and/or reactive pauses (repetitions, insertions or stops); play-out duration extensions or reductions (play-out rate adjustments); use of a virtual time with contractions or expansions (VTR); master/slave scheme switching; late event discarding (event-based).
Common control: adjustment of the play-out rate; data interpolation.

Combinations of client-based techniques and server/network-based techniques are referred to as synchronization algorithms and are further described in Ishibashi and Tasaka (2000).

Quality of Experience for synchronization: the study by Blakowski and Steinmetz (1996) presented the perceptual aspects and requirements for media synchronization. For example, for audio/video synchronization a maximum skew of approximately 80 ms is allowed, with audio ahead of video being easier to notice. The values presented in that study are generally taken into account in the development of synchronization for multimedia systems. To test the effect of inter-destination synchronization schemes in networked games with virtual avatars and videoconferencing, Ishibashi, Nagasaka and Noriyuki (2006) conducted experiments which showed that a difference in latency between the questioners and the examiner of more than 300 ms leads to perceived unfairness between participants. The simulated game was a three-way conference in which two questioners tried to answer simple questions posed by a third participant, the examiner; the person who raised their hand first was given the turn to answer. As can be expected, when the link between the examiner and a questioner introduced more latency, unfairness was introduced. Hosoya, Ishibashi, Sugawara and Psannis (2008) then conducted similar tests while applying group synchronization (inter-destination media synchronization) and showed that it improves fairness.

Usefulness to the project
In REVERIE, the scene representation of the virtual 3D environment is characterized by different types of objects (realistic representations of humans, avatars, and the virtual objects of the environment in which these humanoids act), different types of object representations (image- or video-based, point clouds, textured polygon meshes, etc.) and different temporal characteristics of the entities (dynamic, static). To be able to present a consistent view of this complex scenery, the visuals of all these types of objects (and any audio involved) have to be managed and synchronized. The humanoids that act in the scenery may navigate independently, controlled by individual users of the REVERIE system. Their position and gaze influence the view presented to the user 'behind' that humanoid, which should be consistent with the view that other participants have. Composition and synchronization are therefore essential for the user experience: the feeling of being immersed in the dream world of REVERIE.
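As a final illustration for this section, the sketch below shows the core step of a master-driven inter-destination synchronization scheme: receivers report their play-out delays and the slowest one defines the common reference. The function and field names are hypothetical and the scheme is deliberately simplified.

    def idms_reference_playout(reported_delays_ms):
        # Each receiver reports its current end-to-end play-out delay (ms); the
        # largest delay becomes the common reference, and every other receiver
        # is told how much extra buffering to add so that the same frame is
        # presented at roughly the same wall-clock time everywhere.
        reference = max(reported_delays_ms.values())
        return {rid: reference - d for rid, d in reported_delays_ms.items()}

    # Example: receiver 'B' is the slowest, so 'A' and 'C' buffer a little longer.
    print(idms_reference_playout({"A": 120.0, "B": 180.0, "C": 150.0}))
    # -> {'A': 60.0, 'B': 0.0, 'C': 30.0}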
Available tools from the literature and REVERIE partners
SMIL 3.0
Description: The Synchronized Multimedia Integration Language (SMIL) is a language for choreographing multimedia presentations that combine audio, video, text and graphics in real time.
Dependencies on other technology: XML

Ambulant
Description: The AMBULANT Open SMIL Player is an open-source media player with support for SMIL 3.0. AMBULANT is a player upon which higher-level system solutions for authoring and content integration can be built, or within which new or extended support for networking and media transport components can be added. The AMBULANT player may also be used as a complete, multi-platform media player for applications that do not need support for closed, proprietary media formats. It is focused on distributed synchronization; modifications are possible in order to let it act as a signalling component.
Dependencies on other technology: SMIL

Videolat (Video latency)
Description: Videolat is a standalone tool that generates bar codes representing the current time. It serves to measure rendering times so that the latencies of components can be analyzed.
http://sourceforge.net/projects/videolat/

Open Source Multimedia Framework GStreamer
Description: GStreamer is a framework for constructing graphs of media-handling components. The applications it supports range from simple Ogg/Vorbis playback and audio/video streaming to complex audio (mixing) and video (non-linear editing) processing. Applications can take advantage of advances in codec and filter technology transparently. Developers can add new codecs and filters by writing a simple plug-in with a clean, generic interface. GStreamer is released under the LGPL; the 0.10 series is API and ABI stable. GStreamer was used in the FP7 TA2 project for 2D video composition.
http://gstreamer.freedesktop.org

GPAC
Description: GPAC is an open-source multimedia framework for research and academic purposes. The project covers different aspects of multimedia, with a focus on presentation technologies (graphics, animation and interactivity).

MPEG-4
Description: MPEG-4 is a standard defining the compression of digital audio-visual data. The use of MPEG-4 in REVERIE will primarily be for the compression of AV data. MPEG-4 has VRML support for 3D rendering, object-oriented composite files (including audio, video and VRML objects) and support for various types of interactivity. The standard includes the concept of "profiles" and "levels", allowing a specific set of capabilities to be defined in a manner appropriate for a subset of applications. Apart from the efficient coding, the ability to encode mixed media data (video, audio, speech etc.) and the ability to interact with the audio-visual scene generated at the receiver may be of interest for the REVERIE framework.

In REVERIE we will develop a media client that can synchronize networked streams of different media content, comprising both modelled data and full video/audio clips. Achieving synchronization between these different media types is the first challenge in T7.5. Moreover, inter-stream skew becomes larger when the number of streams increases, increasing the need for inter-stream synchronization; developing suitable inter-stream synchronization based on both network- and client-side techniques is the second challenge. Furthermore, techniques for inter-destination and inter-sender synchronization are useful to improve the Quality of Experience and provide consistency and fairness. The deployment of such synchronization in REVERIE is desirable. Overall, the Quality of Experience of interactive 3D multi-stream video is not well understood and can be studied based on the renderer developed in task 7.5.
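A minimal GStreamer example (Python bindings for the 1.0 API, shown here purely for illustration; the 0.10 series mentioned above has a slightly different API) indicates how such a framework keeps parallel audio and video branches synchronized: each sink schedules its buffers against the shared pipeline clock using the buffer timestamps. The test sources and sinks used below are generic elements, not REVERIE components.

    import gi
    gi.require_version('Gst', '1.0')
    from gi.repository import Gst

    Gst.init(None)

    # Two parallel live branches in one pipeline; the sinks render their
    # buffers against the shared pipeline clock, which provides the basic
    # intra- and inter-stream synchronization.
    pipeline = Gst.parse_launch(
        "videotestsrc is-live=true ! videoconvert ! autovideosink "
        "audiotestsrc is-live=true ! audioconvert ! autoaudiosink"
    )

    pipeline.set_state(Gst.State.PLAYING)
    bus = pipeline.get_bus()
    bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                           Gst.MessageType.ERROR | Gst.MessageType.EOS)  # run until error/EOS
    pipeline.set_state(Gst.State.NULL)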
4.5. Stereoscopic and autostereoscopic display
Visualization on stereoscopic and autostereoscopic displays is crucial for the immersion of participants into the REVERIE system. Seamless integration of such 3D presentation into the users' real environment, and interaction between elements of both, is largely unresolved. 3D displays have serious limitations regarding depth range, perception, and negative impact on the user. This interplay between technology, perception, and the envisaged application scenarios creates challenging research questions to be solved in the REVERIE project.

Usefulness to the project
Extensions of the following algorithms will ensure that pleasant tele-immersion is achieved through 3D presentation that satisfies bounds on perceptual comfort and integrates the real environment with the displayed scenery in an optimal and interactive way.

Available tools from the literature and REVERIE partners
Algorithms for nonlinear disparity mapping and rendering for stereoscopic 3D (Lang et al. 2010; Oskam et al. 2011)
Description: These algorithms allow automatic adaptation of stereo 3D content to a particular display size and user settings. Disturbing errors such as window violations can be corrected. Corresponding rendering software will be developed.

Algorithms for optimum depth range adaptation on autostereoscopic displays (Smolic et al. 2011; Zwicker et al. 2007; Smolic et al. 2008b; Farre et al. 2011)
Description: The usable depth range on autostereoscopic displays is even more limited than on glasses-based systems. Content has to be adapted to avoid aliasing and blur. Display properties will be evaluated on a theoretical basis and optimum rendering software will be developed.

5. References
Agarwal, P., Rivas, R., Wu, W., Arefin, A., Huang, Z. and Nahrstedt, K., 2011. SAS kernel: Streaming as a service kernel for correlated multi-streaming. Proceedings of the 21st International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV '11), pp. 81—86.
Agarwal, P., Toledano, R.R., Wu, W., Nahrstedt, K. and Arefin, A., 2010. Bundle of streams: Concept and evaluation in distributed interactive multimedia environments. IEEE International Symposium on Multimedia (IEEE ISM'10), pp. 25—32.
Agarwal, S., Snavely, N., Seitz, S. and Szeliski, R., 2009a. Bundle adjustment in the large. Computer Vision (ICCV), pp. 29—42.
Agarwal, S., Snavely, N., Simon, I., Seitz, S. and Szeliski, R., 2009b. Building Rome in a day. Computer Vision (ICCV), pp. 72—79.
Aggarwal, J.K. and Cai, Q., 1999. Human motion analysis: A review. Computer Vision and Image Understanding, 73(3), pp. 428—440.
Aggarwal, J.K. and Ryoo, M.S., 2011. Human activity analysis: A review. ACM Computing Surveys, 42(3), pp. 90—102.
Alexander, O., Rogers, M., Lambeth, W., Chiang, J., Ma, W., Wang, C. and Debevec, P., 2010. The digital Emily project: Achieving a photoreal digital actor. IEEE Computer Graphics and Applications, 30, pp. 20—31.
Alighanbari, M. and How, J.P., 2006. An unbiased Kalman consensus algorithm. American Control Conference, pp. 3519—3524.
Alonso, M., Richard, G. and David, B., 2007. Accurate tempo estimation based on harmonic + noise decomposition. EURASIP Journal on Applied Signal Processing, (1), p. 161.
Ang, J., Dhillon, R., Krupski, A., Shriberg, E., and Stolcke, A., 2002. Prosody-based automatic detection of annoyance and frustration in human-computer dialog. Proceedings of the 7th Int’l Conference on Spoken Language Processing (ICSLP). Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Pang, H. and Davis, J. 2004a. The correlated correspondence algorithm for unsupervised registration of non-rigid surfaces. 18th Neural Information Processing Systems Conference (NIPS), 17, pp. 33—40. Anguelov, D., Koller, D., Pang, H., Srinivasan, P. and Thrun, S., 2004b. Recovering articulated object models from 3D range data. 20th Uncertainty in Artificial Intelligence (UAI) Conference, pp. 18—26. Anguelov, D., 2005. Learning models of shape from 3D range data. Ph. D. Stanford University. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J. and Davis, J., 2005. SCAPE: Shape completion and animation of people. ACM Transactions on Graphics (TOG), 24(3), pp. 408—416. Arinbjarnar, M., and Kudenko, D., 2010. Bayesian networks: Real-time applicable decision mechanisms for intelligent agents in interactive drama. 2010 IEEE Symposium on Computational Intelligence and Games (CIG), pp. 427—434. Arnason, B. and Porsteinsson, A., 2008. The CADIA BML realizer. http://cadia.ru.is/projects/bmlr/. Arya, A., DiPaola, S. and Parush, A., 2009. Perceptually valid facial expressions for character-based applications. International Journal of Computer Games Technology, pp. 1—14. 113 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Aylett, R. S., 2004. Agents and affect: Why embodied agents need affective systems. Methods and Applications of Artificial Intelligence, pp. 496—504. Bajramovic, F. and Denzler, J., 2008. Global uncertainty-based selection of relative poses for multicamera calibrations. British Machine Vision Conference (BMVC), 2, pp. 745—754. Ballard, D., 1981. Generalizing the Hough transform to detect arbitrary patterns. Pattern Recognition, 13(2), pp. 111-122. Bari, M.F., Haque, M.R., Ahmed, R., Boutaba, R. and Mathieu, B., 2011. Persistent naming for P2P web hosting. IEEE International Conference on Peer-to-Peer Computing (P2P), pp. 270—279. Barinova, O., Lempitsky, V. and Kohli, P., 2010. On the detection of multiple object instances using Hough transforms. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2233— 2240. Bartlett, M.S., Littlewort, G.C., Frank, M.G., Lainscsek, C., Fasel, I. and Movellan, J.R., 2006. Automatic recognition of facial actions in spontaneous expressions. Journal of Multimedia, 1(6), pp. 22—35. Batliner, A., Fischer, K., Hubera, R., Spilkera, J. and Noth, E. 2003. How to find trouble in communication. Speech Communication, 40, pp. 117—143. Bay, H., Tuytelaars, T. and Van Gool, L., 2006. SURF: Speeded up robust features. Computer Vision (ICCV), pp. 404—417. Beale, R. and Creed, C., 2009. Affective interaction: How emotional agents affect users. International Journal of Human-Computer Studies, 67(9), pp. 755—776. Beigi, H., 2011. Fundamentals of speaker recognition. New York: Springer. Benbasat, A.Y. and Paradiso, J.A., 2001. An inertial measurement framework for gesture recognition and applications. International Gesture Workshop on Gesture and Sign Languages in HumanComputer Interaction, pp. 77-90. Bevacqua, E., de Sevin, E., Hyniewska, S. J. and Pelachaud, C., To Appear. A listener model: Introducing personality traits. Journal on Multimodal User Interfaces, Special Issue: Interacting ECAs. Blakowski, G., Steinmetz, R. 1996. 
A media synchronization Survey: Reference, Model, Specification and Case Studies. IEEE Journal on selected areas in communication Vol. 16 No. 6 pp. 5-35 Blanz, V. and Vetter, T., 1999. A morphable model for the synthesis of 3D faces. 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187—194. Blunsden, S., Fisher, R.B. and Andrade, E.L., 2006. Recognition of coordinated multi agent activities, the individual vs. the group. Workshop on Computer Vision Based Analysis in Sport Environments (CVBASE), pp. 61—70. Boella, G., Tosatto, S.C., Garcez, A.A., Genovese, V., Ienco, D., and van der Torre, L., 2011. Neural symbolic architecture for normative agents. 10th International Conference on Autonomous Agents and Multiagent Systems, 3, pp. 1203—1204. Boronat, F., Mekuria, R., Montagud, M., Cesar, P. 2012. Distributed media synchronization for shared video watching: issues, challenges, and examples. Social Media Retrieval, Springer Computer Communications and Networks series. Boukricha, H., Wachsmuth, I., Hofstaetter, A. and Grammer, K., 2009. Pleasure-arousal-dominance driven facial expression simulation. 3rd International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 119—125. 114 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Boykov, Y. and Kolmogorov, V., 2003. Computing geodesics and minimal surfaces via graph cuts. International Conference on Computer Vision (ICCV), pp. 26—33. Brandstein, M. and Ward, D., 2001. Microphone Arrays: Signal Processing Techniques and Applications. Springer-Verlag, Berlin. Braathen, B., Bartlett, M.S., Littlewort, G., Smith, E. and Movellan, J.R., 2002. An approach to automatic recognition of spontaneous facial actions. 5th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 360—365. Brave, S., Nass, C. and Hutchinson, K., 2005. Computers that care: Investigating the effects of orientation of emotion exhibited by an embodied computer agent. International Journal of HumanComputer Studies, 62, pp. 161—178. Brooks, C.H., Fang, Y., Joshi, K., Okai, P., and Zhou, X. , 2007. Citepack: An autonomous agent for discovering and integrating research sources. AAAI Workshop on Information Integration on the Web. Bui, T.D., 2004. Creating emotions and facial expressions for embodied agents. Ph. D. University of Twente. Bulterman, D.C.A. and Rutledge, L.W., 2008. SMIL 3.0: Flexible Multimedia for Web, Mobile Devices and Daisy Talking Books. Springer. Burley, B. and Lacewell, D., (2008). Ptex: per-face texture mapping for production rendering. Computer Graphics Forum, 27(4), pp. 1155– 1164. Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U. and Narayanan, S., 2004. Analysis of emotion recognition using facial expressions, speech and multimodal information. 6th International Conference on Multimodal Interfaces (ICMI), pp. 205— 211. Cappé, O., 1994. Elimination of musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Transactions on Speech and Audio Processing, 2(2), pp. 345—349. Carranza, J., Theobalt, C., Magnor, M.A. and Seidel, H., 2003. Free-viewpoint video of human actors. ACM Transactions on Graphics (TOG), 22(3), pp. 569—577. Catanese, S., Ferrara, E., Fiumara, G., and Pagano, F., (2011). A framework for designing 3d virtual environments. Proceedings of the 4th International ICST Conference on Intelligent Technologies for Interactive Entertainment, ACM. Cerekovic, A., Pejsa, T. and Pandzic, I., 2009. 
RealActor: Character animation and multimodal behaviour realization system. 9th International Conference on Intelligent Virtual Agents (IVA), pp. 486—487. Chai, J. and Hodgin, J.K., 2005. Performance animation from low-dimensional control signals. ACM Transactions on Graphics, 24(3), pp. 686—696. Chen, L.S., 2000. Joint processing of audio-visual information for the recognition of emotional expressions in human-computer interaction. PhD Thesis, UIUC. Chum, O. and Matas, J., 2008. Optimal randomized RANSAC. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8), pp. 1472—1482. Chung, T.H., Kress, M. and Royset, J.O., 2009. Probabilistic search optimization and mission assignment for heterogeneous autonomous agents. IEEE International Conference on Robotics and Automation (ICRA), pp. 939—949. Cignoni, P., Montani, C., Rocchini, C., and Scopigno, R., (1998). A general method for preserving 115 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools attribute values on simplified meshes. Proceedings of the Conference on Visualization ‘98. Computer Society Press, pp. 59–66. Codognet, P., 2011. A simple language for describing autonomous agent behaviours. 7th International Conference on Autonomic and Autonomous Systems (ICAS), pp. 105—110. Cohen, J., Olano, M., and Manocha, D., (1998). Appearance-preserving simplification. Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, ACM, pp. 115–122. Comon, P. and Jutten, C., 2010. Handbook of Blind Source Separation, Independent Component Analysis and Applications. Academic Press, Elsevier. Crandall, D., Owens, A., Snavely, N. and Huttenlocher, D., 2011. Discrete-continuous optimization for large-scale structure from motion. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3001—3008. Craven, P.G. and Gerzon, M.A., 1975. Coincident microphone simulation covering three dimensional space and yielding various directional outputs. US. Pat. 4042779. Cui, Y., Li, B. and Nahrstedt, K., 2004. oStream: Asynchronous streaming multi-cast in applicationlayer overlay networks. IEEE Journal on Selected Areas in Communications, 22(1), pp. 91—106. Curless, B. and Levoy, M., 1996. A volumetric method for building complex models from range images. 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 303—312. d’Eon, E. and Irving, G., 2011. A quantized-diffusion model for rendering translucent materials. ACM Transactions on Graphics (TOG), 30(4), p. 56. Darch, J., Milner, B. and Vaseghi, S., 2008. Analysis and prediction of acoustic speech features from mel-frequency cepstral coefficients in distributed speech recognition architectures. Journal of the Acoustic Society of America, 124, pp. 3989—4000. Davis, S. and Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4), pp. 357—366. De La Torre, F. and Cohn, J.F., 2011. Facial expression analysis. Guide to Visual Analysis of Humans: Looking at People, Springer. Demeure, V., Niewiadomski, R. and Pelachaud, C., To Appear. How believability of virtual agent is related to warmth, competence, personification and embodiment? MIT Presence. Dias, J., Mascarenhase, S. and Paiva, A., 2011. FAtiMA modular: Towards an agent architecture with a generic appraisal framework. Standards in Emotion Modelling (SEM). Drescher, C. and Thielscher, M., 2011. 
ALPprolog – A new logic programming method for dynamic domains. Theory and Practice of Logic Programming, 11(4—5), pp. 451—468. Durrieu, J.L., Richard, G., David, B. and F’evotte, C., 2010. Source/filter model for unsupervised main melody extraction from polyphonic audio signals. IEEE Transactions on Audio, Speech and Language Processing, 18(3), pp. 564—575. Durrieu, J.L., David, B. and Richard, G., 2011. A musically motivated mid-level representation for pitch estimation and musical audio source separation. IEEE Journal on Selected Topics in Signal Processing, 5(6), pp. 1180—1191. Eide, E. and Gish, H., 1996. A parametric approach to vocal tract length normalization. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1, pp. 346—348. Eisert, P. and Girod, B., 1998. Analyzing facial expressions for virtual conferencing. IEEE Computer Graphics and Applications, 18(5), pp. 70—78. 116 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Eisert, P., 2002. Model-based camera calibration using analysis by synthesis techniques. 7th International Workshop on Vision, Modeling and Visualization (VMV), p. 307. Eisert, P. and Hilsmann, A., 2011. Realistic virtual try-on of clothes using real-time augmented reality methods. IEEE ComSoc MMTV E-Letter, 6(8), pp. 37—48. Eisert, P. and Rurainsky, J., 2006. Geometry-assisted image-based rendering for facial analysis and synthesis. Elsevier Signal Processing: Image Communication, 21(6), pp. 493—505. Ekman, P., Friesan, W. V. and Hager, J.C., 2002. Facial Action Coding System (FACS). Consulting Psychologists Press, Stanford University, Palo Alto. Ephraim, Y. and Malah, D., 1984. Speech enhancement using a MMSE short-time spectral amplitude estimator, IEEE ASSP-32, pp. 1109—1121. Ephraim, Y. and Malah, D., 1985. Speech enhancement using a MMSE error log-spectral amplitude estimator, IEEE ASSP-33, pp. 443—445. Ephraim, Y. and Malah, D., 1998. Noisy speech enhancement using discrete cosine transform. Speech Communication, 24, 249—257. Farre, M., Wang, O., Lang, M., Stefanoski, N., Hornung, A. and Smolic, A., 2011. Automatic content creation for multiview autostereoscopic displays using image domain warping. IEEE International Conference on Multimedia and Expo (ICME), pp. 1—6. Fechteler, P. and Eisert, P., 2011. Recovering articulated pose of 3D point clouds. 8th European Conference on Visual Media Production (CVMP), London, UK. Fehn, C., Kauff, P., de Beeck, M.O., Ernst, F., Ijsselsteijn, W., Pollefeys, M., Van Gool, L., Ofek, E. and Sexton, I., 2002. An evolutionary and optimized approach on 3D-TV. IBC, 2, pp. 357—365. Fischler, M. and Bolles, R., 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), pp. 381—395. Fontana, S., Grenier, Y. and Farina, A., 2006. A system for head related impulse responses rapid measurement and direct customization. 120th Convention Audio Engineering Society (AES), pp. 1— 20. Frahm, J., 2010. Fast robust large-scale mapping from video and internet photo collections. ISPRS Journal of Photogrammetry and Remote Sensing, 60(6), pp. 538—549. Franco, J.S. and Boyer, E., 2009. Efficient polyhedral modeling from silhouettes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3), pp. 414—427. Furui, S., 2001. Digital Speech Processing, Synthesis and Recognition, 2nd Edition, New York: Marcel Dekker. Furukawa, Y. and Ponce, J., 2009a. 
Carved visual hulls for image-based modeling. International Journal of Computer Vision, 81, p. 5367. Furukawa, Y. and Ponce, J., 2009b. Dense 3D motion capture for human faces. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1674—1681. Gales, M. and Young, S., 2007. The application of Hidden Markov Models in speech recognition. Foundatinos and Trends in Signal Processing, 1, pp. 195—304. Garrett-Glaser, J., 2011. Diary of an x264 Developer. http://x264dev.multimedia.cx/. 117 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Gauvain, J.L. and Lee, C.H., 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2), pp. 291—298. Gavrila, D.M., 1999. The visual analysis of human movement: A survey. Computer Vision and Understanding, 73(1), pp. 82—98. Gebhard, P., 2005. ALMA – A layered model of affect. 4th International Joint Conference on Autonomous Agents and Multi-agent Systems (AAMAS), pp. 29—36. Gillet, O. and Richard, G., 2008. Transcription and separation of drum signals from polyphonic music. IEEE Transactions on Audio, Speech and Language Processing, 16(3), pp. 529—540. Godsill, S. and Rayner, P., 1998. Digitial audio restoration – A statistical model-based approach. Applications of Digital Signal Processing to Audio and Acoustics, pp. 133—194. Graciarena, M., Shriberg, E., Stolcke, A., Enos, F., Hirschberg, J. and Kajarekar, S., 2006. Combining prosodic, lexical and cepstral systems for deceptive speech detection. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1, pp. 1033—1036. Grammer, K. and Oberzaucher, E., 2006. The reconstruction of facial expressions in embodied systems: New approaches to an old problem. ZIF Mitteilungen, 2, pp. 14—31. Gratch, J. and Marsella, S., 2004. A domain independent framework for modeling emotion. Journal of Cognitive Systems Research, 5(4), pp. 269—306. Gratch, J., Marsella, S. and Petta, P., 2009. Modeling the cognitive antecedents and consequences of emotion. Cognitive Systems Research, 10(1), pp. 1—5. Grenier, Y. and Guillaume, M., 2006. Sound field analysis based on generalized prolate spheroidal wave sequences. 120th Convention of the Audio Engineering Society (AES), pp. 1—7. Grosz, B.J. and Kraus, S., 1996. Collaborative plans for complex group action. Artificial Intelligence, 86(2), pp. 269—357. Guillaume, M. and Grenier, Y., 2007. Sound field analysis based on analytical beamforming. EURASIP Journal on Advances in Signal Processing, (1), p. 189. Ha, S., Bai, Y. and Liu, C.K., 2011. Human motion reconstruction from force sensors. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 129—138. Hashimoto. Y., Ishibashi, Y. 2006, Influences of network latency on interactivity in networked rockpaper-scissors. Proceedings of 5th ACM SIGCOMM workshop on Network and system support for games (NetGames '06) Art No. 23 Hasler, N., Stoll, C., Sunkel, M., Rosenhahn, B. and Seidel, H., 2009. A statistical model of human pose and body shape. 30th Conference of the European Association for Computer Graphics (EUROGRAPHICS), 28(2), pp. 337—346. Hayward, V., Astley, O.R., Cruz-Hernandez, M., Grant, D. and Robles-De-La-Torre, G., 2004. Haptic interfaces and devices. Sensor Review, 24(1), pp. 16—29. Heloir, A., and Kipp, M., 2009. EMBR - a real time animation engine for interactive embodied agents. 9th International Conference on Intelligent Virtual Agents (IVA), pp. 