D3.1-Report on available cutting-edge tools
FP7-ICT-287723 - REVERIE
Information and Communication Technologies (ICT) Programme
WP3 - D3.1 Report on Available Cutting-Edge Tools

Leading Author(s): Qianni Zhang, Julie Wall (QMUL)
Status - Version: Version 1.0
Contractual Date: 29 February 2012
Actual Submission Date: 29 February 2012
Distribution - Confidentiality: Public
Code: REVERIE_D3_1_QMUL_V01_20120124.docx

Copyright by the REVERIE Consortium

Disclaimer
This document contains material, which is the copyright of certain REVERIE contractors, and may not be reproduced or copied without permission. All REVERIE consortium partners have agreed to the full publication of this document. The commercial use of any information contained in this document may require a license from the proprietor of that information.

The REVERIE Consortium consists of the following companies:

No  Participant name                                Short name  Role          Country
1   STMicroelectronics                              ST          Co-ordinator  Italy
2   Queen Mary University of London                 QMUL        Contractor    UK
3   CTVC Ltd                                        CTVC        Contractor    UK
4   Blitz Games Studios                             BGS         Contractor    UK
5   Alcatel-Lucent Bell N.V.                        ALU         Contractor    Belgium
6   Disney Research Zurich                          DRZ         Contractor    Switzerland
7   Fraunhofer Heinrich Hertz Institute             HHI         Contractor    Germany
8   Philips Consumer Lifestyle                      PCL         Contractor    Netherlands
9   Stichting Centrum voor Wiskunde en Informatica  CWI         Contractor    Netherlands
10  Institut Telecom - Telecom ParisTech            TPT         Contractor    France
11  Dublin City University                          DCU         Contractor    Ireland
12  Synelixis Solutions Ltd                         SYN         Contractor    Greece
13  CERTH/Informatics and Telematics Institute      CERTH       Contractor    Greece

The information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.

Contributors
Name                 Company
Qianni Zhang         QMUL
Julie Wall           QMUL
Sigurd van Broeck    ALU
Fons Kuijk           CWI
Aljoscha Smolic      DRZ
Petros Daras         CERTH
Rufael Mekuria       CWI
Catherine Pelachaud  IT/TPT
Angélique Drémeau    IT/TPT
Philipp Fechteler    HHI
George Mamais        SYN
Philip Kelly         DCU
Menelaos Perdikeas   SYN
Gaël Richard         IT/TPT
Tamy Boubekeur       IT/TPT

Internal Reviewers
Name             Company
Daniele Alfonso  ST
Petros Daras     CERTH

Document Revision History
Date        Issue  Author/Editor/Contributor  Summary of main changes
24/01/2012  1.0    Q. Zhang                   Document structure
22/02/2012  2.0    J. Wall                    First integrated version
29/02/2012  2.1    D. Alfonso, P. Daras, all  Revised version
29/02/2012  3.0    J. Wall                    Final version
05/03/2012  4.0    J. Wall                    Final version with the new logo

Table of contents
Abbreviations
Executive Summary
1. Related Tools to WP4: Multi-modal and multi-sensor signal acquisition
   1.1. Content creation
   1.2. Multi-modal Capturing
   1.3. Performance Capture
   1.4. Activity recognition
   1.5. 3D reconstruction
   1.6. 3D User-Generated Worlds from Video/Images
2. Related Tools to WP5: Networking for immersive communication
   2.1. Network Architecture
   2.2. Naming
   2.3. Resource Management
   2.4. Streaming
   2.5. Signaling for 3DTI real-time transmission
   2.6. MPEG-V Framework
3. Related Tools to WP6: Interaction and autonomy
   3.1. 3D Avatar Authoring Tools
   3.2. Animation Engine
   3.3. Autonomous Agents
   3.4. Audio and speech tools
   3.5. Emotional Behaviours
   3.6. Virtual Worlds
   3.7. User-system interaction
4. Related Tools to WP7: Composition and visualisation
   4.1. Rendering of human characters
   4.2. Scene recomposition with source separation
   4.3. 3D audio rendering of natural sources
   4.4. Composition and synchronization
   4.5. Stereoscopic and autostereoscopic display
5. References
Abbreviations
3DTI    3D Tele-Immersion
3GPP    3rd Generation Partnership Project
AAC     Augmentative and Alternative Communication
AEGIS   Accessibility Everywhere: Groundwork, Infrastructure, Standards
AI      Artificial Intelligence
API     Application Programming Interface
BA      Bundle Adjustment
BIFS    Binary Format for Scene
BML     Behaviour Markup Language
CCN     Content Centric Networking
CDN     Content Distribution Network
CGAL    Computational Geometry Algorithms Library
CISS    Coding-Based Informed Source Separation
CPM     Componential Process Model
CPU     Central Processing Unit
CUDA    Compute Unified Device Architecture
DASH    Desktop and mobile Architecture for System Hardware
DDC     3D Digital Content
DESAM   Décomposition en Eléments Sonores et Applications Musicales
DIVE    Distributed Interactive Virtual Environment
DMIF    Delivery Multimedia Interface
DNS     Domain Name Server
DoW     Description of Work
E2E     End to End
ECA     Embodied Conversational Agents
EPVH    Exact Polyhedral Visual Hull
ER      Early Reflections
FACS    Facial Action Coding System
FIPS    Federal Information Processing Standard
FML     Function Markup Language
FMTIA   Future Media Internet Architecture Think Tank
FOV     Field of View
GPL     GNU Public License
GPU     Graphics Processing Unit
GSM     Global System for Mobile Communications
GUI     Graphical User Interface
GUID    Globally Unique Identifier
HCI     Human Computer Interaction
HMM     Hidden Markov Model
HRIR    Head-Related Impulse Response
HRTF    Head Related Transfer Functions
IETF    Internet Engineering Task Force
IMS     IP Multimedia Subsystem
IMU     Inertial Measurement Unit
IP      Internet Protocol
IR      Infrared
ISP     Internet Service Provider
ISS     Informed Source Separation
ITU     International Telecommunication Union
LGPL    Lesser General Public License
LM      Levenberg-Marquardt
LR      Late Reverberation
MD      Message Digest
MFCC    Mel-Frequency Cepstral Coefficients
MM      Man Months
MPEG    Moving Picture Experts Group
MPLS    Multiprotocol Label Switching
MST     Minimum Spanning Tree
MTU     Maximum Transmission Unit
MVC     Multi-view Video Coding
NI      Natural Interaction
NP-hard Non-deterministic Polynomial-time hard
NPC     Non-Player Character
NVBG    Non-verbal Behaviour Generator
PAD     Pleasure, Arousal and Dominance model
PCI     Peripheral Component Interconnect
PCL     Point Cloud Library
PHB     Per Hop Behaviour
PMVS    Patch-based Multi-View Stereo
PSTN    Public Switched Telephone Network
QoS     Quality of Service
RC      Reverberation Chamber
REPET   REpeating Pattern Extraction Technique
RFC     Request for Comments
ROS     Robot Operating System
RSVP    Resource Reservation Protocol
RTCP    Real-Time Transport Control Protocol
RTP     Real-Time Transport Protocol
RTSP    Real Time Streaming Protocol
RTT     Round Trip Time
SAL     Sensitive Artificial Listener
SBA     Sparse Bundle Adjustment
SC      Steering Committee
SCAPE   Shape Completion and Animation of People
SDC     Sensory Devices Capabilities
SDP     Session Description Protocol
SE      Sensory Effects
SEC     Sequential Evaluation Checks
SEDL    Sensory Effect Description Language
SEM     Sensory Effect Metadata
SEV     Sensory Effect Vocabulary
SfM     Structure from Motion
SHA     Secure Hash Algorithm
SI      Sensed Information
SIP     Session Initiation Protocol
SLA     Service Level Agreement
SMI     SensoMotoric Instruments
SMIL    Synchronized Multimedia Integration Language
SMILE   Speech & Music Interpretation by Large-space Extraction
STFT    Short-Time Fourier Transform
TCP     Transmission Control Protocol
TMC     Technical Management Committee
TUM     Technische Universität München
UDK     Unreal Development Kit
UDP     User Datagram Protocol
URC     Uniform Resource Characteristics
URI     Uniform Resource Identifier
URL     Uniform Resource Locator
URN     Uniform Resource Name
USEP    User's Sensory Effect Preferences
USS     Underdetermined Source Separation
UUID    Universally Unique Identifiers
VCEG    Video Coding Experts Group
VOIP    Voice Over IP
VR      Virtual Reality
VRML    Virtual Reality Modeling Language
VTLN    Vocal Tract Length Normalization
WP      Work Package
WWW     World Wide Web
W3C     WWW Consortium
YAAFE   Yet Another Audio Features Extractor
Executive Summary

The main objective of REVERIE is to develop an advanced framework for immersive media capturing, representation, encoding and semi-automated collaborative content production, as well as transmission and adaptation to heterogeneous displays, as a key instrument to push social networking towards the next logical step in its evolution: immersive collaborative environments that support realistic inter-personal communication. The targeted framework will exploit technologies and tools to enable end-to-end processing and efficient distribution of 3D, immersive and interactive media over the Internet. REVERIE envisages an ambient, content-centric Internet-based environment, highly flexible and secure, where people can work, meet, participate in live events, socialise and share experiences, as they do in real life, but without time, space and affordability limitations. In order to achieve this goal, the enhancement and integration of cutting-edge technologies related to 3D data acquisition and processing, sound processing, autonomous avatars, networking, real-time rendering, and physical interaction and emotional engagement in virtual worlds is required.

The project consists of eight WPs. The division between WPs has been chosen to group the activity types and skills required to implement the work programme. WP1 includes all management and overall project coordination activities. The 'integration, prototyping and validation' cluster embraces WP2 and WP3, which contain all activities related to requirement analysis, the design of the REVERIE framework architecture and the integration of REVERIE prototype applications. The next cluster, 'adaptation of cutting-edge tools and R&D', contains four WPs (WP4-WP7). The activities in this cluster focus on fundamental research leading to the implementation of the tools needed to instantiate the REVERIE prototypes. The last cluster contains WP8 and is dedicated to activities related to business models, exploitation and dissemination of REVERIE's outcomes.

This deliverable focuses on the cutting-edge technologies related to 3D data acquisition and processing, sound processing, autonomous avatars, networking, real-time rendering, and physical interaction and emotional engagement in virtual worlds.

1. Related Tools to WP4: Multi-modal and multi-sensor signal acquisition

1.1. Content creation

Interactive 3D virtual environments comprise digital content designed or adapted and implemented for use within the intended environment. Whether specially created for the targeted platform or adapted for use, at some point in the process the digital content has been created from a blank canvas.
In “To Infinity and Beyond: The Story of Pixar Animation Studios”, Paik and Iwerks (2007) illustrate the magnitude of the task: “In the computer, nothing is free. A virtual world begins as the most purely blank slate one can imagine. There are no sets, no props, no actors, no weather, no boundaries, not even the laws of physics. The computer is not unlike a recalcitrant genie, prepared to carry out any order you give it – but exactly and no more than that. So not only must every item be painstakingly designed and built and detailed from scratch, every physical law and property must be spelled out and applied.”

Realisation of interactive virtual environments is thus both technically complex and artistically challenging. Moreover, no single set of definitive, all-encompassing tools currently exists to produce high-quality, meaningful results without significant technical expertise and substantial time expenditure. This leaves ample room for the improvements that WP4 is attempting to address. WP4 focuses on all aspects related to the capturing of real world objects, scenes and environments and to the interaction between real and virtual worlds (REVERIE, 2011). This section reports on the current state-of-the-art cutting-edge tools available to create content for interactive virtual worlds.

The digital content (assets) necessary to populate virtual environments can include the background environment, set, scene objects, characters and props; some require animation, and some are simply static objects. A common approach for content creation is to first create assets in a 3D Digital Content Creation (DDC) package and optimise them for real-time use, and then pass them on to an interactive 3D development tool or framework to add interactivity and behaviours and to implement the interactive world itself. Figure 1 illustrates an overview of the initial steps involved.

The stages shown in Figure 1 begin with creating a 3D mesh and texturing the mesh to provide realistic or stylised visual detail. Depending on whether animation is required for the asset, the mesh is prepared for animation by adding a control structure to the mesh (rigging), and then animated as explained in more detail in the following paragraphs. Sound (usually edited in a sound editing package) can be added to enable lip-synching or other synchronisation of animation to sound. Techniques for creating the 3D model can include modelling and sculpting using dedicated digital content creation software packages, or creating the mesh through the use of scanning or other 3D reconstruction techniques. Low polygon meshes are generally used within the actual real-time environment, although high polygon models are often used in the production stages to improve appearance: the more detailed normal maps obtained from the high polygon model are applied to the lower polygon model. This enables the lower polygon model to mimic its higher polygon counterpart without the extra polygon count burden (Cignoni et al. 1998; Cohen et al. 1998).

Once the model is created, shaders (also known as materials) are applied to define the surface shading of the mesh. This is normally achieved through a combination of the basic attributes of the shader, such as transparency and specularity, plus any applied textures.
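As a purely illustrative aside (not part of the deliverable), the NumPy sketch below shows how a simple material might combine a basic shader attribute (specularity) with a colour sampled from an applied texture, using a Lambert diffuse term plus a Blinn-Phong specular term; all function names and values are hypothetical.

```python
import numpy as np

def sample_texture(texture, uv):
    """Nearest-neighbour lookup of an RGB texel for a UV coordinate in [0, 1]^2."""
    h, w, _ = texture.shape
    x = min(int(uv[0] * (w - 1)), w - 1)
    y = min(int(uv[1] * (h - 1)), h - 1)
    return texture[y, x]

def shade(normal, light_dir, view_dir, uv, texture, specularity=0.5, shininess=32):
    """Combine a diffuse texture lookup with basic material attributes
    (Lambert diffuse + Blinn-Phong specular)."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    base_colour = sample_texture(texture, uv)              # albedo from the UV-mapped texture
    diffuse = max(np.dot(n, l), 0.0) * base_colour         # Lambertian term
    half_vec = (l + v) / np.linalg.norm(l + v)             # Blinn-Phong half vector
    specular = specularity * max(np.dot(n, half_vec), 0.0) ** shininess
    return np.clip(diffuse + specular, 0.0, 1.0)

# Example: a flat 2x2 texture and a surface facing the light.
tex = np.ones((2, 2, 3)) * [0.8, 0.2, 0.2]
print(shade(np.array([0, 0, 1.0]), np.array([0, 0, 1.0]),
            np.array([0, 0, 1.0]), (0.5, 0.5), tex))
```

In a real engine this evaluation happens per fragment on the GPU; the sketch only makes the "attributes plus textures" combination explicit.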
Textures can provide added realism or stylised detail such as complex colour, reflections, coarse and fine displacements, specular intensity and surface relief. Textures can be procedurally driven, or be bitmaps created from photographs or from a 2D/3D paint package. The process of applying texture maps usually involves creating UV maps to map the 2D texture information onto the 3D mesh. Packages like Autodesk Maya and Autodesk 3ds Max provide facilities for UV mapping but, like many other parts of the content creation process, creating UV maps can be complex and time consuming and requires a high degree of skill. Researchers at Walt Disney Animation Studios and the University of Utah developed an alternative solution called PTex: Per-Face Texture Mapping for Production Rendering (Burley and Lacewell, 2008). PTex eliminates the need for UV assignment by storing texture information per quad face of the subdivision mesh, along with a per-face adjacency map, in a single texture file per surface. The adjacency data is used to perform seamless anisotropic filtering of multi-resolution textures across surfaces of arbitrary topology. Although originally targeted at off-line rendering, adaptations have since been made for use in real time (McDonald and Burley 2011).

Figure 1: Content Creation Pipeline Stage 1: Creating Assets for real-time use

Realistic lighting can prove computationally expensive, and therefore a common practice is to pre-light scenes, incorporating view-independent light information back into the textures. This process is known as baking texture light maps. Lighting is then removed from the scenes, and the original materials are replaced with the baked textures containing the pre-calculated lighting solution, to provide the appearance of realistic light without adding the computational burden of calculating the lighting solution in real time (Myszkowski and Kunii, 1994; Moller, 1996). Figure 2 shows an example of a baked texture, with the original tiled texture on the left, and the tiles with pre-calculated lighting baked into the texture on the right.

Once the asset has been modelled, if animation is required then the asset must normally be prepared through the design and implementation of a rig: control structures set up to articulate the mesh. Rigging requires associating the vertices of the geometric mesh of the asset with hierarchical structures, such as internal virtual skeletons, iconic controls, blend shapes and other deformers, to facilitate articulation of the mesh without the animator having to resort to complex direct manipulation of the individual vertices that make up the mesh; see Figure 3. When a virtual skeleton joint system is used, the vertices of the mesh are associated with joints through a process known as skinning. Different joints will have weighted influences on the vertices depending on their location.

Figure 2: Original texture (left), baked texture with pre-calculated light (right)

Rigging is another time-consuming task in the pipeline that requires a high degree of skill, and therefore attempts have been made to provide auto-rigging tools. For example, Mixamo (http://www.mixamo.com/) is a tool to auto-rig characters which are then compatible for use in the popular game engine Unity (http://unity3d.com/).

Figure 3: Stylised character & rig (available from http://www.creativecrash.com/maya/downloads/character-rigs/c/blake)
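To make the skinning step described above concrete, here is a minimal NumPy sketch (illustrative only, not taken from any of the packages mentioned) of linear blend skinning: each vertex is deformed by a weighted combination of the transforms of the joints it is bound to. The matrices and weights are toy values.

```python
import numpy as np

def skin_vertices(vertices, weights, joint_transforms):
    """Linear blend skinning: deform each rest-pose vertex by the
    weighted sum of its joints' 4x4 skinning matrices.
    vertices:          (V, 3) rest-pose positions
    weights:           (V, J) per-vertex joint weights, rows sum to 1
    joint_transforms:  (J, 4, 4) current-pose * inverse-bind matrices
    """
    v_h = np.hstack([vertices, np.ones((len(vertices), 1))])        # homogeneous coords (V, 4)
    blended = np.einsum('vj,jab->vab', weights, joint_transforms)   # blend matrices per vertex
    skinned = np.einsum('vab,vb->va', blended, v_h)                 # apply blended transform
    return skinned[:, :3]

# Toy example: one vertex influenced equally by a static joint and a joint
# translated one unit along x; the skinned vertex moves half a unit.
rest = np.array([[0.0, 1.0, 0.0]])
w = np.array([[0.5, 0.5]])
T0 = np.eye(4)
T1 = np.eye(4); T1[0, 3] = 1.0
print(skin_vertices(rest, w, np.stack([T0, T1])))   # -> [[0.5, 1.0, 0.0]]
```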
Rig compatibility challenges can also arise when dealing with motion capture data. One skeleton may be needed for the actual motion capture, another for the 3D DCC package, and another to deal with refining the animation in a package like Autodesk's MotionBuilder, a real-time 3D character animation software which provides an industry-leading toolset for manipulating motion capture data. Zhang et al. (2010) have taken on the task of simplifying the process of dealing with motion capture data across rigs, by proposing a software pipeline that converts the skeleton rigs to adhere to the motion capture rig.

All stages in the pipeline discussed so far need to take into account optimising for the end goal, i.e. implementation as an interactive non-linear presentation in real time. This can include double-checking geometry for flaws, optimising textures, baking animation, and removing rig elements that are incompatible with the interactive 3D development tool. Furthermore, DCC packages save the file data in their own proprietary formats. In order to be passed on through the process into a framework, game engine, or virtual environment development tool, the assets normally have to be exported into a common format. COLLADA (https://collada.org/mediawiki/index.php/COLLADA_-_Digital_Asset_and_FX_Exchange_Schema) and FBX (http://usa.autodesk.com/) are two popular digital asset exchange formats.

Popular engines to create interaction and implement the environments include Unreal (http://www.unrealengine.com/), Crytek's CryENGINE (http://www.crytek.com/), Unity (http://unity3d.com/), Vision (http://www.trinigy.net/) and BlitzTech (http://www.blitzgamesstudios.com/), which work optimally with hand-crafted 3D assets (REVERIE, 2011). Such engines provide toolsets to deal with scripting and imported assets, incorporate effects, real-time shaders and sophisticated lighting, and integrate sound; see Figure 4. Given that no definitive solution exists to produce content for virtual worlds, a wide variety of different solutions currently exist to create such environments, including sophisticated off-the-shelf tools, proprietary software and bespoke development frameworks. Some well-established virtual worlds such as SecondLife (http://secondlife.com/) enable users to populate virtual worlds with content created with incorporated proprietary toolsets, in addition to importing geometric meshes from standard off-the-shelf 3D DDC packages such as Autodesk Maya and Autodesk 3ds Max.

Figure 4: Content Creation Pipeline Stage 2: Interaction Pipeline

Semertzidis et al. (2011) have devised a system that takes advantage of a simpler and more natural interface. Rather than handcrafting the 3D models, the researchers have devised an innovative system for creating 3D environments from existing multimedia content. The user sketches the desired scene in 2D, and the system conducts a search over content-centric networks to fetch 3D models that are similar to the 2D sketches. The 3D scene is then automatically constructed from the retrieved 3D models. Although innovative, it has two main limitations: one is that it focuses on existing models, therefore ruling out the possibility of populating the world with new creations; the second is that no mention is made of handling animation.

Autodesk's 123D Catch (http://usa.autodesk.com/) minimises the expertise needed to create 3D models.
123D Catch enables users to take photographs of desired objects from multiple viewpoints, and have them stitched together in the cloud to generate high quality 3D models incorporating the photographs as textures.

Many other attempts have been made to facilitate creating virtual environments. Catanese et al. (2011) have devised an open-source framework which adapts existing platforms to facilitate the design of 3D virtual environments. The framework includes OGRE (3D rendering engine), OgreOggSound (audio library which acts as a wrapper for the OpenAL API), PhysX (physics simulator) and NxOgre (physics connector library), combined with Blender (cross-platform open-source graphics and modelling application) extended with customised plugins. Fitting into the pipeline discussed earlier, the scene itself is designed and created in Blender and then passed on to the appropriate OGRE-based managers. Liu et al. (2009) use a similar customised approach based on OGRE, RakNet (cross-platform, open-source, C++ networking engine) and an Autodesk 3ds Max pipeline. Varcholik et al. (2009) propose using the Bespoke 3DUI XNA Framework as a low-cost platform for prototyping 3D spatial interfaces in video games. Their framework extends the XNA Framework, Microsoft's set of managed libraries designed for game development based on the Microsoft .NET Framework.

Adhering to the format illustrated by Figure 1 and Figure 4, other researchers have utilised a pipeline that includes creating the assets and then passing them on to an interactive development kit to add further functionality and implementation. For example, pipelines involving 3ds Max and 3DVIA Virtools (a virtual environment development and deployment platform) have been used by researchers investigating anxiety during navigation tasks within virtual environments (Maiano et al. 2011). The 3ds Max and Virtools pipeline was also used to create virtual environments to treat social phobia (Roy et al. 2003). In other studies focusing on anxiety, Robillard et al. (2003) created therapeutic virtual environments derived from computer games. Half-Life (1998-2000) was used to custom-make arachnophobia environments and to populate them with animated spiders of different shapes and sizes. Acrophobia and claustrophobia environments were based on Unreal Tournament (2000). Using a dedicated virtual environment development tool, Slater et al. (1999) created virtual environments to provide research participants with a simulation of a public speaking environment for research into fear of public speaking. VRML (Virtual Reality Modeling Language) assets were created, and DIVE (Distributed Interactive Virtual Environment) was used to implement the environments (Slater et al. 1999; Pertaub et al. 2001; Pertaub et al. 2002). The Unity 3D development platform was recently used to realise an immersive journalism experience, presented at the 2012 Sundance Film Festival (Sand Castle Studios LLC, 2012), consisting of a virtual recreation of an eyewitness account of a real food bank line incident. Participants can use a head-mounted display to walk around and interact with the virtual reproduction.

As this section has discussed, given that there is no ideal way of creating content, a wide variety of approaches have been adopted to create content for virtual environments. The next section provides more detail on the cutting-edge tools currently available.
Available tools from the literature and REVERIE partners

Autodesk Maya / 3ds Max
Description: Autodesk Maya and 3ds Max are industry-standard 3D content creation and animation software for animation, modelling, simulation, visual effects, rendering, etc. Both are widely used for film production and real-time games. Maya is Windows, Linux and Mac OS X compatible. 3ds Max is primarily Windows based, but Max 2012 can be run on a Mac using a Windows partition.
http://usa.autodesk.com/

Blender
Description: Blender is a free open source 3D content creation suite. Blender is compatible with Windows, Linux and Mac OS X.
http://www.blender.org/

Autodesk 123D Catch
Description: Autodesk's 123D Catch enables users to take photographs of desired objects from multiple viewpoints, and have them stitched together in the cloud to generate high quality 3D models incorporating the photographs as textures.
http://usa.autodesk.com/

Autodesk MotionBuilder
Description: Autodesk MotionBuilder is a real-time 3D character animation software widely used in film and game animation pipelines. Complex character animation, including motion capture data, can be edited and played back in a highly responsive, interactive environment. MotionBuilder is ideal for high-volume animation and also includes stereoscopic toolsets.
http://usa.autodesk.com/

Autodesk Mudbox
Description: Autodesk Mudbox is a 3D digital sculpting and digital painting software that enables creation of production-ready 3D digital artwork for ultra-realistic 3D character modelling, engaging environments, and stylised props. Available for Mac, Microsoft Windows, and Linux operating systems.
http://usa.autodesk.com/

Pixologic ZBrush
Description: ZBrush is a digital sculpting and painting program that offers highly advanced tools for digital artists. ZBrush provides the ability to sculpt up to a billion polygons.
http://www.pixologic.com/home.php

Unity
Description: Unity is a development platform for creating games and interactive 3D for multi-platform deployment on the web, iOS (iPhone/iPod Touch/iPad), Mac, PC, Wii, Xbox 360, PlayStation 3, and Android. Its scripting language support includes JavaScript, C#, and a dialect of Python named Boo. Unity also contains the NVIDIA PhysX physics engine. All major tools and file formats are supported, and it has the added advantage of being able to import Autodesk 3ds Max or Autodesk Maya files directly, converting them to FBX format automatically. The assets in a Unity project can continue to be updated in their creation software, and Unity will update automatically upon save, even while the game is being played inside the Editor.
http://unity3d.com/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

Unreal Development Kit (UDK)
Description: UDK is a complete professional development framework which produces advanced visualisations and detailed 3D simulations on the PC and iOS.
http://www.udk.com/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

OGRE (Open Source 3D Graphics Engine)
Description: OGRE is a scene-oriented, flexible 3D engine designed to produce 3D interactive applications.
http://www.ogre3d.org/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

Crytek
Description: CryENGINE is an advanced development solution that enables the creation of games, movies, high-quality simulations and interactive applications. It provides an all-in-one game development solution for the PC, Xbox 360 and PlayStation 3.
http://www.crytek.com/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

Vision
Description: The Vision game engine enables the creation of games on most major platforms (PC, consoles, browsers, handhelds, mobiles), as well as services such as XBLA, PlayStation Network, and WiiWare.
http://www.trinigy.net/products/vision-engine
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

BlitzTech
Description: The BlitzTech game engine offers cross-platform runtime code which provides all hardware-specific and common code for game titles, supporting PC, PSP, Xbox 360, PS3, Wii, Mac OS X, browsers, PlayStation Vita, iOS, Android and emerging platforms. Highly integrated with the BlitzTech Tools, it facilitates rapid prototyping and development by providing a common code framework right out of the box.
http://www.blitzgamesstudios.com/blitztech/engine/
Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max / Blender.

1.2. Multi-modal Capturing

3D motion capture

Optical motion capture systems, such as Vicon or Codamotion, are typically considered the gold standard techniques for capturing the movement of human subjects. Both these systems are optical Infrared (IR)-based depth capture techniques that use a number of cameras to track the temporal positions of a number of markers fixed directly onto a human body or onto a tight-fitting suit. Passive reflective markers are used in Vicon, whereas active IR LED-based markers are used in the Codamotion system. The results obtained from either system are highly accurate, typically reconstructing human motion to within a few millimeters of the ground truth motion, and have been used extensively in the computer animation, sports science and bio-mechanics communities. However, these systems tend to be highly expensive and typically require trained expert users to capture data. Chai and Hodgins (2005) proposed a smaller scale optical capture system that needed far fewer markers and only two cameras. A database of poses was used to do lazy learning and infer the person's pose using only the locations of a few tags on the body. However, the accuracy of this approach is significantly reduced if the captured subject is not directly facing the capture device. The recent emergence of the Kinect has provided controller-free full-body gaming in home environments using image processing techniques coupled with IR-based depth capture hardware. As opposed to Vicon, the Kinect acquires a depth map of the individual it is surveying and infers human motion and actions directly from this depth map. These devices are cheap, easy to use, and can track the limb motions of a tracked individual if they are facing the capture device (the device's accuracy can be significantly reduced if the captured subject takes up alternative orientations). These advantages have resulted in significant adoption of the device in the computer gaming community.
Finally, Shiratori et al. (2011) take an interesting alternative approach, instrumenting an actor with multiple outward-looking cameras and acquiring the pose and global position of the subject using image processing techniques.

However, capturing motion with optical systems can be impractical and cumbersome in some scenarios. Performance capture within large spatial volumes, in particular, is challenging because it may be difficult to densely and safely populate the capture area with enough cameras to ensure adequate coverage (Kinect-based capture tends to decrease significantly in accuracy after 8 metres). Additional complications arise during outdoor data capture sessions, where there is little control over other variables such as lighting conditions, occlusions and even a subject's movements in or out of capture areas, all of which decrease the robustness of optical marker tracking algorithms. As such, the use of optical motion capture systems tends to be limited to indoor settings. Furthermore, the effectiveness of optical motion capture systems can be further hampered when strict constraints on reconstruction time leave little scope for manual intervention to correct artifacts in the captured motions. This is especially true when motion from high-velocity movements is required and tracking (of markers or limbs) can easily be lost. In addition, Kinect reconstruction accuracy can be hampered for body orientations that are not perpendicular to the camera's principal axis (such as when a user lies on the ground) and for fine-grained movements that do not significantly deform either the depth or shape silhouette (such as rotations around the longitudinal axis of bones, e.g. twisting a wrist or foot). A review of the technologies available for motion capture is provided in (Welch and Foxlin, 2002), while a review of over 350 publications on computer vision based approaches to human motion capture is provided in (Moeslund et al. 2006).

Since lighting changes, coverage and occlusion introduce robustness issues for most optical systems, acquiring motion from alternative sensor technologies has been investigated. Sensor data from accelerometers (Slyper and Hodgins, 2008; Tautges et al. 2011), Inertial Measurement Units (IMUs) (Vlasic et al. 2007) and pressure sensors (Ha et al., 2011) have been employed to capture the biomechanical motion of a human performer in real time. These techniques have been employed in a variety of applications where traditional motion capture techniques are unfeasible, including real-time game interaction and biomechanical analysis of injury rehabilitation exercises in the home, for example.

Techniques that use wearable sensor data can be split into two general groups. The first category acquires temporal limb and body positions of a subject by directly interpreting the acquired sensor data; a well-known use of this technology is the Nintendo Wii controller. When held in a static position, a tri-axial accelerometer outputs a gravity vector that points towards the earth. This information alone is enough to determine the sensor's pitch and roll. In prior work, the assumption of low-acceleration motion has been used to determine body limb positions using wearable accelerometers (Lee and Ha, 2001; Tiesel and Loviscach, 2006). Combining an accelerometer with a magnetometer or a digital compass additionally allows sensor yaw to be determined. A gyroscope can be used to account for high-acceleration motions.
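As an illustration of the point above, the following minimal sketch (not from the deliverable) estimates pitch and roll from a static tri-axial accelerometer reading; the axis convention and sign choices are assumptions.

```python
import numpy as np

def tilt_from_accelerometer(ax, ay, az):
    """Estimate pitch and roll (in degrees) from a static tri-axial
    accelerometer sample, assuming the only measured acceleration is gravity
    and an x-forward, y-left, z-up axis convention."""
    pitch = np.degrees(np.arctan2(-ax, np.hypot(ay, az)))   # rotation about the y axis
    roll = np.degrees(np.arctan2(ay, az))                   # rotation about the x axis
    return pitch, roll

# A sensor lying flat reads roughly (0, 0, +1 g): no tilt.
print(tilt_from_accelerometer(0.0, 0.0, 1.0))      # -> (0.0, 0.0)
# Tilted 45 degrees nose-down about the y axis.
print(tilt_from_accelerometer(0.707, 0.0, 0.707))  # -> (~-45.0, 0.0)
```

Yaw cannot be recovered this way, which is why a magnetometer (and a gyroscope for fast motion) is added, as the text notes.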
The combination of all three sensors into one unit is generally referred to as an IMU and has been incorporated into commercial products (Xsens). However, purely inertial systems are prone to drift in accuracy over time, especially if fast movements with high accelerations are performed, although incorporating even further sensors on the body can reduce the negative impact of this drift (Vlasic et al. 2007). In addition, although many of these approaches can acquire the motion of subjects' limbs, they are unable to accurately synthesise the temporal location and movement of a person through an environment. However, a trade-off exists, as these devices can be cheap, can operate outside of the lab, and free-form motion can be obtained in some scenarios in real time. In such scenarios, if human actions are required as well as human motion, then machine learning techniques are also used to infer the actions directly from the motion (e.g. if in tennis a significant forward motion of the right arm is determined over a period of time, then a forehand shot action can be inferred).

The second category of techniques is data-driven, typically using accelerometer sensor data to index into a motion database and animate motion segments that exhibit similar inertial motion data to those acquired by the worn sensors (Slyper and Hodgins, 2008; Kumar, 2008; Tautges et al. 2011). Each of these techniques animates an avatar in near-real time using readings from wearable accelerometers and a database of pre-recorded motion of the target activities. These techniques are typically not affected by drift, even when only accelerometer sensors are used, making them applicable to athletes or on-camera actors, who can tolerate only a lightweight and hidden form of instrumentation. The trade-off for obtaining high quality capture from lightweight instrumentation is that these approaches require a pre-captured database of short segments of motion that cover the basic behaviours expected from the subject during the capture. As such, an inherent limitation of these techniques is their inability to reconstruct free-form motion that is significantly different from the movements stored in the motion database. However, if the database of motions is labeled with specific actions, then not only motion but also actions can be acquired without the need for machine learning techniques.

Audio capture

Sound capturing is performed by means of microphones. A large variety of microphones exist. They are classified either by their specific mode of transduction of the sound wave into an electric signal or by their directivity characteristics. For example, omnidirectional directivity is obtained if the sensor captures the overall sound pressure. To obtain a specific directivity, a pressure gradient needs to be measured (Rossi, 1986).
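The directivity patterns discussed later in this section (omnidirectional, cardioid and more directional variants) can be summarised by the standard first-order family p(θ) = a + (1 − a)·cos θ. The sketch below (illustrative only; frequency dependence is ignored) evaluates that family for a few pattern parameters.

```python
import numpy as np

def first_order_pattern(theta, a):
    """Normalised sensitivity of a first-order microphone at angle theta
    (radians) for pattern parameter a: a=1 omnidirectional, a=0.5 cardioid,
    a=0 figure-of-eight (pure pressure-gradient).  Real capsules deviate
    from this at very high frequencies."""
    return np.abs(a + (1.0 - a) * np.cos(theta))

angles = np.radians([0, 90, 180])
print(first_order_pattern(angles, 1.0))   # omni:     [1, 1, 1]
print(first_order_pattern(angles, 0.5))   # cardioid: [1, 0.5, 0]
print(first_order_pattern(angles, 0.0))   # figure-8: approx. [1, 0, 1]
```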
In practice, the use of microphones in a closed space is directly associated with two essential concepts in room acoustics: reverberation and critical distance. The sound field at the microphone is composed of a direct sound field (i.e. the direct path between microphone and sound source) and a diffuse field, also called the reverberation field (composed of the signals produced by the sound source after having been altered by reflection, diffusion and diffraction). The principal goal of sound capturing is to capture the direct sound field. However, capturing the diffuse sound field may be either desirable or problematic, depending on the situation or use case. Indeed, it is desirable to capture part of the diffuse field to reproduce the ambiance in which the sound sources are placed or to reproduce specific room characteristics (this is often desired for classical music or live pop music recordings). In other situations, however, the diffuse sound field may principally contain sources of perturbation (such as noises and concurrent voices), which are problematic and lower the speech intelligibility or the music recording quality.

Another concept of interest is the capability to record an audio scene with the spatial information of each source. This is usually done by using multiple microphones. It is finally worth noting that a large number of aspects of audio recording are influenced by the capturing process itself; thus it is possible to adopt a number of sound processing techniques that would improve the sound recording (such as noise reduction and source separation).

Conventional stereophony

Conventional stereophony encodes the relative position of the sound sources (during the recording) in terms of intensity and delay differences between the two signals that would be emitted by the two loudspeakers. This is commonly obtained, during the recording, by using two microphones of different directivity which are either placed at different positions or oriented in different directions. When the microphones are placed in different locations we refer to a "couple of non-coincident microphones". In the other case, when the two microphones are co-located, we refer to a "couple of coincident microphones".

Figure 5: Schematic diagram of coincident microphones

Figure 6: Stereophonic AB-ORTF couple
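As a toy illustration of the coincident-pair case (not taken from the deliverable), the sketch below models two cardioid capsules angled at ±45° and computes the inter-channel level difference they encode for a source at a given azimuth; the capsule angle and cardioid pattern are assumptions.

```python
import numpy as np

def cardioid(theta):
    """First-order cardioid sensitivity at angle theta (radians)."""
    return 0.5 + 0.5 * np.cos(theta)

def coincident_pair_gains(source_azimuth_deg, capsule_angle_deg=45.0):
    """Channel gains of a coincident (XY) cardioid pair for a source at the
    given azimuth.  Coincident capsules produce only intensity differences;
    a non-coincident pair (e.g. AB/ORTF) would additionally encode delays."""
    az = np.radians(source_azimuth_deg)
    cap = np.radians(capsule_angle_deg)
    g_left = cardioid(az - cap)    # left capsule aimed at +45 degrees
    g_right = cardioid(az + cap)   # right capsule aimed at -45 degrees
    return g_left, g_right

gl, gr = coincident_pair_gains(30.0)            # source 30 degrees to the left
print(gl, gr, 20 * np.log10(gl / gr), "dB")     # inter-channel level difference
```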
Surround sound recording

When an improved surround sound experience is desired, it is common practice to increase the number of microphones. This is indeed the case for sound recording in the Ambisonics format (called UHJ or B-Format for the 1st order approximation), which can be obtained, for example, through a sound-field microphone with 4 sensors. More information on Ambisonics recordings and synthesis can be found at http://www.ambisonic.net/ and http://en.wikipedia.org/wiki/Ambisonics. Using the sound-field microphone, we obtain four signals from the microphone capsules; this is often referred to as A-Format. The resulting signals are then further processed and encoded in B-Format; for example, see (Craven and Gerzon, 1975).

Figure 7: Schematic diagram of a 4-sensor sound-field microphone

Figure 8: Plug-and-Play setup for Surround Ambience Recording: ORTF Surround Outdoor Set with wind protection (http://www.schoeps.de/en/products/ortf-outdoor-set)

Microphone directivity

As briefly outlined above, microphones can be classified according to their directivity characteristics. We provide below typical directivity patterns for specific microphones from three different categories: omnidirectional, cardioid, and shotgun. The following directivity pattern is typical of an omnidirectional microphone, where the sensitivity is similar in all directions except for very high frequencies above 10 kHz.

Figure 9: Directivity pattern for microphone U89 in omnidirectional position

The directivity pattern of cardioid microphones is clearly asymmetric.

Figure 10: Directivity pattern for microphone U89 in cardioid position

The directivity pattern of shotgun microphones is close to hypercardioid. Shotgun microphones are known to be quite robust to wind interference in outdoor recordings.

Figure 11: Directivity pattern for shotgun microphone KMR 82

Sound capturing and critical distance

When dealing with sound capturing, it is of primary importance to assess the impact of the microphone selectivity on the recording. This aspect is particularly important in outdoor recordings, where noise sources may be prominent, and in indoor situations, where the room has a strong influence on the sound characteristics. For indoor recordings, the concept of critical distance is particularly relevant. The room impulse response clearly depends on the location of the sound source and on the microphone (or sensor) location. The sound captured at the microphone is the sum of several components: the direct sound field, the early reflections (ER) and the diffuse sound field (or reverberation). A key aspect of sound recording is how the sound field is modified when the sources or sensors are moving. It is possible to show that the diffuse sound field is relatively independent of the source and microphone locations, while the direct sound field component largely depends on those positions. The critical distance is defined as the distance at which the energy of the direct sound field is equal to the energy of the diffuse sound field. It is therefore possible to compare the critical distance of several microphone types, obtaining an indication of their appropriate use in real recordings.

Figure 12: Influence of directivity on the critical distance

Noise reduction

Noise reduction has been a field of intense research for many decades. The research domain is often subdivided depending on the number of microphones or sensors used. Spatial information is not available when only one microphone is employed; therefore, the resulting performance is of lower quality than what can be attained when multiple sensors are employed. In most situations, it is possible to collect an example of the background noise (without the signal of interest), and thus to estimate the background noise spectral properties. For audio signals, an excerpt of 0.5 seconds is typically sufficient to obtain a satisfactory estimate of the background noise spectral properties. It is, however, necessary to automatically detect such portions of the signal, where only the background noise is present. This is usually done by considering that the noise has different statistical properties than the signal of interest. This type of approach is well suited for stationary noises (noises whose first- and second-order statistical properties are independent of time). In practice, the stationarity hypothesis is only valid over finite time horizons, and it is thus necessary to rely on adaptive algorithms which adaptively estimate the noise spectral properties. While in some cases noise reduction can benefit from a source production model (in particular, this is possible for noise reduction of speech signals), in general it is not possible to exploit such models due to the high variability of audio sources. However, methods initially introduced for speech signals, which rely on the fact that the signal of interest is periodic or quasi-periodic, will also work to a certain extent for general audio signals. Such methods are usually based on the principle of spectral subtraction, which consists in lowering or even suppressing the low-energy components (or those which lie between harmonics) of the short-term Fourier transform. The interested reader may consult the following references: (Cappé, 1994; Ephraim and Malah, 1984; Ephraim and Malah, 1985; Ephraim and Malah, 1998; Godsill and Rayner, 1998).
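A compact NumPy sketch of the single-channel approach just described is given below (illustrative only, not any specific published algorithm): a background-noise magnitude spectrum is estimated from a signal-free excerpt and then subtracted, frame by frame, from the short-term spectrum of the noisy recording. The frame length, overlap, flooring rule and the omission of window normalisation are arbitrary simplifications.

```python
import numpy as np

def spectral_subtraction(noisy, noise_excerpt, frame=512, hop=256, floor=0.05):
    """Single-channel noise reduction by magnitude spectral subtraction.
    The noise spectrum is estimated from a signal-free excerpt (e.g. ~0.5 s
    of background noise) and subtracted from each STFT frame."""
    win = np.hanning(frame)

    def stft_frames(x):
        n = 1 + max(0, len(x) - frame) // hop
        return np.stack([np.fft.rfft(win * x[i * hop:i * hop + frame])
                         for i in range(n)])

    noise_mag = np.abs(stft_frames(noise_excerpt)).mean(axis=0)   # average noise spectrum
    frames = stft_frames(noisy)
    mag, phase = np.abs(frames), np.angle(frames)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)          # subtract, keep a small floor
    clean_frames = clean_mag * np.exp(1j * phase)                 # reuse the noisy phase

    out = np.zeros(len(noisy))
    for i, f in enumerate(clean_frames):                          # overlap-add resynthesis
        out[i * hop:i * hop + frame] += win * np.fft.irfft(f, frame)
    return out

# Toy usage: a sine buried in white noise, with a noise-only excerpt for estimation.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
noise = 0.3 * rng.standard_normal(t.size)
enhanced = spectral_subtraction(np.sin(2 * np.pi * 440 * t) + noise, noise[:8000])
```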
When multiple microphones are available, the spatial information can also be exploited. The most widely used principle consists of re-synchronising the signals from the multiple microphones and then adding the different contributions. As a result, the signal of interest is amplified, while the undesired noise signals will not, in general, add constructively. An example of such a setting for a linear array of sensors is shown in Figure 13.

Figure 13: Schematic representation of beamforming in a linear microphone array

Improved noise-reduction performance can be obtained with such arrays. Note that a wide range of combinations and structures can be designed for microphone arrays. The interested reader may in particular consult (Brandstein and Ward, 2001).

Source separation

This brief overview of source separation is extracted from (Mueller et al. 2011). The goal of source separation is to extract all individual sources from a mixed signal. In a musical context, this translates to obtaining the individual track of each source or instrument (or individual notes for polyphonic instruments, such as a piano). A number of excellent overviews of source separation principles are available; see (Virtanen, 2006; Comon and Jutten, 2010). In general, source separation refers to the extraction of full-bandwidth source signals, but it is interesting to mention that several polyphonic music processing systems rely on a simplified source separation paradigm. For example, a filter bank decomposition (splitting the signal into adjacent, well-defined frequency bands) or a mere harmonic/noise separation (Serra, 1989) (as used for drum extraction (Gillet and Richard, 2008) or tempo estimation (Alonso et al. 2007)) may be regarded as instances of rudimentary source separation.

Three main situations occur in source separation problems: the determined case corresponds to the situation where there are as many mixture signals as different sources in the mixtures; conversely, the overdetermined/underdetermined cases refer to situations where there are more/fewer mixtures than sources. Underdetermined Source Separation (USS) is obviously the most difficult case. The problem of source separation classically includes two major steps that can be realised jointly: estimating the mixing matrix and estimating the sources. Let X = [x_1(n), ..., x_N(n)]^T be the N mixture signals, S = [s_1(n), ..., s_M(n)]^T the M source signals, and A = [a_1, a_2, ..., a_N]^T the N×M mixing matrix with mixing gains a_i = (a_i1, a_i2, ..., a_iM). The mixture signals are then obtained by X = AS. This corresponds to the instantaneous mixing model (the mixing coefficients are simple scalars).
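A small NumPy sketch of the instantaneous model X = AS, with made-up gains, is given below; it also shows the determined case, where the sources are recovered by inverting the mixing matrix as described in the following paragraphs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two sources, two mixtures: the determined, instantaneous case.
S = rng.standard_normal((2, 1000))            # source signals s_1(n), s_2(n)
A = np.array([[1.0, 0.6],                     # made-up mixing gains a_ij
              [0.4, 1.0]])
X = A @ S                                     # instantaneous mixing: X = A S

# If A is known (or has been estimated), the sources follow directly:
S_hat = np.linalg.inv(A) @ X                  # determined case: S = A^-1 X
print(np.allclose(S_hat, S))                  # True

# In the underdetermined case (more sources than mixtures) A is not square,
# no inverse exists, and priors or time-frequency masking are needed instead.
```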
The more general convolutive mixing model considers that a filtering occurs between each source and each mixture (see Figure 14).

Figure 14: Convolutive mixing model

In this case, if the filters are represented as N×M finite impulse response filters with impulse responses h_ij(t), the mixing matrix is given by A = [h_1(t), h_2(t), ..., h_N(t)]^T with h_i(t) = [h_i1(t), ..., h_iM(t)], and the mixing model corresponds to X = A * S.

A wide variety of approaches exist to estimate the mixing matrix, relying on techniques such as independent component analysis, sparse decompositions or clustering approaches (Virtanen, 2006). In the determined case, it is straightforward to obtain the individual sources once the mixing matrix is known: S = A^-1 X. The underdetermined case is much harder, since it is an ill-posed problem with an infinite number of solutions. Again, a large variety of strategies exists to recover the sources, including heuristic methods, minimisation criteria on the error ||X − AS||^2, or time-frequency masking approaches. One popular approach, termed adaptive Wiener filtering, exploits soft time-frequency masking. Music signal separation is a particularly difficult example of USS of convolutive mixtures (many concurrent instruments, possibly mixed down with different reverberation settings, many simultaneous musical notes and, in general, a recording limited to, at best, 2 channels). The problem is often tackled by integrating prior information on the different source signals. For music signals, different kinds of prior information have been used, including timbre models (Ozerov, 2007), harmonicity of the sources (Vincent et al. 2010), and temporal continuity or sparsity constraints (Virtanen, 2007). In some cases, by analogy with speech signal separation, it is possible to exploit production models; see (Durrieu et al. 2010). Concerning evaluation, the domain of source separation of audio signals is now quite mature, and regular evaluation campaigns exist along with widely used evaluation protocols (Vincent et al. 2005).

As examples of multimodal capture set-ups, we can mention the dance-class scenario considered in the 3DLife/Huawei ACM MM Grand Challenge (http://perso.telecomparistech.fr/~essid/3dlife-gc-11/), which uses a set of microphones, inertial sensors, cameras and Kinects, and the IDIAP Smart Meeting Room (Moore, 2002), consisting of a set of microphones and cameras.

Available tools from the literature and REVERIE partners

Figure 15: Kinect depth maps and OpenNI skeleton tracking. (a) Kinect depth maps; (b) skeleton tracking from Kinect depth data using the OpenNI library.

Microsoft Kinect RGB+Depth sensor
Description: The Kinect sensor features (1) a regular RGB camera and (2) a depth scanner, consisting of a stereo pair of an IR projector and an IR camera (monochrome CMOS sensor), with a baseline of approximately 7.5 cm. The depth values are inferred by measuring the disparity between the received IR pattern and the emitted one, which is a fixed pattern of light and dark speckles. The Kinect driver outputs an Nx × Ny = 640 × 480 depth grid with a precision of 11 bits at 30 frames/sec. The RGB image is provided at the same resolution and frame rate as the depth data. According to the Kinect-ROS (Robot Operating System) wiki, the RGB camera's Field of View (FOV) is approximately 58 degrees, while the depth camera's FOV is approximately 63 degrees.
Microsoft officially specifies that the Kinect's depth range is 1.2-3.5 meters, but it can be experimentally verified that the minimum distance can be as low as 0.5 meters and the maximum distance can reach 4 meters. Essentially, the IR projector and camera constitute a stereo pair, hence the expected precision of the Kinect's depth measurements is proportional to the square of the actual depth. The experimental data presented in the ROS wiki confirm this precision model, showing a precision of approximately 3 mm at a distance of 1 meter and 10 mm at 2 meters. The accuracy of a calibrated Kinect sensor can be very high, of the order of ±1 mm. The Kinect sensor will be useful mainly in Task 4.3 (Capturing and Reconstruction) and may be used in Task 4.4 (Multimodal User Activity Analysis) for capturing human body motion (exploiting the OpenNI API, see below).
http://www.ros.org/wiki/kinect_calibration/technical
http://www.ros.org/wiki/openni_kinect/kinect_accuracy
http://www.xbox.com/kinect
Dependencies on other technology: PrimeSense Kinect driver; OpenNI API or the Microsoft Kinect SDK (MS Windows 7 or newer).
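The quadratic precision model quoted above can be made concrete with a short sketch (illustrative only; the 3 mm reference value at 1 m is taken from the figures cited above, the remaining values are extrapolations of that model).

```python
def kinect_depth_precision(z_m, sigma_at_1m_mm=3.0):
    """Expected random error of a disparity-based depth sensor at range z,
    using the quadratic model sigma(z) ~ sigma_1 * z^2 (z in meters)."""
    return sigma_at_1m_mm * z_m ** 2

for z in (0.5, 1.0, 2.0, 3.5):
    print(f"depth {z:.1f} m -> ~{kinect_depth_precision(z):.1f} mm expected precision")
# With a 3 mm reference at 1 m this predicts ~12 mm at 2 m, the same order of
# magnitude as the ~10 mm reported in the ROS wiki measurements cited above.
```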
Nvidia's Compute Unified Device Architecture (CUDA) SDK
Description: CUDA™ is a parallel computing programming model that enables parallel computations on modern GPUs (Nvidia's CUDA-enabled GPUs) without the need for mapping them to graphics APIs. The use of the CUDA SDK can dramatically increase computing performance, which is needed in Task 4.3 (Capturing and Reconstruction) to realize real-time reconstructions. http://www.nvidia.com/object/cuda_home_new.html
Dependencies on other technology: Nvidia's CUDA-enabled GPUs.

ROS Kinect
Description: The ROS Kinect open-source project focuses on the integration of the Microsoft Kinect sensor with ROS. The Kinect ROS stack contains some packages/components that may be useful in the 3D reconstruction task from Kinect data. http://www.ros.org/wiki/kinect
Dependencies on other technology: OpenNI Kinect drivers and API, PCL, OpenCV

Bundler: Structure from Motion (SfM) for Unordered Image Collections
Description: Bundler is an SfM system for unordered image collections (for instance, images from the Internet), written in C and C++. Bundler takes a set of images, image features, and image matches as input, and produces a 3D reconstruction of the camera and (sparse) scene geometry as output. Bundler has been successfully run on many Internet photo collections, as well as on more structured collections. http://phototour.cs.washington.edu/bundler/
Dependencies on other technology: Sparse Bundle Adjustment (SBA) package of Lourakis and Argyros

Generic SBA C/C++
Description: A C/C++ package for generic SBA, distributed under the GPL. Bundle Adjustment (BA) is almost invariably used as the last step of every feature-based multiple-view reconstruction vision algorithm to obtain optimal 3D structure and motion (i.e. camera matrix) parameter estimates. Provided with initial estimates, BA simultaneously refines motion and structure by minimizing the reprojection error between the observed and predicted image points. The minimization is typically carried out with the aid of the Levenberg-Marquardt (LM) algorithm. http://www.ics.forth.gr/~lourakis/sba/
Dependencies on other technology: SBA relies on LAPACK (http://www.netlib.org/lapack/) for all linear algebra operations arising in the course of the LM algorithm.

Patch-based Multi-View Stereo (PMVS) Software
Description: PMVS is multi-view stereo software that takes a set of images and camera parameters and reconstructs the 3D structure of an object or a scene visible in the images. Only rigid structure is reconstructed; in other words, the software automatically ignores non-rigid objects such as pedestrians in front of a building. The software outputs a set of oriented points instead of a polygonal (or mesh) model, where both the 3D coordinate and the surface normal are estimated at each oriented point. http://grail.cs.washington.edu/software/pmvs/

Multicore BA
Description: This project considers the design and implementation of new inexact Newton-type BA algorithms that exploit hardware parallelism for efficiently solving large-scale 3D scene reconstruction problems. This approach overcomes severe memory and bandwidth limitations of current-generation GPUs and leads to more space-efficient algorithms and to surprising savings in processing time. http://grail.cs.washington.edu/projects/mcba/
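To make the objective that these BA packages minimize concrete, the sketch below computes the squared reprojection error of one 3D point under a simple pinhole camera; a bundle adjuster sums this residual over all points and cameras and minimizes it with Levenberg-Marquardt. The camera parametrization chosen here is illustrative only and not the internal representation of any of the packages above.

// Squared reprojection error of a single 3D point in a single pinhole camera.
// A bundle adjuster sums this residual over all observations and minimizes it
// (typically with Levenberg-Marquardt). The parametrization is illustrative.
#include <Eigen/Dense>

struct PinholeCamera {
    Eigen::Matrix3d R;      // world-to-camera rotation
    Eigen::Vector3d t;      // world-to-camera translation
    double fx, fy, cx, cy;  // intrinsics: focal lengths and principal point
};

double squaredReprojectionError(const PinholeCamera& cam,
                                const Eigen::Vector3d& X,      // 3D point (world frame)
                                const Eigen::Vector2d& x_obs)  // observed pixel position
{
    const Eigen::Vector3d Xc = cam.R * X + cam.t;               // camera coordinates
    const Eigen::Vector2d x_pred(cam.fx * Xc.x() / Xc.z() + cam.cx,
                                 cam.fy * Xc.y() / Xc.z() + cam.cy);
    return (x_pred - x_obs).squaredNorm();                      // ||predicted - observed||^2
}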
Vicon
Description: Motion capture, or mocap, refers to the process of recording movement and translating that movement onto a digital model. It is used in military, entertainment, sports and medical applications, and for the validation of computer vision. The Vicon infra-red motion capture system is the industry gold standard for passive-tag motion capture. It is a semi-automated optical motion capture system that tracks the 3D position of infra-red reflective markers in 3D space using specialized camera hardware. Each of the marker tags can be tracked with a high degree of accuracy. A performer wears markers near each joint so that the motion can be identified from the positions of, or the angles between, the markers. The motion of the subject can then be reconstructed to a high degree of accuracy using these tracked markers. http://www.vicon.com

Codamotion
Description: A technology similar to Vicon, but one that uses active sensors, which are powered to emit their own light. The power to each marker can be provided sequentially, in phase with the capture system, providing a unique identification of each marker for a given capture frame, at a cost to the resultant frame rate. http://www.codamotion.com

Xsens MVN - Inertial Motion Capture
Description: Xsens MVN performs motion capture using inertial sensors attached to the body by a lycra suit. This approach gives the performer freedom of movement because MVN uses no cameras. It is a flexible and portable motion capture system that can be used indoors and outdoors. Xsens MVN requires minimal clean-up of captured data, as there is no occlusion or marker swapping. http://www.xsens.com

Xsens MVN – BIOMECH
Description: MVN BIOMECH is an ambulatory, full-body, 3D human kinematics, camera-less measurement system. It is based on inertial sensors, biomechanical models and sensor fusion algorithms. MVN BIOMECH can be used indoors and outdoors, regardless of lighting conditions. The results of MVN BIOMECH trials require minimal post-processing, as there is no occlusion and there are no lost markers. MVN BIOMECH is used in many applications, such as biomechanics research, gait analysis, human factors and sports science.

Kinect SDK Dynamic Time Warping Gesture Recognition
Description: An open-source project allowing developers to include fast, reliable and highly customisable gesture recognition in Microsoft Kinect SDK C# projects. This project is currently in setup mode and only available to project coordinators and developers. http://kinectdtw.codeplex.com/
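The project above is C#-based, but the underlying dynamic time warping idea is simple to illustrate. The sketch below (in C++, for consistency with the other examples in this section) computes the classic DTW distance between two gesture trajectories represented as sequences of feature vectors, e.g. joint positions from a skeleton tracker; the feature representation is an assumption for illustration.

// Dynamic time warping distance between two gesture sequences, each a series
// of feature vectors (e.g. 3D joint positions per frame). A recorded template
// gesture is matched against a live sequence; the smaller the DTW distance,
// the better the match. Plain O(n*m) dynamic programming.
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>

using Frame = std::vector<double>;          // one feature vector per frame

static double frameDistance(const Frame& a, const Frame& b)
{
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(d);
}

double dtwDistance(const std::vector<Frame>& s, const std::vector<Frame>& t)
{
    const std::size_t n = s.size(), m = t.size();
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> D(n + 1, std::vector<double>(m + 1, INF));
    D[0][0] = 0.0;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
            D[i][j] = frameDistance(s[i - 1], t[j - 1]) +
                      std::min({D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]});
    return D[n][m];                         // lower values mean a better alignment
}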
Recovering Articulated Pose of 3D Point Clouds (Fechteler and Eisert, 2011)
Description: An efficient optimization method to determine the 3D pose of a human from a point cloud. The well-known ICP algorithm is adapted to fit a generic articulated template to the 3D points. Each iteration jointly refines the parameters for rigid alignment and uniform scale, as well as all joint angles. Experimental results demonstrate the effectiveness of this computationally efficient approach.
Figure 16 (from left to right): Input scan, template model, both meshes superimposed on each other initially and after pose adaptation.

Alcatel-Lucent Background Removal Software
Description: The proprietary Alcatel-Lucent background removal software uses 2D video input. It uses a non-disclosed set of algorithms to efficiently and qualitatively remove the background from the user's moving image in the foreground of a 2D video stream. http://www.alcatel-lucent.com

Digital Y-US1 Yamaha Piano
Description: Digital Y-US1 Yamaha Piano with the Mark III Disklavier system, which enables the simultaneous recording of MIDI-type information.

Large anechoic room
Description: The large anechoic room (125 m3) has a reverberation time lower than 30 msec at 125 Hz. Mounted on trapezes with high-quality material cones, this anechoic room possesses excellent anechoic characteristics which permit high-quality recordings and 3D audio rendering experiments.

Audio recording studio
Description: The recording studio allows for 16 tracks of professional-quality recordings. It is fully equipped for high-quality sound recording using multiple sensors (software, mixing tables, directional and omnidirectional microphones, KEMAR® headset and/or cameras).

Loudspeakers for 3D audio
Description: 3D audio system with a large number of dedicated loudspeakers, including a set of 12 passive TANNOY® loudspeakers for realistic 3D sound rendering.

1.3. Performance Capture
A lot of effort has been made by the computer graphics community to find efficient and accurate methods to reconstruct and capture dynamic 3D objects. The methods range from global performance capture to more constrained facial performance capture. The branches of industry most interested in this kind of method are the movie and video game studios. This kind of application needs very high quality synthesis that is editable and coherent over time. Real-time operation is not the main target of these methods; however, they give interesting insight into what can be done with typical sensors, such as cameras and structured lighting, and a lot of processing power. From a fairly global point of view, some research focuses on reconstructing a consistent 4D surface (3D space and time) from time-varying point clouds (Wand et al., 2007). Others try to estimate a consistent surface directly with a smooth or simplified template; in this category, some use a skeleton-based meshed template and others a fully general meshed template.
Figure 17: Side-by-side comparison of input and reconstruction of a dancing girl wearing a skirt.
Figure 18: Input scans (left) and the reconstruction result (right).
Figure 19: 1st image: the articulated template model. 2nd image: using the estimated surface of the previous frame, the pose of the skeleton (1st image) is optimized such that the deformed surface fits the image data. 3rd image: since skeleton-based pose estimation is not able to capture garment motion, the surface is refined to fit the silhouettes. 4th image: the final reconstruction with the fitted skeleton and the refined surface.
Figure 20: From left to right, the system starts with a stream of silhouette videos and a rigged template mesh. At every frame, it fits the skeleton to the visual hull, deforms the template using Linear Blend Skinning and adjusts the deformed template to fit the silhouettes. The user can then easily edit the geometry or texture of the entire motion.
Some methods address the full task of reconstructing a consistent surface without any prior, i.e. without a template (Wand et al., 2009), or with the only prior that the observed surface comes from an articulated object.
Some results are shown in the following figures.
Figure 21: Left: input range scans. Right: poseable, articulated 3D model. The articulated global registration algorithm can automatically reconstruct articulated, poseable models from a sequence of single-view dynamic range scans.
Figure 22: Results from (Wand et al., 2009). Top: input range scans. Middle: reconstructed geometry. Bottom: the hand is mapped to show the consistency of the reconstructed surface.
Vlasic et al. (2009) estimate the normals along with the position of the surface by using an active lighting setup that permits the extraction of normal information. While providing very vivid geometry (with a lot of detail, thanks to the normal estimation), they do not output a consistent geometry throughout the entire sequence, since each mesh is computed independently at each time step. Results are shown in Figure 23.
Figure 23: Results from (Vlasic et al., 2009). Top and middle: photometric inputs, normal maps and reconstructed geometry. Bottom: the acquisition setup consists of 1200 individually controllable light sources; eight cameras are placed around the setup, aimed at the performance area.
These methods also differ by their input data type: 3D point clouds for (Wand et al., 2009, 2007), and multi-view cameras (with a strong use of the visual hull) for (Vlasic et al., 2009). If we look at more constrained setups, such as those used for facial performance capture, we end up with ultra-high-quality reconstructions. Some research (Wilson et al., 2010) tries to tackle one of the caveats of (Vlasic et al., 2009), namely the high video/photometric throughput necessary to perform the normal estimation. Others rely on passive multi-view stereo only, by tracking the very fine details of the skin such as pores (which give a tremendous amount of information about how the surface of the face deforms). Some results are shown in the following figures.
Figure 24: Results from (Wilson et al., 2010). Smiling sequence captured under extended spherical gradient illumination conditions (top row), synthesized intermediate photometric normals (center row), and high-resolution geometry (bottom row) of a facial performance as reconstructed with the same temporal resolution as the data capture.
Figure 25: Top row: one reference frame, the reconstructed geometry (1 million polygons), final textured result. Bottom row: two different frames from the final textured result, acquisition setup.
Some methods use a learning stage to extrapolate high-dimensional synthesis from low-dimensional capture. This involves the use of a sparse set of markers on the face of an actor, to learn how the skin folds and deforms according to the displacement of the markers, in order to synthesize the deformation of a template densely in time and space. In this category we find Weise et al. (2011), who compute the most probable actor face pose from previously computed ones in the blendshape space (which is a low-dimensional representation of facial expressions). A learning stage permits the learning of a probability model (the prior) in the blendshape space, which enables tracking a full expression sequence in real time from very noisy on-line input (dynamic 3D scans from a Kinect device). Results are shown in the following figures.
Figure 26: New high-resolution geometry and surface detail are synthesized from sparse motion capture markers using deformation-driven polynomial displacement maps. Top row, from left to right: motion capture markers, deformed neutral mesh, deformed neutral mesh with added medium-frequency displacements. Bottom row, from left to right: deformed neutral mesh with added medium- and high-frequency displacements, ground-truth geometry.
Figure 27: Results from (Weise et al., 2011). For each row, from left to right: input color image (from a Kinect sensor), input depth map (from a Kinect sensor), tracked facial expression, retargeting on a virtual avatar (thanks to the convenient blendshape framework).
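As a concrete illustration of the blendshape representation used by Weise et al. (2011), the sketch below evaluates a face mesh as the neutral shape plus a weighted sum of blendshape offsets; the tracker's job is essentially to estimate the weight vector for every frame. The data layout is an assumption for illustration, not the representation used in the cited work.

// Evaluate a blendshape face model: neutral mesh plus a weighted sum of
// per-blendshape vertex offsets. Facial tracking in the blendshape space
// amounts to estimating the weight vector w for every frame.
#include <vector>
#include <array>

using Vertex = std::array<float, 3>;
using Mesh   = std::vector<Vertex>;          // fixed topology, vertex positions only

Mesh evaluateBlendshapes(const Mesh& neutral,
                         const std::vector<Mesh>& blendshapeOffsets, // offsets from neutral
                         const std::vector<float>& w)                // one weight per blendshape
{
    Mesh result = neutral;
    for (std::size_t b = 0; b < blendshapeOffsets.size(); ++b)
        for (std::size_t v = 0; v < result.size(); ++v)
            for (int c = 0; c < 3; ++c)
                result[v][c] += w[b] * blendshapeOffsets[b][v][c];
    return result;   // the facial expression corresponding to the weights w
}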
Usefulness to the project
Performance capture is becoming a standard pipeline for high-quality movie SFX production with animated characters. Developing a specific methodology approximating the result of such offline procedures will provide a high-end version of the REVERIE framework, for users running REVERIE solutions on powerful computers with a high-quality capture setup.

Available tools from the literature and REVERIE partners

Dense 3D Motion Capture for Human Faces (Furukawa and Ponce, 2009b)
Description: A novel approach to motion capture from multiple, synchronized video streams, specifically aimed at recording dense and accurate models of the structure and motion of highly deformable surfaces such as skin, which stretches, shrinks and shears in the midst of normal facial expressions. Solving this problem is a key step toward effective performance capture for the entertainment industry, but progress so far has been hampered by the lack of appropriate local motion and smoothness models. This work contributes a novel approach to regularization adapted to non-rigid tangential deformations. Concretely, the non-rigid deformation parameters are estimated at each vertex of a surface mesh, smoothed over a local neighborhood for robustness, and used to regularize the tangential motion estimation. To demonstrate the power of the proposed approach, the performance of the algorithm was tested on three extremely challenging face datasets that include highly non-rigid skin deformations, wrinkles, and quickly changing expressions. Additional experiments with a dataset featuring fast-moving cloth with complex and evolving fold structures demonstrate that the adaptability of the proposed regularization scheme to non-rigid tangential motion does not hamper its robustness, since it successfully recovers the shape and motion of the cloth without overfitting, despite the absence of stretch or shear in this case.
Figure 28 (from left to right): Sample input image, reconstructed mesh model, estimated motion, and a texture-mapped model for one frame with interesting structure/motion. The right two columns show the results for another interesting frame.
As performance capture is still a rather novel and very active research field, there is almost no software resource available and specific development will be required. Still, the Microsoft Kinect SDK can be used as a basic tool chain to perform capture from multiple Kinects.

1.4. Activity recognition
Activity recognition aims to recognize the actions and goals of one or more agents from a series of observations of the agents' actions and the environmental conditions.
This is a very challenging problem for traditional optical approaches, which attempt to track and understand the behaviour of agents in videos using computer vision. Action recognition is a very active research topic in computer vision with many important applications, including user interface design, robot learning and video surveillance, among others. In this work we specifically address techniques and applications for human-computer interaction. Historically, action recognition in computer vision has been sub-divided into topics such as gesture recognition, facial expression recognition and movement behaviour recognition. Typically, generic activity recognition algorithms involve the extraction of some features, the segmentation of possible actions from a sequence, and the classification of these actions from the extracted features using machine-learning techniques. A thorough overview of computer vision techniques for activity recognition is beyond the scope of this document; interested readers are directed to three recent publications (Aggarwal and Ryoo, 2011; Weinland et al. 2011; Poppe, 2010), which together provide a thorough overview and survey of the state-of-the-art research into human body motion activity recognition using computer vision techniques, both in terms of feature extraction and classification techniques. In these papers, the authors tend to focus on determining the actions and activities of a single human subject's body, whose motion appears in the FOV of a single standard visual-spectrum camera. Although most of the existing work on human body activity recognition involves the use of computer vision, gesture recognition has recently been investigated using non-visual sensors. One approach spots the temporal locations of sporadically occurring gestures in a continuous data stream from body-worn inertial sensors; the spotted gestures are subsequently classified using Hidden Markov Models (HMMs). In (Benbasat and Paradiso, 2001) an additional step is added, where data from six-axis IMUs are first categorized on an axis-by-axis basis as simple motions (straight line, twist, etc.) with magnitude and duration. These simple gesture categories are then combined, both concurrently and consecutively, to create specific composite gestures, which can then be set to trigger output routines. The work of Ward et al. (2005) extends the use of solely inertial data by combining a wrist-worn 3-axis accelerometer with a wrist-worn microphone for continuous activity recognition. In this scenario, characteristic movements and sounds were used to classify hand actions in a workshop scenario, such as sawing and drilling.
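Several of the works above classify a segmented gesture by evaluating its likelihood under a set of trained HMMs and selecting the best-scoring model. The sketch below shows the standard forward-algorithm likelihood computation for one discrete-observation HMM; it is a generic textbook formulation, not the specific models used in the cited papers.

// Forward algorithm: likelihood of an observation sequence under a discrete
// HMM. Gesture classification can evaluate this for each trained model and
// pick the model with the highest likelihood. Generic textbook formulation.
#include <vector>

struct HMM {
    std::vector<double> pi;                       // initial state probabilities [N]
    std::vector<std::vector<double>> A;           // state transition probabilities [N][N]
    std::vector<std::vector<double>> B;           // emission probabilities [N][K]
};

double sequenceLikelihood(const HMM& hmm, const std::vector<int>& obs)
{
    const std::size_t N = hmm.pi.size();
    std::vector<double> alpha(N), next(N);
    for (std::size_t i = 0; i < N; ++i)            // initialization with the first symbol
        alpha[i] = hmm.pi[i] * hmm.B[i][obs[0]];
    for (std::size_t t = 1; t < obs.size(); ++t) { // induction over the sequence
        for (std::size_t j = 0; j < N; ++j) {
            double s = 0.0;
            for (std::size_t i = 0; i < N; ++i)
                s += alpha[i] * hmm.A[i][j];
            next[j] = s * hmm.B[j][obs[t]];
        }
        alpha = next;
    }
    double likelihood = 0.0;                      // termination: sum over final states
    for (double a : alpha) likelihood += a;
    return likelihood;                            // (log-scaling omitted for brevity)
}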
Until now, this review has focused on human body motion activity recognition. Prior literature has also investigated analyzing the motion of multiple users for activity recognition. This work has particular application in sport, where computer vision techniques are applied to analyze complex and dynamic sport scenes for the purpose of team activity recognition, for use in applications such as tactic analysis and team statistic evaluation (useful for coaches and trainers), video annotation and browsing, automatic highlight identification, automatic camera control (useful for broadcasting), etc. For group activity recognition, a specific group action may involve multiple people performing significantly different actions individually. This makes it difficult to find effective descriptors for group actions. Some of the techniques in this area described in the literature are of interest to researchers in REVERIE, as they make use of multi-camera systems to overcome the limitations of single moving or static camera systems, such as occlusions or inaccurate subject localization. Interesting papers in this area include the work of Intille and Bobick (2001), who propose Bayesian belief networks for probabilistically representing and recognizing agent goals from noisy trajectory data of American football players. Blunsden et al. (2006) present both global and individual model approaches for classifying coordinated group activities in handball games. Perse et al. (2010) and Hervieu et al. (2009) propose trajectory-based approaches to the automatic recognition of multi-player activity behaviour in basketball games and team activities in handball games, respectively. Finally in this section, we review state-of-the-art approaches for the automatic recognition of facial expressions and actions using computer-vision-based approaches. For human beings, facial expression is one of the most powerful and natural ways to communicate emotions and intentions. As such, it is important to review such approaches within REVERIE. Although a human being can detect facial expressions without effort, automatic facial expression recognition using computer vision techniques is a challenging problem. There are three sub-problems in designing an automatic facial expression recognition system: face detection, extraction of the facial expression information, and classification of the expression. A system that performs these operations accurately and in real time would be crucial to achieving human-like interaction between man and machine. Many techniques adopt the Facial Action Coding System (FACS) to encode expressions, which are then used as features for expression classification. Using the FACS system, facial expressions can be decomposed in terms of 46 component movements, which roughly correspond to the individual facial muscles. (Bartlett et al. 2006) and (Braathen et al. 2002) present machine learning approaches, using support vector machines, to automatically assign FACS action unit intensities to input faces. Although most research focuses on the extraction of features from visual-spectrum cameras, Busso et al. (2004) discuss an alternative approach where multi-modal data including speech is employed and four main emotions are determined (anger, sadness, happiness, neutral). Finally, a review of works in this area is presented in (Patil et al. 2010).

Usefulness to the project
Activity recognition is integral to two distinct aspects of the REVERIE framework. Firstly, determining human activity from full-body motion in a virtual environment: as opposed to accurately rendering the volumetric representation of a subject, accurately determining the human activity behind the motion is essential in order to correctly interpret a user's movements, desires and requirements in the system. Determining human activity and requirements significantly increases the ability of characters to interact efficiently with each other and with the environment surrounding them. In addition, this data can be fed into an animation module to allow each character in the environment to be animated in a realistic manner.
Secondly, this area also covers extracting and determining the affect, emotion and interest of users from their face and body motion. This characterization of users' emotional states and expressive intents is critical for WP5, as this information will be required as an essential input for the artificial intelligence (AI) modules of autonomous avatars. As an input to other modules in the framework, it will also allow a significant increase in the accuracy of the rendering of virtual characters' facial features, so that the determined emotional states can be efficiently rendered on virtual user avatars.

1.5. 3D reconstruction
The algorithms for 3D reconstruction can be classified according to the following diagrams. In the first classification approach, see Figure 29(a), existing 3D reconstruction methodologies are classified based on the target application. The target application specifies the requirements on reconstruction accuracy and time. For example, the reconstruction of cultural-heritage objects requires very high accuracy. In REVERIE Task 4.3 (Capturing and Reconstruction), the 3D reconstruction of moving humans and foreground objects for tele-immersive applications demands real-time processing. In a second categorization approach, see Figure 29(b), the 3D reconstruction methodologies can be classified based on the employed sensing equipment. This can range from inexpensive passive sensors, such as multiple RGB cameras, to more expensive active devices (laser scanners, TOF cameras, range scanners, etc.). In Task 4.3 (Capturing and Reconstruction), multiple RGB cameras, as well as inexpensive active sensors such as Kinect sensors, will be employed. Thirdly, see Figure 29(c), reconstruction methods can be classified with respect to the resulting reconstruction type into a) volumetric and b) surface-based approaches. Volumetric methods (Curless and Levoy, 1996; Kutulakos and Seitz, 2000; Matsuyama et al. 2004) are based on a discretization (voxelization) of the 3D space. They are robust, but they are either prone to aliasing artifacts (with low-resolution discretizations) or require increased computational effort (with high-resolution discretizations). On the other hand, surface-based methods (Turk and Levoy, 1994; Matusik et al. 2001; Franco and Boyer, 2009) explicitly calculate the surfaces from the given data. A more detailed classification scheme of reconstruction algorithms is given in Figure 29(d). Multi-view based 3D reconstruction approaches are classified into a) methods that are not based on optimization (Kutulakos and Seitz, 2000; Matsuyama et al. 2004; Matusik et al. 2001; Franco and Boyer, 2009), and b) optimization-based methods (Boykov and Kolmogorov, 2003; Paris et al. 2006; Zeng et al. 2007; Pons et al. 2007; Vogiatzis et al. 2007; Kolev et al. 2009), i.e. methods that minimize a certain cost function that incorporates photo-consistency and smoothness constraints. In the former category, i) silhouette-based approaches extract the foreground objects' silhouette in each image and construct the objects' "visual hull" (the intersection of the silhouettes' visual cones) (Matsuyama et al. 2004; Matusik et al. 2001; Franco and Boyer, 2009). They are simple, fast and robust, but they lack reconstruction accuracy (they are not able to reconstruct fine details and especially concavities). In another subcategory, ii) voxel-coloring or space-carving techniques (Kutulakos and Seitz, 2000; Yang et al. 2003) recover the objects' "photo hull" (the shape that contains all possible photo-consistent reconstructions) by sequentially eroding photo-inconsistent voxels in a plane-sweeping framework. Space-carving techniques are generally more accurate than silhouette-based approaches, but are slower and less robust.
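A minimal sketch of the volumetric silhouette-intersection idea behind visual-hull methods is given below: every voxel of a bounding volume is projected into each camera's silhouette image and is kept only if all projections fall on the foreground. The camera model and silhouette layout are simplified placeholder assumptions; real systems (e.g. Matsuyama et al. 2004) use a more efficient plane-based intersection and GPU implementations.

// Voxel-based visual hull: a voxel survives only if its projection lies inside
// the foreground silhouette of every camera. Simplified illustrative sketch.
#include <vector>
#include <cstdint>

struct Silhouette {                       // binary foreground mask for one camera
    int width, height;
    std::vector<std::uint8_t> mask;       // 1 = foreground, row-major
    bool isForeground(int u, int v) const {
        if (u < 0 || v < 0 || u >= width || v >= height) return false;
        return mask[v * width + u] != 0;
    }
};

struct Camera {                           // simple pinhole model (world -> pixel)
    float R[3][3]; float t[3];            // world-to-camera rotation and translation
    float fx, fy, cx, cy;                 // intrinsics
    bool project(float x, float y, float z, int& u, int& v) const {
        const float xc = R[0][0]*x + R[0][1]*y + R[0][2]*z + t[0];
        const float yc = R[1][0]*x + R[1][1]*y + R[1][2]*z + t[1];
        const float zc = R[2][0]*x + R[2][1]*y + R[2][2]*z + t[2];
        if (zc <= 0.f) return false;      // point behind the camera
        u = static_cast<int>(fx * xc / zc + cx);
        v = static_cast<int>(fy * yc / zc + cy);
        return true;
    }
};

std::vector<std::uint8_t> carveVisualHull(const std::vector<Camera>& cams,
                                          const std::vector<Silhouette>& sils,
                                          int res, float voxelSize, const float origin[3])
{
    std::vector<std::uint8_t> occupied(static_cast<std::size_t>(res) * res * res, 1);
    for (int k = 0; k < res; ++k)
        for (int j = 0; j < res; ++j)
            for (int i = 0; i < res; ++i) {
                const float x = origin[0] + (i + 0.5f) * voxelSize;
                const float y = origin[1] + (j + 0.5f) * voxelSize;
                const float z = origin[2] + (k + 0.5f) * voxelSize;
                for (std::size_t c = 0; c < cams.size(); ++c) {
                    int u, v;
                    if (!cams[c].project(x, y, z, u, v) || !sils[c].isForeground(u, v)) {
                        occupied[(static_cast<std::size_t>(k) * res + j) * res + i] = 0;
                        break;            // outside one visual cone: carve the voxel
                    }
                }
            }
    return occupied;   // Marching Cubes can then extract a surface from this grid
}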
Figure 29: Classification of 3D reconstruction techniques: (a) based on the target application; (b) based on the sensing technology; (c) based on the resulting reconstruction type; (d) based on the algorithmic approach.
In the second category, optimization-based approaches introduce an objective function to be minimized/maximized, which (apart from photo-consistency) incorporates additional constraints, mainly smoothness constraints on the reconstructed surface. Various mathematical tools have been employed for optimization, such as i) active contour models (e.g. snakes and level-sets (Pons et al. 2007; Jin et al. 2005)) and ii) graph-cut methods (Boykov and Kolmogorov, 2003; Paris et al. 2006; Vogiatzis et al. 2007). Furthermore, these methods can be classified based on whether they apply i) global (Boykov and Kolmogorov, 2003; Vogiatzis et al. 2007; Tran and Davis, 2006) or ii) local optimization (Zeng et al. 2007; Furukawa and Ponce, 2009a). Optimization-based approaches can produce highly accurate reconstructions. However, they are unsuitable for real-time reconstruction applications, such as tele-immersion, because the required computation time is very high, ranging from several minutes to hours. In another category, reconstruction is achieved by the fusion of dense depth maps, which are produced either by i) active direct-ranging sensors (Curless and Levoy, 1996; Turk and Levoy, 1994; Soucy and Laurendeau, 1995) or ii) by passive stereo camera pairs (Kanade et al. 1999; Mulligan and Daniilidis, 2001; Merrell et al. 2007; Vasudevan et al. 2011). Most of the methods in (i) present relatively high accuracy, but they work off-line to combine range data generated at different time instances. In (ii), some methods achieve fast, near real-time reconstruction: Kanade et al. (1999), using a large number of distributed cameras, achieved very fast, but not real-time, full-body reconstruction (less than a single frame per second). Mulligan and Daniilidis (2001) used multiple camera triplets for image-based reconstructions of the upper human body, in one of the first tele-immersion systems. In a more recent work (Merrell et al. 2007), multiple depth maps are computed in real-time from a set of images captured by moving cameras, and a viewpoint-based approach is used for quick fusion of the stereo depth maps. In a state-of-the-art work (Vasudevan et al. 2011), a method for the creation of highly accurate textured meshes from a stereo pair is presented. Exploiting multiple stereo pairs, multiple meshes are generated in real-time, which are combined to synthesize high-quality intermediate views for given viewpoints. However, the authors in (Vasudevan et al. 2011) do not address the way the separate meshes could be merged to produce a single complete 3D mesh, rather than intermediate views.
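Whatever the sensor, depth-map fusion starts by lifting each depth map into a common 3D frame. The sketch below back-projects a depth image into a 3D point cloud with a pinhole model; the intrinsics and the depth encoding (metres in a row-major float image) are illustrative assumptions rather than the conventions of any particular system above.

// Back-project a depth map into a 3D point cloud using a pinhole camera model;
// this is the first step of any depth-map fusion pipeline.
#include <vector>
#include <array>

struct Intrinsics { float fx, fy, cx, cy; };

std::vector<std::array<float, 3>>
backProject(const std::vector<float>& depth, int width, int height,
            const Intrinsics& K)
{
    std::vector<std::array<float, 3>> points;
    points.reserve(static_cast<std::size_t>(width) * height);
    for (int v = 0; v < height; ++v) {
        for (int u = 0; u < width; ++u) {
            const float z = depth[v * width + u];
            if (z <= 0.f) continue;                     // no measurement at this pixel
            const float x = (u - K.cx) * z / K.fx;      // pinhole back-projection
            const float y = (v - K.cy) * z / K.fy;
            points.push_back({x, y, z});                // point in the camera frame
        }
    }
    return points;   // apply the camera-to-world transform before fusing views
}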
The Microsoft Kinect sensor, released in November 2010, has attracted the attention of many researchers due to its ability to produce accurate depth maps at a low price. During 2011, many Kinect-based applications appeared on the internet, including 3D reconstruction applications. However, most Kinect-based 3D reconstruction approaches combine depth data captured by a single Kinect from multiple view-points to produce off-line reconstructions of 3D scenes. Only a few official Kinect-based works have been published so far (Zollhoefer et al. 2011; Izadi et al. 2011). In (Zollhoefer et al. 2011), personalized avatars are created from a single RGB image and the corresponding depth map, captured by a Kinect sensor. In (Izadi et al. 2011) (which is more relevant to Task 4.5), the problem of dynamic 3D indoor scene reconstruction is addressed by the fast fusion of multiple depth scans, captured by a single hand-held Kinect sensor. Currently, IT/TPT works on real-time 3D reconstruction of moving humans and foreground objects from multiple Kinects and has obtained promising results (http://www.reveriefp7.eu/resources/demos/). The relevant work is described briefly later in this document. IT/TPT aims at optimizing the approach during the project. Below, the most promising state-of-the-art approaches for fast 3D reconstruction are given.

Silhouette-based approach for 3D reconstruction and efficient texture mapping (Matsuyama et al. 2004)
Description: A fast volumetric silhouette-based approach for 3D reconstruction and efficient texture mapping. The method is described in Figure 30 and consists of the following stages: i) silhouette extraction (background subtraction), as shown in the upper row of Figure 30(a); ii) silhouette volume intersection: each silhouette is back-projected into the common 3D volumetric space and the visual cones are intersected to produce the "visual hull" volumetric data, which is accomplished using a fast plane-based intersection method (see Figure 30(b)); iii) application of the Marching Cubes algorithm (Lorensen and Cline, 1987) to convert the voxel representation to a surface representation; iv) texture mapping. The proposed intersection method ensures real-time generation of the volumetric data, but the authors state that their Marching Cubes implementation is time-consuming and that their overall method cannot run in real-time. However, IT/TPT is aware of and has already used a real-time Marching Cubes implementation in NVidia's CUDA. With this implementation, IT/TPT expects to realize very fast executions of the underlying method.
Dependencies on other technology: Multiple RGB cameras
Figure 30: Outline of the silhouette-based method from Matsuyama et al. 2004; images taken from Matsuyama et al. 2004.

Exact Polyhedral Visual Hulls (EPVHs) (Franco and Boyer, 2009)
Description: A fast and efficient silhouette-based approach for 3D reconstruction, depicted in Figure 31(a). The methodology is surface-based rather than volumetric. It computes the visual-hull surface directly, in the form of a polyhedral mesh. It relies on a small number of geometric operations to compute the visual hull polyhedron in a single pass. The algorithm combines the advantages of being fast, producing pixel-exact surfaces, and yielding manifold and watertight polyhedra. Some results are given in Figure 31(c).
These results were produced from 19 views. The algorithm is very fast, but still not real-time; the average computation time is 0.7 seconds per frame on a single 3 GHz PC with 3 GB RAM. The source code of the EPVH algorithm was freely distributed until 2007 (http://perception.inrialpes.fr/~Franco/EPVH). However, the authors established a company (4D View Solutions - http://www.4dviews.com) producing solutions that are based on the EPVH software, see Figure 31(b).
Dependencies on other technology: Multiple RGB cameras
Figure 31: EPVHs (Franco and Boyer, 2009): (a) EPVH; (b) 4D View Solutions product, using EPVH; (c) published results of EPVHs from 19 views (average computation time: 0.7 seconds per frame on a single 3 GHz PC with 3 GB RAM).

Viewpoint-based approach for quick fusion of multiple stereo depth maps (Merrell et al. 2007)
Description: A viewpoint-based approach for quick fusion of multiple stereo depth maps. Depth maps are computed in real-time from a set of multiple images, using plane-sweeping stereo. The depth maps are fused within a visibility-based framework and the methodology is applied to viewpoint rendering of scenes. The method was also tested for 3D reconstruction on the Multi-View Stereo Evaluation dataset (http://vision.middlebury.edu/mview/), presenting high accuracy, yet requiring an execution time of a few seconds per frame, even with a GPU implementation.
Dependencies on other technology: Multiple RGB cameras

3D Tele-Immersion (3DTI) system (Vasudevan et al. 2011)
Description: A high-quality full system for 3DTI. A method for the creation of highly accurate textured meshes from a stereo pair, in real-time. Exploiting multiple stereo pairs, multiple high-quality meshes are generated in real-time, which are combined to synthesize intermediate views for given viewpoints. However, the authors do not address the way that the separate meshes could be merged into a single, complete 3D mesh.
Dependencies on other technology: Multiple stereo cameras
Figure 32: Synthesis of high-quality intermediate views for given viewpoints; image taken from (Vasudevan et al. 2011). (a) The intermediate-view composition approach, given multiple triangular meshes. (b) Result of the approach.

Kinect Fusion (Izadi et al. 2011)
Description: The problem of dynamic 3D indoor scene reconstruction is addressed by the fast fusion of multiple depth scans, captured with a single hand-held Kinect sensor. It is one of the highest-quality works with the Kinect. However, it mainly addresses the problem of 3D indoor scene reconstruction, by capturing the scene from multiple view-points with a single device, and is more relevant to Task 4.5 (3D User-Generated Worlds from Video/Images). http://research.microsoft.com/en-us/projects/surfacerecon/
Dependencies on other technology: Kinect sensor

Fast silhouette-based 3D reconstruction (visual hulls)
Description: IT/TPT has experience in silhouette-based 3D reconstruction and has implemented relevant algorithms in C++, exploiting CUDA to achieve (near) real-time reconstruction.
Dependencies on other technology: Multiple cameras
Figure 33: Near real-time silhouette-based 3D reconstruction of humans.

Real-time 3D reconstruction of humans from multiple Kinects
Description: IT/TPT currently works on the real-time construction of full 3D meshes of humans and foreground objects, and has obtained promising results.
The approach is based on a) capturing RGB and depth images from multiple Kinects (30 fps with a single Kinect, ~20 fps with 4 Kinects), b) construction of separate meshes from the separate depth maps, c) alignment of the meshes, and d) fusion (zippering) of the meshes to produce a single combined mesh. http://www.reveriefp7.eu/resources/demos/
Dependencies on other technology: Multiple Kinect sensors, OpenNI API
Figure 34: Real-time 3D reconstruction from 4 Kinects

A Global Optimization Approach to High-detail Reconstruction of the Head (Schneider et al. 2011)
Description: An approach for reconstructing head-and-shoulder portraits of people from calibrated stereo images with a high level of geometric detail. In contrast to many existing systems, these reconstructions cover the full head, including hair. This is achieved using a global intensity-based optimization approach, which is stated as a parametric warp estimation problem and solved in a robust Gauss-Newton framework. A computationally efficient warp function for mesh-based estimation of depth is formulated, which is based on a well-known image-registration approach and adapted to the problem of 3D reconstruction. The use of sparse correspondence estimates for initializing the optimization is addressed, as well as a coarse-to-fine scheme for reconstructing without specific initialization. Issues of regularization and brightness-constancy violations are discussed, and various results demonstrate the effectiveness of the approach.
Figure 35: Very high detail reconstructions (rendered depth maps), computed with the coarse-to-fine scheme without an initial shape estimate. Fine structures become visible on the face as well as on clothing.

Multiple View Segmentation and Matting (Kettern et al. 2011)
Description: A robust and fully automatic method for extracting a highly detailed, transparency-preserving segmentation of a person's head from multiple-view recordings, including a background plate for each view. Trimaps containing a rough segmentation into foreground, background and unknown image regions are extracted, exploiting the visual hull of an initial foreground-background segmentation. The background plates are adapted to the lighting conditions of the recordings, and the trimaps are used to initialize a state-of-the-art matting method adapted to a setup with precise background plates available. From the alpha matte, foreground colours are inferred for realistic rendering of the recordings onto novel backgrounds.
Figure 36 (from left to right): Background, object, bi-map, tri-map, alpha matte, realistic foreground rendering of the object.

Model-Based Camera Calibration Using Analysis by Synthesis Techniques (Eisert, 2002)
Description: A new technique for the determination of extrinsic and intrinsic camera parameters. Instead of searching for a limited number of discrete feature points of the calibration test object, the entire image captured with the camera is exploited to robustly determine the unknown parameters. The shape and texture of the test object are described by a 3D computer graphics model. With this 3D representation, synthetic images are rendered and matched with the original frames in an analysis-by-synthesis loop. Therefore, arbitrary test objects with sophisticated patterns can be used to determine the camera settings. The scheme can easily be extended to deal with multiple frames for higher intrinsic parameter accuracy.
Figure 37: Two different calibration test objects (left) and visualization of the perspective projection.

Alcatel-Lucent Stereo Matching Software
Description: The proprietary Alcatel-Lucent stereo matching software uses two video cameras as input. It uses a non-disclosed set of algorithms to efficiently and qualitatively compute the depth map of the scene captured by the video cameras. http://www.alcatel-lucent.com

Telecom ParisTech CageR System
Description: The proprietary Telecom ParisTech CageR system allows the reconstruction of an animated cage from an animated mesh to provide editing, deformation transfer, compression and motion manipulation on raw performance capture data. http://www.tsi.telecom-paristech.fr/cg
Dependencies on other technologies: Telecom ParisTech space deformation tool set.

1.6. 3D User-Generated Worlds from Video/Images
The computer vision community has put a lot of effort into developing new approaches for modelling and rendering complex scenes from a collection of images/videos. While a few years ago the main applications were dedicated to robot navigation and visual inspection, potential fields of application for 3D modelling and rendering now include computer graphics, visual simulation, VR, computer games, telepresence, communication, art and cinema. Additionally, there is an increasing demand for rendering new scenes from (uncalibrated) images acquired by ordinary users with simple devices. SfM techniques are used extensively to extract the 3D structure of the scene as well as the camera motion by analyzing an image sequence. Most of the existing SfM techniques are based on the establishment of reliable correspondences between two or multiple images, which are then used to compute the corresponding 3D scene points by using a series of computer vision techniques such as camera calibration (Curless and Levoy, 1996), structure reconstruction and BA (Kutulakos and Seitz, 2000). In most cases, it is impossible to detect image correspondences by comparing every pixel of one image with every pixel of the next, because of the high combinatorial complexity and computational cost. Hence, local-scale image features are extracted from the images and matched at this scope. Feature matching techniques can be divided into two categories: narrow- and wide-baseline. Though dense short-baseline stereo matching is well understood (Triggs et al. 1999; Pollefeys et al. 1999), its wide-baseline counterpart is much more challenging, since it must handle images with large perspective distortions, extended occluded areas and lighting changes. On the other hand, it can yield more accurate depth estimates while requiring fewer images to reconstruct a complete scene. The main idea of wide-baseline two-image matching is to extract local invariant features independently from the two images, characterise them by invariant descriptors and finally build up correspondences between them. The most influential local descriptor is SIFT (Lowe, 2004), where key locations are defined as maxima and minima of the result of difference-of-Gaussian functions applied in scale-space to a series of smoothed and re-sampled images. Dominant orientations are assigned to the localized keypoints. Finally, SIFT descriptors are obtained by considering pixels within a radius of the key location, and blurring and resampling local image orientation planes. GLOH (Mikolajczyk and Schmid, 2004) is an extension of the SIFT descriptor, designed to increase its robustness and distinctiveness. RIFT (Lazebnik et al. 2004), a rotation-invariant generalization of SIFT, is constructed using circular normalized patches divided into concentric rings of equal width; within each ring a gradient orientation histogram is computed. The more recent SURF descriptor (Bay et al. 2006) is computationally efficient with respect to computing the descriptor's value at every pixel, but it introduces artefacts that degrade the matching performance when used densely. Recently, the DAISY descriptor (Tola et al. 2010) was proposed, which retains the robustness of SIFT and GLOH, while it can be computed quickly at every single image pixel, like SURF.
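To build correspondences from such descriptors, a common baseline (used, for example, in Lowe, 2004) is brute-force nearest-neighbour matching with a distance-ratio test that rejects ambiguous matches before any geometric verification. The sketch below is a generic illustration over fixed-length descriptor vectors, not a specific library's matcher; the 0.8 threshold is a typical choice rather than a prescribed value.

// Brute-force nearest-neighbour matching of feature descriptors with a
// distance-ratio test: a match is kept only if the best candidate is clearly
// better than the second best. Generic sketch over plain descriptor vectors.
#include <vector>
#include <limits>
#include <utility>

using Descriptor = std::vector<float>;

static float sqDist(const Descriptor& a, const Descriptor& b)
{
    float d = 0.f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Returns index pairs (i in A, j in B) of accepted matches.
std::vector<std::pair<int, int>>
matchDescriptors(const std::vector<Descriptor>& A,
                 const std::vector<Descriptor>& B,
                 float ratio = 0.8f)
{
    std::vector<std::pair<int, int>> matches;
    for (std::size_t i = 0; i < A.size(); ++i) {
        float best = std::numeric_limits<float>::max(), second = best;
        int bestIdx = -1;
        for (std::size_t j = 0; j < B.size(); ++j) {
            const float d = sqDist(A[i], B[j]);
            if (d < best)        { second = best; best = d; bestIdx = static_cast<int>(j); }
            else if (d < second) { second = d; }
        }
        if (bestIdx >= 0 && best < ratio * ratio * second)   // ratio test on squared distances
            matches.emplace_back(static_cast<int>(i), bestIdx);
    }
    return matches;   // remaining outliers are then removed with RANSAC-style verification
}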
In order to establish reliable correspondences between image descriptors and remove outliers, feature matching algorithms are exploited. Traditional approaches are mainly based on extensions (Philbin et al. 2007; Chum and Matas, 2008) of the RANSAC algorithm (Fischler and Bolles, 1981) or extensions (Lehmann et al. 2010) of the Hough transform (Ballard, 1981). The sets of image correspondences are exploited by SfM algorithms in order to recover the unknown 3D scene structure and retrieve the poses of the cameras that captured the images. So far, the most common data sources for large SfM problems have been video and structured survey datasets, which are carefully calibrated and make use of surveyed ground control points. Recently, there has been growing interest in using community photo collections such as Flickr, which are unstructured and weakly calibrated, since they are captured by different users. As the number of photos in such collections continues to grow into the hundreds or even thousands, the scalability of SfM algorithms becomes a critical issue. With the exception of a few efforts (Mouragnon et al. 2009; Ni et al. 2007; Agarwal et al. 2009a; Agarwal et al. 2009b; Li et al. 2008; Schaffalitzky and Zisserman, 2002; Snavely et al. 2006), the development of large-scale SfM algorithms has not yet received significant attention. Existing techniques performing SfM from unordered photo collections (Agarwal et al. 2009a; Agarwal et al. 2009b; Li et al. 2008; Schaffalitzky and Zisserman, 2002; Snavely et al. 2006) make heavy use of nonlinear optimization, which depends strongly on initialization. These methods run iteratively, starting with a small set of photos, then incrementally adding photos and refining 3D points and camera poses. Though these approaches are generally successful, they suffer from two significant disadvantages: a) they have increased computational cost, and b) the final SfM result depends on the order in which the photos are considered. This sometimes leads to failures because of mis-estimation of the camera poses or drifting into bad local minima. Recent methods have exploited clustering and graphs to minimize the number of images considered in the SfM process (Agarwal et al. 2009a; Agarwal et al. 2009b; Li et al. 2008; Bajramovic and Denzler, 2008; Snavely et al. 2008; Verges-Llahi et al. 2008). However, the graph algorithms can be costly and may not provide a robust solution. Factorization methods (Soucy and Laurendeau, 1995; Verges-Llahi et al. 2008) that attempt to solve the SfM problem in a single batch optimization are difficult to apply to perspective cameras with significant outliers and missing data (both of which are present in Internet photo collections).
Available tools from the literature and REVERIE partners
Below, the most promising state-of-the-art approaches for generating 3D data from video/image input are provided.

Large-scale multi-view stereo matching (Tola et al. 2011)
Description: A novel approach for large-scale multi-view stereo matching, designed to exploit ultra-high-resolution image sets in order to efficiently compute dense 3D point clouds. Based on the robust DAISY descriptor (Tola et al. 2010), it performs fast and accurate matching between high-resolution images, which limits the computational cost compared to other algorithms. Experimental results showed that this algorithm is able to produce 3D point clouds containing virtually no outliers. This makes it exceedingly suitable for large-scale reconstruction.
Figure 38: Lausanne Cathedral aerial reconstruction (Tola et al. 2011).

Structure-from-Motion method for unstructured image collections (Crandall et al. 2011)
Description: An innovative SfM method for unstructured image collections, which considers all the photos at once rather than incrementally building up a solution. This method is faster than current incremental BA approaches and more robust to reconstruction failures. The approach computes an initial estimate of the camera poses using all available photos, and then refines that estimate and solves for scene structure using BA.
Figure 39: Central Rome reconstruction (Crandall et al. 2011).

Reconstruction of accurate surfaces (Jancosek and Pajdla, 2011)
Description: A novel method for accurately reconstructing surfaces that are not photo-consistent but represent real parts of a scene (e.g. low-textured walls, windows, cars), in order to achieve complete large-scale reconstructions. The importance of these surfaces for complete 3D reconstruction is exhibited on several real-world data sets.
Figure 40: A 3D reconstruction before (left) and after (right) applying the method in (Jancosek and Pajdla, 2011).

Fast automatic modeling of large-scale environments from videos (Frahm et al. 2010)
Description: This approach tackles the active research problem of fast automatic modeling of large-scale environments from videos with millions of frames and from collections of tens of thousands of photographs downloaded from the Internet. The approach leverages recent research in robust estimation, image-based recognition and stereo depth estimation; it achieves real-time reconstruction from video and reconstructs from tens of thousands of downloaded images within less than a day on a single commodity computer.

2. Related Tools to WP5: Networking for immersive communication
2.1. Network Architecture
Many types of network architectures for added services in the internet exist. Research is conducted on overlay networking, content-based networking and P2P structures, all aiming to make the internet handle real-time multimedia traffic streams in a more optimal way. REVERIE will be the first overlay network architecture that allows real-time interactive 3D video over the internet, mixed with traditional media delivery techniques. REVERIE will provide the overlay and synchronization mechanisms for multiple correlated streams that can serve as a reference for future internet architectures that enable 3DTI.
Usefulness to the project
This task helps to create an overview of the WP5 tasks and facilitates their integration. The architectural considerations themselves are also a research area, as the deployment of a proper architecture can greatly enhance system performance.

Available tools from the literature and REVERIE partners
This first version focuses on the state of the art for Tasks 5.1, 5.4 and 5.6. The issue of resource management is also important to determine the overall network architecture (5.1) and the overall REVERIE architecture (2.2). The proposed architecture is shown in Figure 41. The architecture consists of two clouds that serve two different types of communication purposes. The left cloud emphasizes real-time interactive communication, where the delay from sender to receiver should be minimal. The Content Distribution Network (CDN) based media distribution cloud provides enhanced ways to distribute pre-generated media content to different receivers. The first cloud will encompass Tasks 5.1, 5.4 and 5.6, while the second cloud is more related to Tasks 5.1, 5.2 and 5.5 and will rely on previous approaches from the COAST project.

Figure 41: System architecture. The real-time interactive communication cloud contains the media composition server (audio and real-time video scene composition), the avatar engine, signalling and synchronization for active participants; the CDN-based media distribution cloud contains the REVERIE upload tool and publishing front-end, media cloud, QoS monitor, federated cache and admission control, serving cached media, audio, video and images to active participants and passive viewers.

Figure 41 shows the currently envisioned architecture for REVERIE. Currently in REVERIE, four types of traffic are defined: signaling data, avatar control traffic, real-time media (audio/video) for conversation, and cached/stored media such as videos or pictures. Active participants can stream their captured content after compression and networking decisions. All this is done in a fashion that adapts to the current network and REVERIE conditions. Signaling data includes data for starting/ending sessions and monitoring/reporting transmission statistics from clients. The avatar control traffic is generated from signaling related to the autonomous avatars in REVERIE, based on their movement, emotions, actions and graphics. The REVERIE architecture adopts the approach from COAST for publishing media through the publisher front-end, with optimized caching and naming schemes. A user that decides to upload their own content, or content referred to by a URL somewhere on the internet, currently does so via the REVERIE upload tool, which subsequently uses the publishing front-end to enable naming and caching functionalities. The scene composition for real-time interaction currently consists of both the video compositor and the audio compositor. To ensure synchronization, they are represented as a single block. The powerful composition server, including audio and video decomposition, executes complex media operations. While its exact functionalities are not yet clearly specified and described, this server is already anticipated/expected to be a bottleneck of the REVERIE application.
Therefore, a distributed implementation of this server in the network is a possibility at a later stage; however, the consequences for synchronization and real-time capabilities have to be clearly assessed. Another architectural decision was taken to limit the scope: security issues from the network architecture perspective are not taken into account in REVERIE, but are left for future research. Currently, REVERIE supports two different use cases. The first use case is more geared towards the use of autonomous avatars, multiple users and web interfaces. The second use case aims at fewer users, but with more capacity for 3D video stream traffic. The architecture for both use cases is the same, but the implementation of the components might be different.

Table 1: Component development in use case 1 and use case 2
Use Case | Composition Server | Avatar Control | Cached media | Signaling | Monitoring
1 | Light | Heavy | Heavy | Heavy | Light
2 | Heavy | None/Light | optional | Light | Heavy

Current architectural decisions for the REVERIE network architecture:
1. An overlay network architecture is defined for REVERIE following the recommendations of the Future Media Internet Architecture Think Tank (FMTIA). This constitutes an overlay on top of the current internet that can be separated into two media-aware clouds, one focusing on interactive communication and the other on content distribution.
2. The real-time aware cloud supports real-time immersive 3D communication, signaling and monitoring support for sessions, and real-time video and audio composition.
3. The media-distribution aware cloud supports caching and efficient naming schemes to enable efficient delivery of third-party or user-uploaded content for immersion in 3D scenes.
4. Support for different terminal capabilities, i.e. scalable terminals. Currently, three types of terminals are defined for participants, plus an optional viewer terminal.
5. The architecture features independent components for monitoring, signaling, media composition and avatar reasoning, which can be usefully applied in different application areas.
6. Six types of communication traffic are defined: audio, video, avatar control, session control/monitoring, streams of cached content, and upload data.
7. A powerful centralized server for composition of audio and video in the scene representation.
8. No implementation of security issues arising in the network.
9. The architecture supports both use cases; components can be implemented differently according to their relevance to each use case.

2.1.1. REVERIE future internet network overlay
Currently, a debate on how the internet should evolve to support novel multimedia applications is ongoing. One important topic is how the internet can efficiently support video content delivery. Better client/server software alone does not solve this, because application bandwidth requirements are the bottleneck. Such scenarios demand changes in the network infrastructure to achieve their aims (bandwidth allocation is the main network issue, but security, session control and intellectual property management also play a role in the network). Enabling real-time interactive applications, such as video conferencing and possibly 3D immersion as envisaged in REVERIE, may also need support from the underlying network. Massive distribution of video to multiple users has been a challenge for the internet; massive streaming of 3D videos will pose even larger challenges.
Clearly, REVERIE can provide great insights to aid the development of the future internet architecture to enable real-time 3D video. In fact, the network architecture developed in REVERIE can be interpreted as a future internet architecture for 3D tele-immersion based on the concepts from FMTIA. To illustrate the concepts described by FMTIA, we first discuss the architecture they propose. Figure 42 shows the high-level architecture developed by FMTIA. It supports overlay networks (the blue areas in the figure) to provide novel services and functionalities. An overlay network refers to a network deployed on top of an existing network. In REVERIE, an overlay for 3DTI has to be deployed that provides scalability, low latency and bandwidth adaptation.
Figure 42: FMTIA future internet overlay
These application-aware overlays consist of nodes (edge routers, home gateways, terminal devices and caches) and overlay links between them. Note that one link in the overlay network can span multiple links in the underlying network (the regular internet). Between these nodes a network can be dynamically constructed, usually based on the application needs and the conditions of the underlying network (as far as these are known). For achieving this aim, the topology of the constructed overlay is important (as in regular networking), and algorithms/heuristics to dynamically find topologies are also of interest. We give a list of relevant topologies/algorithms/heuristics below:
- Client-server: In REVERIE, all real-time data is relayed through the composition server. Software like Skype and QQ also uses such an approach. Its advantage is that firewall blocks are more easily avoided and there is centralized control. However, in the case of 3DTI the burden of video streaming and processing can become too large when there are multiple video streams.
- Full mesh: All peers are connected and send data directly to each other, which generally results in lower end-to-end latency. However, because all terminals send data to each other, bandwidth can become a bottleneck due to high node degrees (a high number of outgoing links).
- Hybrid scheme: Within a hybrid scheme there is more than one application-aware node that forwards data (i.e. more than in the client-server approach). This can alleviate the link bandwidth. The main difference is that the number of forwarding nodes is fixed. A CDN for video distribution with caches in the overlay network can be seen as an example of a hybrid scheme.
- Application-level multicast: Here the overlay network is dynamically constructed between a not necessarily limited number of application-aware nodes. Paths between terminals can be determined using different algorithms and heuristics such as Kruskal's minimum spanning tree, minimum latency tree, the delay variation multicast algorithm and others (a minimal example is sketched below).
For REVERIE, the currently defined architecture for real-time interaction between participants is a client-server model. While this approach is not really scalable to support heavy streams with many users, it has some practical advantages such as centralized functionality and avoiding firewall policies at end terminals. When the core functionality of the system performs well, peer-to-peer opportunities and more complex overlays can eventually be studied to improve scalability and communication latency.
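To make the application-level multicast idea above concrete, the following minimal sketch builds a minimum-latency spanning tree over a handful of overlay nodes with Kruskal's algorithm. The node names and link latencies are hypothetical placeholders; a real REVERIE overlay would feed in measured network conditions and would likely use a more refined heuristic.

```python
# Minimal sketch: build a multicast overlay tree with Kruskal's algorithm.
# Node names and link latencies are hypothetical placeholders.

def kruskal_mst(nodes, links):
    """links: list of (latency_ms, node_a, node_b); returns the tree edges."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    for latency, a, b in sorted(links):
        ra, rb = find(a), find(b)
        if ra != rb:                       # adding this link creates no cycle
            parent[ra] = rb
            tree.append((a, b, latency))
    return tree

# Hypothetical overlay nodes and measured pairwise latencies (ms).
nodes = ["terminal_A", "terminal_B", "terminal_C", "composition_server"]
links = [
    (12, "terminal_A", "composition_server"),
    (18, "terminal_B", "composition_server"),
    (25, "terminal_C", "composition_server"),
    (30, "terminal_A", "terminal_B"),
    (22, "terminal_B", "terminal_C"),
]

print(kruskal_mst(nodes, links))
```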
For distribution of more statically allocated content, the caching scheme in the content-aware cloud can decrease the overall load on the links.
2.2. Naming
In this section we describe the state of the art in naming for the current and next generation of networks, for the delivery of various types of multimedia content. Naming can appear a very mundane concept, but the concept of a name, the conventions around it, the way we employ it, the mechanics of how we assign, store and publish it and, more importantly, the semantics we attach to it can ultimately involve arcane technical, linguistic and philosophical issues. According to the American Heritage Dictionary, a name is "a word or words by which an entity is designated and distinguished from others". However, even in the physical world, where "entity" is more often than not used to denote concrete objects or things, the concept can in cases be blurry enough to at least inspire the famous "what's in a name?" line and, we suspect, some philosophical debate as well (for which we can provide no references). In computer science and engineering, the realm of possible entities is expanded to include concepts that can be quite abstract, such as a database record, a variable, a video, a segment of a video or even something that appears inside a video. Accordingly, the semantics attached to names assigned to such entities require more rigorous definitions. Skipping the first few decades of computer science, the first systematic approach to define a naming framework that extended past the confines of the computer system or network and was also externally visible to end users came with the advent of the World Wide Web (WWW). WWW standards define three kinds of names:
- Uniform Resource Names (URNs) uniquely identify entities, not necessarily web resources. A URN does not imply availability of the identified resource, i.e. a representation for it may not be obtainable. For example, the valid, registered URN "urn:ietf:rfc:2648" identifies the IETF (Internet Engineering Task Force) RFC (Request for Comments) 2648, "A URN Namespace for IETF Documents", but a user cannot paste it into their browser's address bar and fetch the document. URNs were introduced with RFC 1737 (1994), which does not formally define what a URN is but rather specifies a minimum set of requirements for this class of "Internet resource identifiers" and states that "URNs are used for identification".
- Uniform Resource Locators (URLs). According to RFC 1738 (1994) these are "compact string representations for a resource available via the Internet".
- Uniform Resource Identifiers (URIs). These are defined in RFC 3986 (2005) as a "compact sequence of characters that identifies an abstract or physical resource".
It is telling that even an engineering body such as the IETF has had to refine and further qualify the above concepts, over a period of more than ten years, through a series of updates and additional RFCs to fully clarify their role and their relationship to one another. At this time, the consensus is that URNs are used to identify resources, URLs are used to obtain them, and URIs are the parent concept, the base class if one is to use a programming analogy; i.e. both URLs and URNs are URIs. However, the URL and URN sets are "almost disjoint", i.e. there are schemes that combine aspects of both a URL and a URN. The triad of URIs, URLs and URNs is depicted in Figure 43.
Figure 43: Relationship between URIs, URNs and URLs
Setting aside the nuances surrounding URIs, URNs and URLs, there is even some room for debate on what a resource actually is and how it is defined. For instance, consider the URL www.bbc.co.uk/weather/Paris. Should one construe this URL to denote a resource whose representation will provide:
- the current weather in Paris, in human-readable form?
- the current weather in Paris, in machine-readable form (according to a grammar published elsewhere)?
- an index to historical data on Parisian weather (again, in human- or machine-readable form)?
- a five-day forecast for the weather in Paris?
It is almost impossible to devise a framework that would allow one to formally define what a particular URL is, unless one also has a world ontology (and such does not exist at the end of 2011). The IETF had, circa 1995, a working group on Uniform Resource Characteristics (URC) with the more modest goal of providing a metadata framework for Web resources, but the approach was ill-fated, although it influenced related technologies like the Dublin Core and the Resource Description Framework. We spent some time above discussing the naming schemes used in the context of the WWW for four reasons:
1. To highlight some of the dimensions, properties and challenges of a naming scheme.
2. Because the WWW standardization effort is a rigorous and yet open process that allows one to view the historical evolution of these concepts through the published documents (RFCs).
3. Because it is the system that most non-technical users are aware of (even though the undefined term "web address" is usually used).
4. Because web browsers are ubiquitous client applications and multimedia content is mostly consumed through them.
In general, the following dimensions or properties define a naming scheme or a class of naming schemes working together:
- semantics
- uniqueness
- persistency
- verifiability
- metadata versus opacity
We discuss these dimensions in the subsections that follow, briefly presenting the state of the art in each dimension where applicable. We conclude the section with some thoughts on the REVERIE approach to this matter.
Semantics
The meaning we attach to identifiers and names can vary considerably between schemes. We already discussed how URNs are an example of a class of schemes that identify resources, as opposed to URLs, which are a class of schemes that provide a method for retrieving a representation of these resources. This seems to indicate a dichotomy: "name of the object" versus "name of a method to get the object". Taking into account the full Web software stack and the domain name translation that inevitably takes place behind the scenes, there is really a trichotomy:
- the name of the object
- how to fetch it
- the location of the server where the object is hosted
URLs also denote a location, since they include a domain and a port component in their syntax:
scheme://domain:port/path?query_string#fragment_id
In today's web, there is a variety of methods by which the actual location is not really constrained by the domain component.
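As a small illustration of the URL components listed above, Python's standard urllib.parse module splits an identifier into scheme, authority, path, query and fragment; a URN parses as a URI with scheme "urn" and no network location. The query string and fragment in the example URL are added purely for illustration.

```python
from urllib.parse import urlsplit

# A URL: identifies a resource *and* tells us how/where to fetch it.
url = urlsplit("http://www.bbc.co.uk:80/weather/Paris?units=metric#today")
print(url.scheme, url.netloc, url.path, url.query, url.fragment)
# -> 'http', 'www.bbc.co.uk:80', '/weather/Paris', 'units=metric', 'today'

# A URN: identifies a resource but gives no way to retrieve it.
urn = urlsplit("urn:ietf:rfc:2648")
print(urn.scheme, urn.netloc, urn.path)
# -> scheme 'urn', an empty netloc, and path 'ietf:rfc:2648'
```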
For instance, CDNs use a variety of techniques to route requests for the same domain to different physical servers:
- server balancing
- DNS-based request routing
- dynamic metafile generation
- HTML rewriting
- BGP-based anycasting
The end goal of these techniques is to ensure optimal routing. The criteria can be diverse, but usually the predominant consideration is the geographical proximity of the server, i.e. the closest of a group of servers is chosen. (Geographical distance is obviously not the same as network distance, e.g. the number of hops or Layer 3 routers in the path, but empirical results have shown it to be a good enough approximation. The haversine formula is typically used to compute the distance from the geo-coordinates, if such are available, possibly obtained via a reverse GeoIP database.) Other possible criteria could be server load or the "cost" of certain network routes (in cases where another provider's infrastructure has to be used). The above methods vary greatly in the mechanics they employ, the infrastructure they assume, their degree of transparency to different layers of the software stack or network equipment, and the portion of the path that they can affect, but the end result is the same: they succeed in introducing a further degree of separation between a URL and the actual location. URLs in such a scheme then become, in a sense, more like URNs, as they no longer denote a location.
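As a toy illustration of the geographical-proximity criterion mentioned above, the sketch below implements the haversine formula and picks the closest of a few candidate edge servers. The coordinates and server labels are hypothetical; a production CDN would combine this with server load and route-cost information.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical client and candidate edge servers (lat, lon from a GeoIP lookup).
client = (48.86, 2.35)                                   # Paris
servers = {"ams": (52.37, 4.90), "lon": (51.51, -0.13), "fra": (50.11, 8.68)}

closest = min(servers, key=lambda s: haversine_km(*client, *servers[s]))
print(closest)
```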
However, even the above discussion makes sense only in the context of today's Internet, and more precisely only when one assumes the use of the Internet Protocol (IP) for the network layer (OSI Layer 3). In more than one sense these techniques are hacks that violate the end-to-end principle the Internet was built on. The various efforts of the last ten years towards Content Centric Networking (CCN) will (if fruitful) remove the need for such wizardry by redesigning the software stack. Topologically, the existing Internet can be viewed as a graph of physical equipment; without any loss of generality we consider only routers in this discussion. Though strictly not a tree in the graph-theory sense, the Internet can be approximated as a tree with the edge devices and user equipment (clients and servers) located at the leaves and the core routing equipment at the "branches" of the various "levels" (these terms appear in quotes as they are non-technical, without a definition in graph theory). IP routing tables then operate to ensure the transport of data from any leaf to any other leaf. Although this model is accurate, it is also very general and fails to take advantage of the well-known disparity in the relative populations of data sources versus data sinks, or of the fact that data, and especially multimedia content, is usually produced only once and can remain in demand for years, consumed by multiple data sinks. Research in CCNs was prompted by this observation. Accordingly, in CCNs the focus shifts from the leaves of the Internet tree to the data or content itself (hence the content-centric designation). This is reflected by the fact that names are now used, even in routing tables, to designate not the communication endpoints but rather the content itself. The TRIAD project at Stanford, circa 1999, was the first to propose avoiding Domain Name Server (DNS) lookups by using the name of an object to route towards a close replica of it. A few years later (2006), the DONA project at Berkeley built upon TRIAD by incorporating security (authenticity) and persistence in the architecture. In 2009, PARC announced their content-centric architecture within the Van Jacobson-led CCNx project; specifications for interoperability and an initial open-source GPL implementation were released later the same year. Naming is critical to all CCN projects, as identifiers for content are used directly in the routing tables of most of these architectures. This also forces naming to operate at a fairly low level, as the name of the content has to sufficiently describe the data; in other words, the name should uniquely identify a byte stream or a portion thereof. Since CCN architectures obviously rely on packet switching and have to support streaming modes of content delivery, naming operates at the packet level; i.e. names are assigned to packets of data, not to the content as a whole. This necessitates automatically naming such packets in a unique way, and the usual approach is to assign a hash, usually a cryptographic hash such as MD5 or SHA-256, to at least some of the name components. This obviously does not preclude the use of other components to store metadata such as the IPR owner, version, encoding format, etc. On the other hand, at the routing level the network can simply take the names as they are without trying to infer meaning; conventions and their enforcement can be left to higher layers. That is the approach CCNx takes, where the name is formally defined using BNF notation as:
Name ::= Component*
Component ::= BLOB
That is, the CCNx protocol does not specify any assignment of meaning to names, only their structure; the assignment of meaning (semantics) is left to applications, institutions or global conventions which will arise in due time. From the network's point of view, a CCNx name simply contains a series of Component elements, where each Component is simply a sequence of zero or more bytes without any restrictions as to what this byte sequence may be. In general, it is fair to say that CCNs have not yet entered the mainstream, nor do the dynamics (at least in terms of community size, publications or interest) appear to suggest that they will in the short or even medium term. The CCNx project led by Van Jacobson seems to have been the leading effort in this field for the last few years, but even there the current state is described as follows: "CCNx technology is still at a very early stage of development, with pure infrastructure and no applications, best suited to researchers and adventurous network engineers or software developers" (source: http://www.ccnx.org/about/, accessed November 2011). In our view, this is not to be attributed to any deficit in the theoretical foundations or the engineering behind the CCNx effort, but rather to the fact that existing solutions like CDNs are so widespread and seem to work just well enough, for the time being, even though undoubtedly lacking in grace. It may also be the case that the current Internet architecture and the often criticized "hourglass" IP stack represent such a massive investment in terms of capital and human talent (theoretical work, applications, protocols, deployed assets) that it is going to take a comparable commitment, or a clearly perceived technical roadblock, to prompt any serious redesign. Perhaps recognizing that, certain application-layer designs have also been proposed.
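As a toy illustration of the kind of hash-based, component-structured content names discussed above, the snippet below names a packet-sized chunk by a publisher prefix, a version, a segment number and a SHA-256 digest of the chunk's bytes. The component layout is a hypothetical convention of our own, not part of the CCNx specification; a receiver can use the digest component to verify the chunk it obtains, which is also the self-certifying idea used by the application-layer design discussed next.

```python
import hashlib

def name_chunk(publisher, version, seq, chunk):
    """Return a list of name components for one content chunk.
    The layout (publisher/version/segment/digest) is a hypothetical convention."""
    digest = hashlib.sha256(chunk).hexdigest()
    return [publisher.encode(), f"v{version}".encode(),
            f"seg{seq:06d}".encode(), digest.encode()]

data = b"example media payload " * 60      # stand-in for one packet-sized payload
name = name_chunk("reverie/media/demo", 1, 0, data)
print([c.decode() for c in name])

# A receiver can verify the chunk it fetched against the last name component.
assert hashlib.sha256(data).hexdigest() == name[-1].decode()
```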
The Juno middleware is one such content-centric solution, which again relies on "generating one or more hash values from the content's data" to implement a self-certifying content naming approach ("Juno: An Adaptive Delivery-Centric Middleware", available from http://www.dcs.kcl.ac.uk/staff/tysong/files/FMN12.pdf, accessed November 2011).
Uniqueness
In the previous subsection on semantics, where we surveyed some prevalent and emerging networking approaches, we mentioned that generating hashes (preferably cryptographic hashes such as those in the MD and SHA families) from the content data is the obvious way to generate unique identifiers. These hashes may then be further extended with additional name components, which may be administratively derived or assigned by the applications and which can be used to store metadata. Another way to guarantee uniqueness is to rely entirely on central or hierarchical registration authorities and (for the last part of a name) on administrative house-keeping within an organization or application. This is the approach URIs take, relying on the hierarchical DNS system. In these approaches there is no automatic validation built into the naming, which is why other methods have to be used for that purpose (relying on certificates or securing the HTTPS channel). Finally, there are also completely or partially random approaches to naming, which generate unique identifiers without any reference to the content by hashing together information such as a computer's MAC address, a timestamp and a 64- or 128-bit random number, making it sufficiently improbable that two identifiers generated in such a scheme will ever collide. Such identifiers are known as Universally Unique Identifiers (UUIDs) and are documented in RFC 4122 and in equivalent ITU-T and ISO/IEC recommendations. Microsoft's Component Object Model was the first to use them (called GUIDs, for "globally unique"). Open-source implementations to generate such identifiers are available for most programming languages and database frameworks.
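Python's standard uuid module implements the RFC 4122 identifiers mentioned above; a minimal sketch follows (the URL used in the name-based variant is hypothetical).

```python
import uuid

# Time + MAC based identifier (RFC 4122, version 1).
print(uuid.uuid1())

# Purely random identifier (RFC 4122, version 4), 122 random bits.
print(uuid.uuid4())

# Name-based identifier (version 5): deterministic for a given namespace + name,
# handy when the same resource must always map to the same identifier.
print(uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/reverie/asset/42"))
```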
Naming Persistency
Defined rather crudely and by way of example, naming persistency is the property of a naming system that allows one to write down on a piece of paper (or store in a file, if the name is not human-friendly) a name used to identify some data (or a service), disconnect from the system, come back after an arbitrary period of time (during which servers may have undergone reboots or even crashes) and use the name to retrieve the same or equivalent data, provided of course that the data itself has survived. The Web is an example of a system without naming persistency, as witnessed almost every day by broken links and 404 responses, yet some persistency is still assumed for practical purposes (which is why even this document provides URLs for references). Naming persistency is an especially challenging requirement in peer-to-peer networks due to the dynamics of the peer population and the ad hoc formation and disbanding of groups. In such a context (i.e. one with a non-stable population of nodes, especially in high-churn situations), distributed name registration and resolution are used to provide persistency. However, it is not always the case that the benefits of such a property outweigh the complexity it adds. The paper by Bari et al. (2011) explores some of these issues.
Verifiability
The subsection on semantics also discussed some of the verifiability concerns. We summarize them here and provide some additional pointers, since this property merits independent treatment. Verification in the case of immutable data is not really an issue, as any hash can be used to provide reasonable validation. Of course, that should preferably be a strong, collision-resistant cryptographic hash function. Most functions in the MD or SHA families fit the bill, although advances in cryptanalysis compel users to move to stronger hashes over time (e.g. MD5 was broken in 2008; note that "broken" does not mean that the hash no longer offers any protection, just that an attack with a complexity lower than previously imagined has been devised against it, and since the computational cost of most hash functions is negligible there is no reason not to move to stronger hashes over time, except for high software update and regression testing costs). SHA-1 and SHA-2 are the most commonly used cryptographic hash functions as of 2011, together with MD5 despite its being "broken". SHA-3 is expected to become a FIPS (Federal Information Processing Standard) around 2012. An interesting case arises when the verifiability of mutable data is concerned. For mutable data, two approaches can be taken:
- Take the view that each change actually generates a completely different object and that there is no need to maintain ancestry information as part of the name (such metadata can be maintained externally if necessary). In such an approach the issue becomes moot.
- Try to encode ancestry/versioning information in the name itself. In such an approach the name will need to have a fixed component which is supplemented by a changing set of version hashes. This, however, means that the totality of the name can no longer be verified, and so full verification has to involve other cryptographic techniques such as digital signatures, which require the overhead of a public key infrastructure.
Metadata versus Opacity
The discussion of metadata versus opacity relates to names since it affects the decision, when devising a naming scheme, of whether to include metadata as part of the name itself (e.g. as a component of the name, as in a file extension or in file names which embed a date) or whether to consider it extraneous and maintain it in a separate infrastructure (if at all). In an opaque arrangement, the software handling the name effectively treats it as a blob and does not try to infer any meaning or differentiate its processing by examining name components. Clearly, in cases where the names embed metadata it is still possible for lower levels to treat them opaquely and only expect higher application layers to use this metadata; but in practice the temptation may be too strong to ignore. Architectures where identifiers are really opaque (i.e. opaque by construction, as opposed to being treated as opaque) make such proper layering considerations or end-to-end principles easier to enforce (forfeiting the possibility of "clever" improvements in the network). CORBA's IORs and J2EE object references are two such cases in point.
2.3. Resource Management
This section discusses the state of the art in managing resources in 3DTI applications similar to the one developed in REVERIE. In REVERIE, we aim to develop more advanced resource management schemes. As a starting point, the resource management schemes previously applied in 3DTI applications in Illinois, Berkeley, Pennsylvania and Chapel Hill will be discussed (Nahrstedt et al. 2011a; Towles et al. 2003). Managing resources is one of the main challenges in REVERIE. 3DTI is a demanding application that consumes network and computing resources rapidly and requires low delays for interactive communications.
Proper resource usage will benefit the project in the following ways:
- High interactivity due to smaller end-to-end delays
- Better image/perceptual quality due to better bandwidth usage
- Fewer system breakdowns due to unavailable resources
Available tools from the literature and REVERIE partners
In 3D tele-immersion, large volumes of data are interchanged in real time. To achieve a high-quality real-time experience, low latency and high throughput are required. To achieve this, resource management (of network, hardware and software resources) in the system is of utmost importance. Poorly utilized links, losses and CPU blocking are factors that can heavily deteriorate the user experience. Nahrstedt et al. (2011a) have over six years of experience in deploying and testing 3D tele-immersive environments and have summarized the resource management issues. This work can be consulted as a reference for resource management in 3DTI and other complex interactive media environments. We summarize the results relevant to REVERIE below, to clearly underscore the design issues and challenges at hand. The main concerns are the build-up of latency through the system and blocking due to bandwidth limitations. From these techniques, we will choose suitable ones for managing processing power at the receivers and terminals, and for managing network bandwidth.
Bandwidth: Both CPU and network bandwidth have to be taken into account to avoid partial blocking of media streams and to maintain synchronous arrival. This resource is needed to achieve high throughput of events from the sender site to the receiver site. We will survey the following mechanisms to manage these resources:
Network bandwidth:
1. Diffserv: a technique increasingly used in the internet that allows differentiation between packets at the IP level. Higher-priority packets are served first. The Per Hop Behavior (PHB) can be defined per Diffserv class and defines queuing, scheduling, policing and traffic shaping. Current PHBs are: Default PHB, Class Selector PHB, Expedited Forwarding PHB and Assured Forwarding PHB. While the approach is scalable, it requires knowledge of the internet traffic and of the routers used in the network. Diffserv can be deployed by the telecom operator or Internet Service Provider (ISP).
2. Intserv: a flow-based Quality of Service (QoS) model. Receivers reserve bandwidth for the end-to-end path, typically using the Resource Reservation Protocol (RSVP), allowing routers in the network to maintain a soft-state path. RSVP/Intserv is generally considered less scalable.
3. MPLS: Multiprotocol Label Switching is often employed to provide QoS, mostly from the operator side. MPLS [RFC 3031] inserts a label between layer 2 and layer 3 and is sometimes called layer 2.5. The initial idea was that forwarding based on a label value would be faster than forwarding on an address as done in IP. Currently the three important applications are:
a) Enabling IP capabilities on non-IP forwarding devices, mostly ATM, but also Ethernet/PPP (also called generalized MPLS).
b) Explicit routing paths: in some types of networks certain paths are preferred, but the IP routing mechanism does not support explicitly choosing a path. This facility is mostly used by operators for traffic engineering, so that resources are properly used.
c) Layer 2/Layer 3 tunneling: by using MPLS labels from the head end, different protocols/packets can be encapsulated, and ATM or other types of layer 2 services can be emulated. This can be useful for providing legacy services on newer hardware. As IP can also be encapsulated between remote sites using an MPLS-based virtual connection, the construction of VPNs is a major application of MPLS.
MPLS is a mainstream technology mainly employed by telecom service operators and internet operators. Its applicability to the REVERIE use case of deploying tele-immersion on the internet is limited, because we cannot change the operators' networks. However, employing tele-immersion over a private VPN could be possible, though somewhat costly.
4. Bandwidth broker: a bandwidth broker manages resources by admission and allocation of resources based on the current network load and Service Level Agreements (SLAs). By continuously monitoring the RTT, loss rate and packet size from the monitoring output, it can adapt its parameters to decide whether or not to admit a stream. Bandwidth estimation is important in this scheme as it determines the adaptation steps of the algorithm; several techniques have been developed to do so, as referenced in Nahrstedt et al. (2011a). A bandwidth broker is usually deployed at the edge routers of the network.
5. Application-level overlay: as already discussed in the architecture section, an application-level overlay can be used to achieve better management of the underlying resources. For example, Zhang et al. (2005) presented an overlay for streaming video on the internet, which was previously not feasible due to bandwidth restrictions. Other overlays for streaming media were presented in Yi et al. (2004), while a general framework for deploying overlays in PlanetLab was presented in Jiang and Xu (2005). The main advantage of using overlays is that the underlying network infrastructure does not need to be changed, whereas the previous methods do require such changes. However, the independence of the overlay network from the underlying network/physical infrastructure can also result in mismatches and less efficient resource usage, as shown in Ripeanu (2001). In REVERIE we are most likely to use an overlay network or an (MPLS-based) VPN for bandwidth, as Diffserv, Intserv and a bandwidth broker would require network control not available to the partners.
CPU bandwidth: For real-time media streaming, it is important to control the CPU bandwidth (i.e. cycles per time unit) so as to avoid blocking of real-time media throughput. Blocking of real-time media can unnecessarily result in extra delays, jitter and skew both at the sender and receiver sites. Managing and allocating processor bandwidth to streams can be seen as a processor scheduling problem. A distinction can be made between best-effort schedulers and real-time (media) schedulers. Best-effort schedulers can allow prioritization of some tasks (media tasks) or allow bandwidth to be divided proportionally; however, they do not provide hard guarantees of when a task will be scheduled. In real-time schedulers, on the other hand, algorithms like earliest deadline first can provide hard deadlines on when tasks are scheduled, i.e. real-time capabilities.
Similar to the allocation of network bandwidth, admission algorithms have been developed for admitting tasks to be scheduled. In REVERIE we will investigate the applicability of currently available real-time and best-effort schedulers to our 3DTI application where possible. In REVERIE we are not planning to deploy CPU scheduling at the clients and servers. Instead, we will deploy monitoring tools to make sure that blocking is avoided both at the clients and at the composition server.
Delay (end-to-end delay): In general, as defined in the ITU standard G.114 (2003), delays above approximately 400 ms are unacceptable in two-way videoconferencing. This delay requirement can also be assumed for 3DTI. In the current (centralized) REVERIE architecture the end-to-end (e2e) delay for real-time data can be described as:
E2ED = D_sender + D_network + D_server + D_network + D_receiver
D_sender consists of the capturing and compression delay at the sender site, D_network represents the delays introduced in the network (once on the path to the server and once on the path to the receiver), and D_server represents the processing delays at the server (transcoding, the scene composition model). Delays at the receiver consist of decoding, buffering and rendering. Synchronization of multiple streams can also cause extra delay, as buffering is required to adapt to the skew between the streams. All these steps are described in Nahrstedt et al. (2011a); we briefly summarize them here.
Sender-side delay: Many delays are incurred in hardware/software during capturing and coding. 2D video encoding and the reconstruction of 3D scenes can be optimized by hardware/parallel implementations; references are given in the related section of Nahrstedt et al. (2011a). Generally, in interactive applications such as video conferencing, it is preferred to keep the encoding and decoding delays as small as possible. Most of the time, compression and decompression have the same order of complexity (delay).
1. REVERIE supports different types of senders (scalable terminals); each level will be tested for its (limited) processing delays.
2. Preferably, an API to the networking/monitoring components is available that reports delays, so as to enable Quality of Experience based networking.
Internet network delay: One of the aims of REVERIE is to show that 3DTI can run on the internet; therefore an approach with no leased or specialized networks is taken. The regular internet will introduce extra delays that have to be taken into account. We must distinguish between point-to-point and multi-point delay management. The first is tightly related to bandwidth management; with multiple sites, delay management becomes important (as delay accumulates over multiple hops). Currently in REVERIE we have two different approaches to handle the network delay in each use case:
1. In use case 1, there are many users. However, traffic is simplified in such a way that links are under-committed and internet delay only plays a minor role. If this is not sufficient, techniques from P2P/overlay networking or gaming can be investigated.
2. In use case 2, there are three users with heavy traffic. The links to and from the server will be monitored continuously for capacity/latency/loss, and the bit-rate will be adapted using the adaptive coding schemes that will be developed. We will create awareness of the network state at the two terminals and the server.
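As a tiny illustration of the kind of monitoring-driven adaptation described for use case 2, the sketch below adjusts a sender's target bit-rate from reported loss and RTT figures in an additive-increase/multiplicative-decrease fashion. All thresholds, step sizes and the starting rate are hypothetical; the actual REVERIE adaptation logic is still to be developed in the project.

```python
# Minimal sketch of sender-side rate adaptation driven by monitoring feedback
# (e.g. RTCP receiver reports). Thresholds and step sizes are hypothetical.

def adapt_bitrate(current_kbps, loss_fraction, rtt_ms,
                  min_kbps=500, max_kbps=20000):
    """Additive-increase / multiplicative-decrease style adaptation."""
    if loss_fraction > 0.02 or rtt_ms > 300:
        return max(min_kbps, int(current_kbps * 0.7))   # back off under congestion
    if loss_fraction < 0.005:
        return min(max_kbps, current_kbps + 250)        # probe for more bandwidth
    return current_kbps                                  # otherwise hold steady

rate = 8000  # kbps, hypothetical starting point for a heavy 3D video stream
for loss, rtt in [(0.0, 80), (0.0, 90), (0.03, 310), (0.01, 120), (0.0, 100)]:
    rate = adapt_bitrate(rate, loss, rtt)
    print(rate)
```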
Receiver-side delay: Loss concealment, receiver buffering and the decoding/rendering of 2D/3D video streams all introduce computational delays. In the REVERIE case we should make sure that the 2D/3D video rendering/decompression is fast enough given normally sized buffers and computationally simple concealment techniques. Also, adapting the rate based on monitored receiver-side delays can increase performance (via a measurement API interface).
1. REVERIE supports different types of receivers (scalable terminals); each level will be tested for its (limited) processing delays.
2. Preferably, an API to the networking/monitoring components is available that reports delays, so as to enable Quality of Experience based networking.
Synchronization of multiple streams: In previous work, 3DTI systems have consisted of multiple correlated video streams from different cameras/receivers. Synchronizing the multiple streams again at the receiver is a challenging task, and in REVERIE a similarly complex synchronization problem is encountered. In this section we briefly explain the two synchronization problems encountered in 3DTI as described in Ott and Mayer-Patel (2004) and Huang et al. (2011).
In Ott and Mayer-Patel (2004), a single 3DTI site, referred to as a cluster, consists of different IP cameras which send multiple related streams via a gateway towards another cluster, where the streams are rendered. On the path between the two clusters, the streams traverse similar links and routers. If links become over-committed, the streams will compete for bandwidth and parts of a stream may be dropped to avoid congestion. When the streams are processed by the receiver gateway and rendered at the receiving cluster, it turns out that the damaged stream causes problems when rendering the intact stream. It would have been much better if the streams had experienced similar loss/delay or bit-rate adaptation to avoid the congestion in the first place. As the streams come from different IP cameras (different senders), this type of coordination is not supported by current internet protocols. The authors therefore develop a protocol on top of the User Datagram Protocol (UDP) that allows monitoring of connections at the gateways of the clusters, to enable coordination between streams. The experimental results showed the improvements achieved. This specific type of effect can be expected in REVERIE, for example in use case 2 when multiple IP cameras synchronously capture the scene and transmit it to another site.
In Huang et al. (2011), a similar synchronization problem was tackled with multiple sites (N = 5-9). Each site produces multiple (N) synchronous 3D video streams that are sent to more than one receiver site. Each receiver (theoretically) needs all N streams to be able to render properly. This raises the following questions regarding the inter-stream synchronization of streams from the same sender:
1) Can the number of streams that each receiver needs be reduced to alleviate the network?
2) To which video stream should the other streams synchronize when rendered? (Which video stream is the master stream?)
3) Can we route the streams in such a way that they arrive at approximately the same time at their destination? This implies small inter-stream skew and improved inter-destination synchronization.
Regarding questions 1) and 2), the work of Huang et al. (2011) draws on a previous study (Yang et al. 2009).
Depending on the viewing direction of the user at the receiver site, the number of streams needed is reduced (only the ones relevant to the viewing direction are requested). Also, for synchronization, the stream closest to the direction of the user's view is chosen as the stream to which the other streams synchronize. The work in Huang et al. (2011) mainly focuses on question 3): an algorithm is developed that aims to minimize the overall end-to-end delay, with both the inter-stream skew and the bandwidth on the links as given constraints, for 5-9 nodes. For a deeper look we refer to the paper itself; for REVERIE it is important to take the problems presented here into account, as we will need to develop our own custom solutions suited to the use cases. In REVERIE, significant issues regarding inter-stream synchronization are expected to arise. According to previous research, solutions lie in adapted routing/overlay networking (Huang et al. 2011) and in adaptation of internet protocols (Ott and Mayer-Patel 2004). In REVERIE we will investigate both approaches. More specifically, in use case 1 we will test overlay schemes for multiple nodes, while in use case 2 we will investigate protocol adaptation for communication between clusters. The REVERIE consortium will report protocol adaptations useful for 3DTI to the IETF as possible new drafts.
2.4. Streaming
This task deals with the live transmission of media over the internet, often referred to as streaming. Protocols, codecs and technologies to support this have been developed in the past and are described below. Adoption of an appropriate streaming mechanism allows smooth and interoperable media flow within the REVERIE framework. The streaming mechanisms developed will set requirements on resource management and on the network architecture. Streaming does not only deal with the compression of media data, but also with the packetisation and transport of such media over the network in an optimal way, as is needed in REVERIE. Streaming mechanisms can be adapted to the network/architecture and its conditions, or the other way around.
Available tools from the literature and REVERIE partners
Signalling and synchronisation for bidirectional data exchange
Description: Overall system synchronisation; depending on the architectural choice, it may include the synchronisation of each user with the server and/or among the users themselves. It also includes the synchronisation of user-generated content with the cloud. Ultimately, it will ensure that each user is synchronised with the virtual environment.
Dependencies on other technology: Depends on the final implementation choices of the architecture, e.g. P2P, hierarchical data distribution, client/server architecture.
Data streams synchronisation
Description: Synchronisation of data streams at the receiving side (audio, separate 3D views, depth maps, 3D mesh models, avatar positions and any other type of data, which are potentially interdependent). The goal is to ensure a consistent reconstruction of the scene.
Dependencies on other technology: Chosen standards or design choices for audio, video, or 3D shape representation.
Real-time streaming
Description: Use of transmission protocols that ensure delay-constrained data delivery, as required by the system.
Common techniques, like the Real-Time Transport Protocol (RTP) and the Real-Time Transport Control Protocol (RTCP), can be used to regulate the data flow depending on the available resources and the configuration demanded by the user (within the degree of control allowed by the system).
Dependencies on other technology: Logical organisation of the components of the system (users and cloud), i.e. user and server control of the exchange of data.
In this section we present the state of the art in streaming real-time media on the internet. In REVERIE we will be streaming multiple heavy media streams over the internet, therefore selecting the right technology/protocol for streaming is an important issue. First we review the basic principles of video coding and the evolution of the related standards. Second, we summarize the internet protocols available for streaming. After this we look at specific issues and research related to streaming for 3DTI. Two differences between 3DTI streaming and, for example, a video conferencing stream are the need for inter-stream synchronization between multiple related streams and the fact that streams originate from different senders.
2.4.1. Internet Protocols
The selection of the appropriate protocol for encapsulating and transporting REVERIE media over the internet is an important research question investigated in REVERIE. Task 5.4 investigates different real-time internet protocols for this purpose. We list commonly used protocols for media streaming below. The common distinction we will follow is between acknowledged and unacknowledged protocols for transmission, and control protocols for connection setup. In the internet protocol stack, the Transmission Control Protocol (TCP) provides reliable, acknowledged data transport. Its main principle is that unacknowledged packets (losses on the physical layer or IP-level drops) cause it to halve its congestion window and decrease its effective send rate by half. TCP is the dominant protocol on the internet and so far its anti-congestion mechanism has served the internet well. In the internet protocol stack, UDP provides an unreliable datagram transport function. While it provides a checksum for error detection, the handling of errors and congestion is meant to be done by higher-level protocols, so-called application-level framing. The RTP/RTCP protocols provide this type of application-dependent framing on top of UDP. For many multimedia applications (codecs, video, audio), the Internet Engineering Task Force provides standardized packet formats for both the transmitted packets (RTP) and the feedback reports (RTCP), and many multimedia applications indeed use this format. One of its main advantages is that it is not bound by TCP's behavior of halving the send rate, enabling higher effective throughput. As TCP traffic is still dominant in the current internet, ISPs have not yet shown efforts to enforce TCP-friendly behavior.
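To give an idea of the application-level framing that RTP adds on top of UDP, the sketch below packs the fixed 12-byte RTP header (RFC 3550) in front of an opaque payload and sends it as a UDP datagram. The payload type (96, a dynamic type), SSRC value and destination address are hypothetical, and this is only the fixed header, not a full RTP stack.

```python
import socket
import struct

def rtp_packet(payload, seq, timestamp, ssrc, payload_type=96, marker=0):
    """Prepend the fixed 12-byte RTP header (RFC 3550) to a payload."""
    v_p_x_cc = 2 << 6                               # version 2, no padding/extension/CSRC
    m_pt = (marker << 7) | (payload_type & 0x7F)    # marker bit + payload type
    header = struct.pack("!BBHII", v_p_x_cc, m_pt, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ssrc = 0x12345678                                   # hypothetical stream identifier
for seq, frame in enumerate([b"frame0", b"frame1", b"frame2"]):
    pkt = rtp_packet(frame, seq, timestamp=seq * 3000, ssrc=ssrc)
    sock.sendto(pkt, ("127.0.0.1", 5004))           # hypothetical receiver address
```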
While the use of RTP/UDP seems preferable for media traffic, the internet initially evolved mainly as a medium for exchanging textual information, for which the combination of HTTP/TCP is common. As this type of traffic has become prevalent, tunneling media on top of it has also become popular. While from a networking point of view this is not optimal, web and mobile video transmitted over HTTP are popular. To support mobile and web video over HTTP in a standardized way, MPEG (the Moving Picture Experts Group) and 3GPP (the 3rd Generation Partnership Project) have released the MPEG-DASH (Dynamic Adaptive Streaming over HTTP) standard, which supports transmission of coded video in segments of 2-10 seconds over HTTP (Stockhammer, 2011). The receiving client can merge the segments together and display them smoothly. For setting up and controlling sessions, several control protocols have been standardized by the IETF that can also be deployed in REVERIE. For example, the Session Initiation Protocol (SIP) is often used to locate a sender and receiver and set up an RTP media stream connection. SIP can exchange Session Description Protocol (SDP) parameters that store specific media parameters (format, description, codec, bandwidth needed, etc.). The Real Time Streaming Protocol (RTSP) is an application-level protocol that gives the client direct control of the delivery of media data with real-time properties. RTSP provides an extensible framework to enable controlled, on-demand delivery of real-time data, such as audio and video, and supports options like pause, play, fast-forward, etc. In REVERIE we should try both RTP-based streaming and TCP/HTTP-based streaming according to the new MPEG-DASH specification. At first sight, RTP seems to provide the large throughput needed in use case 2 for full media features, while HTTP adaptive streaming could be interesting to test in use case 1. Comparisons of the performance of the two could also be interesting from a research perspective.
2.4.2. Basic Video Compression
Video compression is an extensive field of research marked by continuous industry standardization efforts, mainly driven by MPEG and the ITU-T Video Coding Experts Group (VCEG). The stakes are high: applications like digital TV, internet video, video conferencing systems and cinemas all use video compression and reach billions of people worldwide. In this section we review the principles of video compression as implemented in the related standards. In REVERIE, encoding/decoding mechanisms are needed to meet the bandwidth requirements. However, as encoding operations introduce computational and buffering latency, an efficient scheme needs to be selected to minimize this unwanted latency. To do this we need knowledge of 3D video coding, which turns out to be related to, or based on, basic 2D video coding (hybrid video coders). We will not go deeply into technical details, but instead focus on issues relevant to the REVERIE use cases and give brief summaries. We will first explain the four basic techniques/principles used in most video coders and then review the existing video standards.
The first technique that is fundamental to both image and video coding is intra-coding. In an image, pixels/areas close to each other often contain similar values. Exploiting this for compression is called intra-coding and is used in most current standards, often with transform coding. Transforms of blocks of 8x8 pixels (or 4x4 in H.264) concentrate most of the information in the first coefficients, which represent the lower frequencies. This allows more efficient compression by dropping the trailing coefficients that are close to zero. Also, in H.264, prediction between transformed blocks in the same image is applied, obtaining even higher intra-compression gains. This technique is applied in both image coding and video coding.
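The transform step of intra-coding can be illustrated in a few lines: an 8x8 block is transformed with a 2D DCT, the coefficients are coarsely quantized, and most of the energy ends up in the low-frequency corner so the trailing coefficients can be dropped. This is a didactic sketch, not the integer transform or quantization matrices of any particular standard; the block content and step size are illustrative.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

D = dct_matrix(8)
# A smooth 8x8 block (pixel ramp), centred around zero as most codecs do.
block = np.fromfunction(lambda y, x: 100.0 + 6.0 * x + 3.0 * y, (8, 8)) - 128.0

coeffs = D @ block @ D.T            # 2D DCT: energy gathers in the low-frequency corner
q = np.round(coeffs / 16.0)         # coarse uniform quantization (illustrative step size)
print(f"{np.count_nonzero(q)}/64 coefficients survive quantization")

recon = D.T @ (q * 16.0) @ D        # dequantize and inverse-transform
print(f"max reconstruction error: {np.abs(recon - block).max():.1f}")
```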
The second principle fundamental to video coding is inter-frame coding. Inter-frame coding takes advantage of the fact that sequential frames in a video are often very similar, so the next frame can be estimated from the previous one. By estimating the motion of blocks in the frame, the next frame can be predicted with reasonable accuracy. As a consequence, apart from the motion estimate, only the difference between the estimate and the real frame has to be transmitted; this difference tends to be small and can be compressed efficiently (fewer quantization steps are needed). While forward prediction was already enabled in the H.261 video standard for conversational video, the later MPEG-1, MPEG-2 and H.264/AVC standards also support bi-predictive coded frames that additionally use a future frame for prediction. This introduces extra coding delay; the Simple profile of MPEG-2 and the Baseline profile of H.264 are therefore recommended, as they disable the use of bi-predictive coding.
The third important principle, which applies to both image and video coding (and any other type of information source), is entropy coding. Without going into theoretical details, entropy coding aims to reduce the average bit-rate by assigning shorter code words to more likely symbols. In the English alphabet, symbols like 'e' and 'a' would be assigned shorter bit-words, while less likely symbols such as 'q' or 'z' are assigned longer bit-words. Known schemes to obtain these words are Huffman coding and arithmetic coding; the latter is often used in modern video standards. Entropy coding gives good results, especially if values tend to be clustered.
The fourth important basic principle found in most video coding standards is quantization and rate control. Quantization is a form of lossy coding and refers to representing a large (or infinite) set of values by a (much) smaller set. The simplest example is analog-to-digital conversion when reading values from a sensor: the continuous values of the sensor are converted to a digital representation. For compression/storage purposes, fewer bits are preferred. Many parameters in a video coding system can also be re-quantized and sent using fewer bits; most standards support many quantizers to adapt to the signal. Regarding the output of the encoder, an approximately constant bit-rate is usually wanted. As the bit-rate depends on the video content and changes over time, rate control is often achieved by selecting a different quantizer.
Now that the fundamental principles of video coding have been briefly explained, the evolution of the different standards is briefly discussed. H.261 (1993) was one of the first hybrid video coders and supported conversational video at 64 kbps. H.261 features all the above techniques except bi-predictive encoding. The MPEG-1 standard is very similar to H.261 but targeted video storage on discs and hard drives; MPEG-1 introduced bi-predictive coding and provided random access to media files. MPEG-2 was subsequently developed as a more application-independent standard that provides different levels and profiles for each application. MPEG-2 supports conversational video, broadcasting and stored video in its different profiles. It notably supports the interlacing common in TV signals and has become dominant in TV broadcasting. Much of the current digital TV infrastructure supports MPEG-2 transmission/broadcasting; therefore MPEG-2 is sometimes also referred to as a transmission medium.
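A toy illustration of the block-matching motion estimation that underlies inter-frame coding in the standards above: for one block of the current frame, an exhaustive search over a small window in the previous frame finds the displacement that minimizes the sum of absolute differences (SAD), so that only a motion vector and a small residual need to be coded. Real encoders use far more sophisticated, hierarchical searches; the frame content here is synthetic.

```python
import numpy as np

def best_motion_vector(prev, cur, y, x, bs=8, radius=4):
    """Exhaustive block-matching search around (y, x), minimizing SAD."""
    block = cur[y:y + bs, x:x + bs]
    best, best_sad = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= prev.shape[0] - bs and 0 <= xx <= prev.shape[1] - bs:
                sad = np.abs(block - prev[yy:yy + bs, xx:xx + bs]).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
    return best, best_sad

# Synthetic example: the "current" frame is the previous frame shifted by (2, 3).
prev = np.random.rand(64, 64)
cur = np.roll(prev, shift=(2, 3), axis=(0, 1))
mv, sad = best_motion_vector(prev, cur, y=24, x=24)
print(mv, sad)   # expected motion vector (-2, -3) with a near-zero residual
```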
H.263 and H.263+ can be seen as the successors of H.261 and achieved significantly better compression than that standard for conversational video. In 2003 H.264/AVC was released; this standard achieves much better compression rates than previous standards (up to 50% where latency is tolerated). Apart from that, H.264/AVC supports the different types of applications (conversational video, stored files (the MPEG-4 format) and streaming) and supports the different transport layers in a clean way by using Network Abstraction Layer Units (NALUs). The profile recommended for conversational video in H.264 is the Baseline profile, which does not use B-frames/slices (bi-directional prediction). For an overview of the different technical features of the standard, the video coding layer and the network abstraction, we refer to Wiegand et al. (2003). Due to its complexity, the standard took some time to become fully supported by fast software and hardware. Fast encoding seemed to be the most critical issue, especially for conversational (real-time) video applications that require low encoding latency. Currently, the open-source x264 encoder provides fast encoding (Merritt and Vanam, 2004). Moreover, very low latency (<20 ms) can be achieved for conversational video according to Garrett-Glaser (2011), based on an Intel i7 1.6 GHz computer. The H.264 standard fully specifies the properties of the decoder, which is able to decode any input stream fulfilling the constraints of H.264. A description of a reference decoder is given in Merritt and Vanam (2004); decoders that conform to the reference should produce similar output for a given H.264 stream. Practically speaking, the open-source library libavcodec supports fast H.264 decoding, in the order of <10 ms for the Baseline profile. Wenger (2003) presents a paper on employing H.264 over IP, with an emphasis on wireline networks with relatively low error rates, where most of the losses in transmission are caused by congestion. Wenger (2003) argues that adding redundancy (forward error correction/detection bits) to combat losses increases traffic and is counterproductive, as it will lead to more congestion and more dropped packets. Therefore the design of H.264 provides different ways to combat the effect of these losses without introducing redundancy. We outline them below as they can be of use in the REVERIE streaming case:
1. Intra-placement: Due to intra/inter prediction, errors can drift in time; this effect is combated by inserting IDR pictures that invalidate the previous reference memory. The effect is stronger than with a normal intra frame, as in H.264 pictures can use frames before the last intra frame for prediction.
2. Picture segmentation into independently decodable slices: The picture is partitioned into slices of a number of macroblocks. The main motivation is that one slice can fit into one Maximum Transmission Unit (MTU) (IP packet); in this case losses will not affect the entire picture.
3. Data partitioning: Different symbols of a slice are given different priorities. The most important information in H.264 (headers, quantization and motion parameters) has the highest priority. Second is the intra-partition, which carries intra-coded content only. The lowest priority is given to the inter-partition, which contains inter-predicted content.
4. Parameter sets: In H.264 this infrequently changing parameter information (video format, entropy coding mechanism, etc.) can be sent out of band, outside the real-time communication channel.
This avoids the severe errors introduced when a parameter set is not available, which would make the stream temporarily undecodable.
5. Flexible macroblock ordering: In this technique, subsequent macroblocks are part of different slices that are transmitted in different MTUs (packets). This way some of the inter-prediction is broken, but if one of the slice packets is lost, concealment is possible using the other slice.
In his experiments, Wenger (2003) tests the mechanisms for error rates of 0%, 3%, 5%, 10% and 20% on two video test sets. The results show that the mechanisms described above improve the quality of the received signal considerably. Without these types of control, the signal becomes useless at error rates of 3%, while with error resilience, error rates up to 20% still seem to give reasonable output signals. In REVERIE we should choose the right mechanisms to support error resilience. A new video coding standard, HEVC/H.265, is planned for January 2013 by ITU/MPEG. While new compression gains are expected, for REVERIE it is crucial to rely on robust existing hardware/software solutions (preferably open source) to implement our rendering and coding platforms. For this reason HEVC/H.265 is considered out of scope. H.264 is the international video standard that achieves the best compression and is designed for network-friendly use. Open-source implementations of both encoders and decoders are available that can run in real time on normal processors, such as the Intel i7. To guarantee low decoding and coding latency, the right profile needs to be selected (Baseline) and encoder tuning needs to be performed. Various options for achieving error resilience are available that give reasonable quality at loss rates up to approximately 20%. H.264 is recommended for video streaming in REVERIE.
2.4.3. 3D Stereo Video
Generally speaking, there are two types of 3D video. The first is stereo video, which uses two images to create depth perception. The second refers to the ability to choose the viewpoint in the scene and is called free viewpoint video. Both technologies can also be combined (stereo + free viewpoint) to enable both effects. With 3D stereo video, each eye views its particular image and the 3D appearance is constructed in the brain by merging the images from the two eyes. So the challenge of a 3D stereo video display is to make sure that each eye sees its own image. Traditionally, special glasses with red/green filters are used to separate the two images. Moreover, novel auto-stereoscopic displays using lenticular screens can enable 3D stereo video viewing without glasses. In the case of normal stereo video, a viewer who moves while watching does not get a different view of the scene. A simple way to provide this functionality is to store video plus depth and to calculate the two views by interpolation. So 3D stereo video is stored either as a combination of left and right video using conventional video formats (MPEG-2/H.264) or as a video plus a depth map (which can be a monochromatic image of the depth values). Compression of conventional stereo video is generally performed by compressing the first video and applying inter-view prediction to the second video. The difference between the videos generated by two different cameras is approximately fixed by their properties and geometrical setup.
The displacement between the two views can be represented by a dense motion vector map called a disparity map. Subsequently, only the difference between the estimate and the real signal has to be transmitted in addition. This principle was standardized in MPEG-2, and similar representations are available in H.263, MPEG-4 and H.264. The general approach is illustrated in Figure 44: the left view is independently coded while the right view uses inter-view prediction. Note: generally, subsequent frames are more similar than stereo frames, and most of the additional compression is achieved when coding I-frames. Figure 44: Inter-view coding for conventional stereo video (Smolic, 2011) An alternative to transmitting both the left and right frames is to transmit one central view together with a depth map indicating the depth of the different objects in this image. The stereo pair can then be generated at the receiver side. In the ATTEST project (Fehn et al. 2002), the compression efficiency of depth data in combination with state-of-the-art codecs was tested. The results showed that depth data can be compressed to about 10-20% of the original colour data. Subsequently, a backwards compatible bit-stream format for distributing video and depth was developed and added to MPEG-2 and H.264. The main disadvantage of video plus depth is the generation of the depth map; however, in REVERIE we can use sensors like Kinect to generate depth maps. In REVERIE we recommend video plus depth for 3D video. Video plus depth is more efficient in terms of compression than conventional stereo and allows some change in viewing direction. The difficulty of obtaining depth maps should be tackled by the partners working on the capturing part (WP4). 2.4.4. Free viewpoint / Multi-view video This section is largely based on the more extensive overview of the topic presented in Smolic et al. (2008a). In multi-view coding, the user can choose from which angle/viewpoint they want to view the scene. This technology has links to computer graphics, where multiple viewpoints of 3D models/scenes are often supported. The way the scene is represented is important, as it determines the capturing setup needed, the storage format and the rendering methods available. According to Smolic et al. (2008a), 3D scenes can be represented as a trade-off between two extremes: image based and geometry based. Smolic et al. (2008a) include a figure categorizing the existing applications on a scale from image based to geometry based. Geometry based generally refers to the rendering of a 3D model from a particular viewpoint, including texture and other enhancements. Mesh representations are an example of a 3D model that is often employed. Image based generally refers to interpolating multiple camera views to obtain the desired view; this method has no knowledge of the 3D geometry of objects. In this section we briefly look at image-based representations and the compression aspects relevant to REVERIE. In response to a call for proposals to support multi-view in H.264, a design based on inter-view prediction by the Heinrich Hertz Institute in Berlin was selected. Results of testing using PSNR (Peak Signal-to-Noise Ratio) showed gains of up to 2 dB compared to coding all videos independently (Merkle et al. 2006). The design was added as an amendment to the H.264 standard. Figure 45 shows some of the prediction relations: by assuming one stream is known, other correlated streams can achieve some extra compression through inter-view prediction.
Figure 45: Inter-view prediction in Multi-view Video Coding (MVC) (Smolic, 2011) 2.4.5. 3D Mesh Compression In Smolic et al. (2008c) an overview is given of the compression of static and dynamic meshes; most of the content in this section is based on that book section. A mesh M is a data structure M(C,G) that consists of a connectivity part C and a geometry part G and can represent any 3D shape in a scene. The data structure basically represents 2D surfaces in 3D space spanned between edges and vertices. Mapped to 2D space, these surfaces are faces in a graph structure of edges and vertices. Often these faces are restricted to triangles. The vertices, edges, faces and an incidence relationship represent the connectivity of the mesh. The geometry of the mesh contains the mapping of the vertices, edges and faces/triangles back to 3D. It is clear that a scene based on a mesh data structure is very different from a scene consisting of interpolated multi-view images. Connectivity C(V,E,F,R): a graph structure with only triangle faces. Geometry G (mapping): the mapping of vertices from C to points in 3D geometric space, and possibly line segments. Vertices can also carry colour or texture that can be rendered on the face surface. For static meshes, Smolic et al. (2008c) present two types of coders: Single-rate coders: The connectivity graph is encoded and decoded as one stream, as a sequential strip of triangles, using either a spanning tree or region-growing methods. The geometry (mapping of vertices in space) uses previously decoded vertices and the connectivity graph to predict the coordinates of the next vertices and thus obtain good compression rates (a toy sketch of this kind of geometry prediction is given below). Progressive coders: This approach can be compared to scalable video coding. By collapsing edges or removing vertices, a simpler mesh is represented first as a base layer, and refinement is then added in each step. To decide which edges/vertices to collapse, different constraint metrics are taken into account, such as volume preservation or boundary preservation. Often, when the decoder knows how the encoder works, the mesh can be reconstructed completely in progressive steps. Other lossy progressive techniques include multi-resolution approaches; such techniques send a coarser (lower resolution) mesh first that can be upgraded in later steps but do not reconstruct the original mesh exactly. In the case of dynamic meshes, it is not necessary to send a new full mesh at every time instant. With most objects the connectivity of the mesh will remain the same; only the 3D coordinates of the vertices will change. The geometry can thus be modelled by a static and a dynamic part: the static part represents the mesh structure and the dynamic part the motion parameters of the individual meshes. Another way to save even more is to use key meshes (represented by the original mesh and displacements) together with interpolated meshes. Some known algorithms/standards for 3D dynamic mesh compression are MPEG-4 AFX, Dynapack, D3DMC and RD3DMC. They all contain prediction and sometimes clustering techniques (to detect whether parts of the mesh can be modelled as a rigid body). For REVERIE, depending on the chosen scene representation, mesh-based models can be very efficient from a compression point of view. However, the interpolated/predicted movement might sometimes look unnatural.
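To make the single-rate geometry coding described above concrete, the following toy sketch (not any of the cited codecs) quantises the vertex positions of a triangle strip and predicts each new vertex with the parallelogram rule, so that only small residuals need to be stored; the strip traversal, the quantisation step size and the absence of entropy coding are simplifying assumptions.

```python
"""Toy sketch of single-rate mesh geometry coding with parallelogram
prediction, in the spirit of the coders surveyed by Smolic et al. (2008c).
Illustrative only: real coders traverse the decoded connectivity graph
and entropy-code the residuals; here a simple triangle strip and plain
residual lists stand in for both."""

import numpy as np

QUANT = 1024  # quantisation steps per unit length (assumed step size)

def quantise(vertices):
    """Map float coordinates to an integer grid (the lossy step)."""
    return np.round(np.asarray(vertices, dtype=float) * QUANT).astype(np.int64)

def encode_strip(qverts):
    """Encode a quantised triangle strip: the first three vertices are sent
    as-is; every further vertex v[i] is predicted from the preceding
    triangle by the parallelogram rule p = v[i-1] + v[i-2] - v[i-3],
    and only the residual v[i] - p is stored."""
    residuals = [qverts[i].copy() for i in range(min(3, len(qverts)))]
    for i in range(3, len(qverts)):
        pred = qverts[i - 1] + qverts[i - 2] - qverts[i - 3]
        residuals.append(qverts[i] - pred)  # small residuals compress well
    return residuals

def decode_strip(residuals):
    """Invert encode_strip: rebuild vertices by adding each residual to the
    parallelogram prediction."""
    verts = [residuals[i].copy() for i in range(min(3, len(residuals)))]
    for i in range(3, len(residuals)):
        pred = verts[i - 1] + verts[i - 2] - verts[i - 3]
        verts.append(pred + residuals[i])
    return np.stack(verts)

if __name__ == "__main__":
    strip = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 1.0, 0.0),
             (1.5, 1.0, 0.1), (1.0, 2.0, 0.1)]
    q = quantise(strip)
    # Lossless reconstruction after the quantisation step.
    assert np.array_equal(decode_strip(encode_strip(q)), q)
```

In a full coder the residuals would subsequently be entropy coded and the traversal order would be derived from the decoded connectivity rather than from a pre-arranged strip.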
Depending on the achievements of the capturing work in WP4 (can we capture a human as a mesh to start with?), we can use a mesh-based representation in use case 1 or use case 2. Also, there is strong rendering and decompression support from graphics cards, possibly allowing even the rendering of the human face. Moreover, regarding WP5, as most compression techniques have been “optimized” for the graphics pipeline, interesting research could arise from studying their properties when transmitted over the network. 2.4.6. Additional requirements for 3DTI streaming In addition to the basic requirements for general streaming, there are some additional challenges for 3DTI streaming: Handling multiple streams from the same site: o Streaming based on user view (Yang et al. 2009) o Complexity of the 3D compression algorithm o Intra-synchronization among streams from one site (Ott and Mayer-Patel, 2004; Huang et al. 2011) o Inter-synchronization among streams from multiple sites (Ott and Mayer-Patel, 2004) Bandwidth competition between related streams from different senders: Related streams from different cameras will compete for bandwidth in the network. This can lead to retransmissions and extra latency; this effect is described in Ott and Mayer-Patel (2004). In their approach, gateway routers maintain connection state, and additional routers and packets carry an additional header to allow the gateway to do this. As a result the application can adapt to the state of the overall connection. Inter- and intra-synchronization of multiple streams: In Huang et al. (2011) an entire architecture is proposed for a 3DTI system that keeps the skew between streams below a certain value. It handles inter-destination synchronization and the synchronization between different senders. Also, by using an overlay, network latency and bandwidth are kept within constraints. Previous research thus reveals some new issues in media streaming for 3DTI, and in REVERIE we are likely to see new and more complex versions of these problems by streaming more heterogeneous data, such as a composition model, synchronization of different media, point clouds, 2D movies and pictures, silhouettes/skeletons from Kinect data, sensor data and 3D movies. Therefore in REVERIE, more complicated synchronization and user issues will arise that will have to be dealt with. These issues may lead to novel protocols to support the future internet. 2.5. Signaling for 3DTI real-time transmission Signaling mechanisms allow for the setup and tear-down of sessions, similar to traditional phone calling mechanisms. In the telecom (ITU) and internet (IETF) worlds, many standards have been developed for universal session handling. Well known are SIP and H.323 for telephony and video conferencing applications. As the end goal in REVERIE is to enable immersive communication between (groups of) participants, session initiation is of utmost importance. It should be easy to set up reliable sessions with other participants, the friends, teachers or co-workers we want to share an experience with; a minimal sketch of such a session lifecycle is given after the tool description below. Available tools from the literature and REVERIE partners Ambulant SMIL player Description: CWI developed and maintains the open source SMIL player AMBULANT. When SMIL is employed, this software can be used/enhanced to provide specific signaling functionality.
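The following minimal sketch illustrates the kind of session lifecycle (join, acknowledge, leave) such signaling has to support. The message names JOIN/ACK/LEAVE and the Session object are hypothetical assumptions made for this illustration; they correspond neither to SIP nor to any agreed REVERIE design.

```python
"""Hypothetical sketch of session setup/teardown signaling for a shared
REVERIE session. Message names (JOIN, ACK, LEAVE) and the in-memory
Session object are illustrative assumptions, not SIP or an agreed
REVERIE protocol."""

from dataclasses import dataclass, field

@dataclass
class Session:
    session_id: str
    participants: set = field(default_factory=set)

    def handle(self, message: dict) -> dict:
        """Process one signaling message and return the reply."""
        kind, who = message["type"], message["from"]
        if kind == "JOIN":
            self.participants.add(who)
            # The reply lists current members so the newcomer can set up media streams.
            return {"type": "ACK", "members": sorted(self.participants)}
        if kind == "LEAVE":
            self.participants.discard(who)
            return {"type": "ACK", "members": sorted(self.participants)}
        return {"type": "ERROR", "reason": f"unknown message {kind!r}"}

if __name__ == "__main__":
    s = Session("reverie-classroom-1")
    print(s.handle({"type": "JOIN", "from": "alice"}))
    print(s.handle({"type": "JOIN", "from": "bob"}))
    print(s.handle({"type": "LEAVE", "from": "alice"}))
```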
2.5.1. Internet Protocols Signalling functions in previous distributed multimedia applications are often based on IETF-standardized protocols such as RTP/RTCP, SIP, RTSP and SDP. This is necessary for compatibility with internet infrastructure such as the IP Multimedia Subsystem (IMS). The IMS is a telecommunication architecture that integrates the internet with fixed and mobile networks. It offers, for example, interoperability between PSTN phones, mobile phones (GSM and up) and the internet. Even though most operators do not fully implement the architecture, they use some of its components to achieve interoperability. While currently the IMS supports traditional applications such as telephony (VoIP) and television (IPTV), it is designed for complex multimedia applications such as video conferencing and perhaps tele-immersion. IMS uses internet protocols for signalling, such as SIP and SDP. In the next section, we will discuss the MPEG-4 signalling framework, which was developed for advanced multimedia applications with multiple media streams/objects and 3D capabilities. From their six years of experience with 3DTI systems, Nahrstedt et al. (2011b) conclude that these protocols do not take the requirements of 3DTI into account very well, and they present their experiences/solutions. They developed solutions for five signalling issues: session initiation, multi-stream topology, lightweight session monitoring, session adaptation and session teardown. We will summarize each of these issues, except topology. Session initiation: a) Registration of various devices and resources (in REVERIE this can be 3D/2D renderers, 3DTVs and so on). b) Construction of an initial content dissemination topology based on the overlay network formed between service gateways (similar to path selection for two-way communications). In REVERIE we will be mostly concerned with a), as we will have scalable terminals that allow for different devices. SIP-like messages such as camera_join, display_join and gateway_join will need to be compatible with non-network devices connected directly to the gateway (USB, FireWire, or PCI). Session monitoring: Nahrstedt et al. (2011b) developed Q-Tree as a solution for detecting component failures and monitoring the status of metadata in the network. It builds a management overlay for signalling traffic based upon low Round Trip Times (RTT) (minimum latency). Designing the overlay is already complex for a modest number of nodes (approximately 10), as the construction of a degree-bounded minimum spanning tree is an NP-hard problem (k-MST). However, they develop a suitable heuristic for finding an appropriate overlay. After the overlay is constructed, high-level semantic queries can be compiled by the query engine and sent into the overlay. The novelty compared to other systems is the support for range queries on multiple attributes (e.g. bandwidth is between 10 and 20, CPU utilization is greater than 60%, etc.; a toy illustration is sketched below). When a node issues a query, the query is routed to the hierarchically distributed nodes that have this range assigned and contain the requested metadata item. This metadata item is stored in the content store of the node and can be either local metadata or metadata from a remote node. To track metadata changes, the hierarchically organized nodes are periodically updated. In the case of highly variable metadata such as loss, jitter, bandwidth or CPU utilization, a high frequency of updates would be required.
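To make the multi-attribute range query idea concrete, the sketch below simply filters a flat table of node metadata. The attribute names and values are invented, and the flat evaluation (no hierarchical overlay routing, no periodic updates) is a simplifying assumption rather than a reproduction of the Q-Tree design.

```python
"""Toy illustration of a multi-attribute range query over node metadata
(bandwidth between 10 and 20 Mbit/s, CPU utilisation above 60%). A flat
dictionary stands in for Q-Tree's hierarchical overlay; attribute names
and values are made up for the example."""

node_metadata = {
    "gateway-A": {"bandwidth_mbit": 12.0, "cpu_percent": 75.0, "loss_percent": 0.5},
    "gateway-B": {"bandwidth_mbit": 35.0, "cpu_percent": 40.0, "loss_percent": 0.1},
    "renderer-C": {"bandwidth_mbit": 18.5, "cpu_percent": 62.0, "loss_percent": 2.0},
}

def range_query(metadata, constraints):
    """Return the nodes whose metadata satisfies every (attribute, lo, hi)
    constraint; lo or hi may be None for a one-sided range."""
    matches = []
    for node, attrs in metadata.items():
        ok = True
        for attr, lo, hi in constraints:
            value = attrs.get(attr)
            if value is None or (lo is not None and value < lo) or \
               (hi is not None and value > hi):
                ok = False
                break
        if ok:
            matches.append(node)
    return matches

if __name__ == "__main__":
    query = [("bandwidth_mbit", 10, 20), ("cpu_percent", 60, None)]
    print(range_query(node_metadata, query))  # ['gateway-A', 'renderer-C']
```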
For such highly variable metadata, the authors indicate that a simple multicast seems better suited to handling range queries (ask each node by multicast; the right one will reply with its metadata). Moreover, the approach assumes uniformly distributed metadata, or that the distribution of the metadata is known, in order to implement the hierarchical query overlay; in many practical cases this assumption does not hold. Session adaptation: Approaches that adapt to the user view and interest are presented in Yang et al. (2009) and Wu et al. (2009). Session teardown: As one terminal consists of multiple devices, the devices should also be able to leave the session gracefully, i.e. a camera_leave message should be specified. Gateways or terminals should be able to leave using a gateway_leave message. For REVERIE it seems we can use less sophisticated approaches, as we simplified the traffic load in use case 1 and reduced the number of nodes in use case 2 to three. Therefore a more reduced scheme is useful. Also, in the current state of the art, the viewpoint of the users matters and determines which streams in the network are given priority and which are not. View control should therefore be incorporated in the signalling mechanism. 2.5.2. MPEG-4 transmission framework A framework that supports signalling and transmission is MPEG-4. MPEG-4 is an extensive multimedia framework that consists of various tools for multimedia applications. MPEG-4 part-1 (systems) provides a way to link media objects and achieve synchronization. MPEG-4 part-6, on the other hand, provides a framework for monitoring and signalling MPEG-4 sessions: the Delivery Multimedia Integration Framework (DMIF). DMIF offers its services to the application through the DMIF Application Interface (DAI) and specifically provides a DMIF network interface for communications over IP. DMIF has primitives for starting sessions and services, monitoring sessions, and performing user commands. In principle these services are suitable for use in an IP-based 3DTI system. However, most of its functionality over IP is implemented by making use of internet signalling protocols such as RTSP, SIP, SDP and RTP/RTCP, which can be used directly instead. Nevertheless, as other parts of MPEG-4 support 3D scene description with spatial composition (MPEG-4 part-11 BIFS), 3D and 2D mesh coding schemes (part-16), high-level video compression (part-10 MPEG-4 AVC/H.264), and audio and visual coding standards (part-2 and part-3), an MPEG-4 interface could still prove useful to REVERIE. 2.5.3. SMIL multimedia presentation SMIL is a multimedia presentation language that can be used to author and describe multimedia presentations, including the synchronization of different objects and the maintenance of state. In November 2008, version 3.0 of this XML-based description language was released by the World Wide Web Consortium (W3C). It can, simplistically, be seen as the multimedia equivalent of HTML: it allows spatial composition of media objects, as HTML allows spatial composition of text, links and images. However, as multimedia is more time dependent than text, SMIL also offers many timing and synchronization features; a small illustration of this timing composition is given below. A reference to its specification of elements and attributes is given in Bulterman and Ruthledge (2008). Playback of SMIL documents requires a specific player that can be embedded in the web browser. CWI developed and actively maintains the open source SMIL player AMBULANT, which supports most SMIL functionality and runs on most platforms/web browsers.
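As a small illustration of SMIL's timing composition (hand-written, not produced by AMBULANT or any REVERIE component), the following sketch assembles a SMIL fragment in which an audio track plays in parallel with a sequence of two images; the media file names and durations are invented for the example.

```python
"""Build a minimal SMIL document: an audio clip in parallel with a
sequence of two images. Only the <par>, <seq>, <audio> and <img> elements
of the SMIL timing/media modules are used; file names and durations are
invented."""

import xml.etree.ElementTree as ET

def build_smil():
    # Default namespace as used by SMIL 3.0 (for illustration only).
    smil = ET.Element("smil", xmlns="http://www.w3.org/ns/SMIL")
    body = ET.SubElement(smil, "body")
    par = ET.SubElement(body, "par")                 # children start together
    ET.SubElement(par, "audio", src="narration.mp3")
    seq = ET.SubElement(par, "seq")                  # children play one after another
    ET.SubElement(seq, "img", src="slide1.png", dur="5s")
    ET.SubElement(seq, "img", src="slide2.png", dur="5s")
    return ET.tostring(smil, encoding="unicode")

if __name__ == "__main__":
    print(build_smil())
```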
Therefore in REVERIE, specifically for use case 1, ambulant can be a web based renderer that can compose and synchronize various media formats based on SMIL. While SMIL is not widely deployed, its features seem to be getting adopted in newer web standards, HTML 5 and MPEG-DASH. 2.6. MPEG-V Framework The MPEG-V (ISO/IEC 23005) media context and control is an official standard that intends to bridge the differences in existing and emerging virtual worlds while integrating existing and emerging (media) technologies like instant messaging, video, 3D, VR, AI, chat, and voice. The standard was defined as an integral part and deliverable of the ITEA2 Metaverse1 project. Work is already advancing on a second version of the standard to extend its application domains. There is a lot of interest, for instance in biosensors; measuring vital body parameters and using them as inputs for either games or lifestyle-related applications. Next to that, Gas and Dust, Gaze Tracking, Smart Cameras, Attributed Coordinate, Multi-Pointing, Wind and Path Finding sensors are under consideration. The official standard consists of the following parts. 2.6.1. Architecture The system architecture of the MPEG-V framework is depicted in the next figure. For more information, please refer to the MPEG-V documentation. 2.6.2. Control information This part specifies syntax and semantics required to provide interoperability in controlling devices in real as well as virtual worlds. The adaptation engine (RV or VR engine), which is not within the scope of standardization, takes five inputs (1) Sensory Effects (SE), (2) User’s Sensory Effect Preferences (USEP), (3) Sensory Devices Capabilities (SDC), (4) sensor capability, and (5) Sensed Information (SI) and outputs sensory devices commands and/or SI to control the devices in real worlds or a virtual worlds' object. The scope of this part covers the interfaces between the adaptation engine and the capability descriptions of actuators/sensors in the real world and the user’s sensory preference information, which characterizes devices and users, so that appropriate information to control devices (actuators and sensors) can be generated. In other words, user’s sensory preferences, sensory device capabilities, and sensor capabilities are within the scope of this part. For more information, please refer to the MPEG-V documentation. 72 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Figure 41: System Architecture of the MPEG-V Framework 2.6.3. Sensory information Sensory information which is part of standardization area B specifies syntax and semantics of description schemes and descriptors that represent sensory information. Also haptic, tactile, and emotion information fit in this part. The concept of receiving sensory effects in addition to audio/visual content is depicted in the next figure. Figure 42: Concept of MPEG-V Sensory Effect Description Language The Sensory Effect Description Language (SEDL) is an XML Schema-based language which enables one to describe so-called sensory effects such as light, wind, fog, vibration, etc. that trigger human senses. The actual sensory effects are not part of SEDL but defined within the Sensory Effect Vocabulary (SEV) for extensibility and flexibility allowing each application domain to define its own sensory effects. A description conforming to SEDL is referred to as Sensory Effect Metadata (SEM) and may be associated to any kind of multimedia content (e.g., movies, music, Web sites, games). 
73 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools The SEM is used to steer sensory devices like fans, vibration chairs, lamps, etc. via an appropriate mediation device in order to increase the experience of the user. That is, in addition to the audiovisual content of, e.g. a movie, the user will also perceive other effects such as the ones described above, giving her/him the sensation of being part of the particular media which shall result in a worthwhile, informative user experience. For more information, please refer to the MPEG-V documentation. 2.6.4. Virtual world object characteristics This 4th part specifies syntax and semantics of description schemes and descriptors used to characterize a virtual world object related metadata, making it possible to migrate a virtual world object (or only its characteristics) from one virtual world to another and/or control a virtual world object in a virtual world by real word devices. For more information, please refer to the MPEG-V documentation. 2.6.5. Data formats for interaction devices This part specifies the syntax and semantics of the data formats for interaction devices, i.e., Device Commands and SI, required for providing interoperability in controlling interaction devices and in sensing information from interaction devices in real and virtual worlds. For more information, please refer to the MPEG-V documentation. 2.6.6. Common types and tools Part 6 specifies the syntax and semantics of the data types and tools common to the tools defined in other parts of MPEG-V. To be specific, data types which are used as basic building blocks in more than one tool of MPEG-V, for example; color-related basic types and time stamp types which can be used in device commands and sensed information to specify timing. For more information, please refer to the MPEG-V documentation. 2.6.7. Conformance and reference software For more information, please refer to the MPEG-V documentation. Usefulness to the project Scene composition and visualization (WP7) will depend heavily on acquired sensory information (WP4) and user interaction and autonomy (WP6). The MPEG-V standard on the other hand defines a standard interface and architecture for various kinds of sensory information (maybe of interest to WP4), data formats for interaction devices (maybe of interest to WP6), and virtual world object characteristics (may be of interest to WP7). In defining the interface between WP4, WP6 and WP7, the use of the MPEG-V (ISO/IEC 23005) standard should be taken into account. Furthermore, a comparison between the standardized MPEG-V architecture and the proposed FP7 REVERIE architecture may be required. In the event that the MPEG-V standard would be of importance or interest to the partnership, the standard could be adopted and updates or improvements to the standard may be submitted to the ISO/IEC subcommittee of the related joint technical committee of the related workgroup. 74 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools 3. Related Tools to WP6: Interaction and autonomy 3.1. 3D Avatar Authoring Tools 3D virtual characters are basically created in the same way as the virtual environments they are intended to populate are brought to life. The same techniques are used in order to hand-model (or sometimes capture and reconstruct) a polygon mesh representing the static appearance of the character. The latter may be enhanced for increased detail, by applying advanced texture mapping techniques (such as normal maps). 
This static appearance is usually complemented by a rig, a set of control structures used to control the mesh vertices and apply animation to the character. A rig’s most powerful advantage is that it reduces the overall complexity of the character mesh (consisting of hundreds of vertices) into a limited set of degrees of freedom, representing the realistic movement of the human body. Skeletal rigs are the most common type of body animation rigs used by modellers and animators, as they provide an abstraction to the human skeleton. The mesh control structures are represented as bones, connected to one another using rotational joints, ultimately forming a hierarchy. Animation is applied by rotating specific joints, applying rotational effects to any possible children joints further down the hierarchy, therefore roughly simulating actual skeletal movement. Animation of the bones is applied to the vertices of the mesh using a technique called smooth skinning. This process consists of applying several weighted influences on each vertex, signifying which, and how much bones contribute to its displacement from the static appearance. Facial animation is usually handled by separate facial rigging techniques, which include the traditional bone structure (with the exception being that facial bones are mainly interconnected by translational joints, and are not an abstraction to any actual skeletal structure of the face), and the more popular Morph Targets or Blend Shapes (Parke, 1972). This latter technique consists of actually displacing vertices on the facial mesh to new locations, forming several expressions. These actions are stored as shape keys and are blended together to produce new facial expressions. A third method that has been proposed for extreme accuracy but not yet used in practice due to complexity, is the physically realistic modelling of the human facial muscles (Terzopoulos and Waters, 1990). Figure 43: Polygon Mesh and Rig combined to produce the skinned model in a new pose As the creation of 3D virtual characters is at the forefront of many multi-billion dollar industries including film and gaming, many commercial, as well as open source tools have been developed in order to speedily produce realistic humanoid meshes and rigs for integration onto graphics engines. Most of these tools, as well as popular 3D modelling software come with pre-implemented tools for exporting meshes and rigs to a usable file format, storing information on the mesh vertices, normals and texture coordinates, as well as the rig's joint hierarchy and rotational constraints. Popular 75 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools export formats include the Wavefront OBJ format (which only supports mesh export), the COLLADA (.dae), the 3D Studio (.3ds) and the DirectX (.x) formats for both mesh and rig export. Graphics engines developers have also been known to implement their own formats, best suiting the needs of their engines. Usefulness to the project As REVERIE aims to provide users with the tools to create their own virtual avatars, the use of 3D avatar authoring tools is mandatory. Users should be able to customize their character to their liking and should not be bothered with the polygon mesh modelling, rigging and skinning processes, which should be handled automatically by the authoring tool. Available tools from the literature and REVERIE partners MakeHuman Description: Open Source http://www.makehuman.org/ tool for creating 3D human characters. 
PeoplePutty Description: People Putty is a commercial tool that allows users to create interactive 3D characters and dress them up. These characters can then be used to act as tour guides for personal web pages. http://www.haptek.com/ Dependencies on other technology: Haptek Player Digimi Avatar Studio Description: Digimi tools allow users to create and personalize a realistic 3D avatar from a single face image. The avatars which are created can be deployed in virtual worlds, crossplatform games, social networks, mobile applications and animation tools. It is an easy-to-use platform for generating personalized Avatars delivered in 3D Flash format, which is compatible with all web rich-media, social applications and games. http://www.digimi.com/newsite/presite/home.jsp ICT Virtual Human Toolkit Description: A collection of modules, tools and libraries that allows users, authors and developers to create their own virtual humans. http://vhtoolkit.ict.usc.edu/index.php/Main_Page Dependencies on other technology: AcquireSpeech; Watson; NPCEditor; Non-verbal Behaviour Generator (NVBG); SmartBody. Evolver Avatar Engine Description: A free avatar creation engine, which allows users to quickly build an avatar and make it available on dozens of online destinations such as movies in social media, virtual worlds or massively multiplayer online games. Autodesk MotionBuilder Description: Autodesk MotionBuilder is a real-time 3D character animation software, which is particularly useful for motion-capture data. http://usa.autodesk.com/adsk/servlet/pc/index?id=13581855&siteID=123112 Dependencies on other technology: Content-creation packages such as Autodesk Maya / 3ds Max 76 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Image Metrics' PortableYou Description: The PortableYou platform is a suite of web services that enables a developer to create applications with advanced avatar creation, customization, and projection functionality. With a PortableYou enabled application, end-users can instantly generate customizable 3D avatars from photos of their face and carry these avatars across other enabled third-party applications. http://www.image-metrics.com/Portable-You/PortableYou Digital Art Zone's DAZ Studio 4 Pro Description: DAZ Studio is a feature rich 3D figure customization, posing, and animation tool that enables the creation of stunning digital illustrations and animations. DAZ Studio is the perfect tool to design unique digital art and animations using virtual people, animals, props, vehicles, accessories, environments and more. Simply select your subject and/or setting, arrange accessories, setup lighting, and begin creating beautiful artwork. http://www.daz3d.com/i/products/daz_studio 3.2. Animation Engine Building an Embodied Conversational Agent (ECA) system needs the involvement of many research disciplines. Issues like speech recognition, motion capture, dialog management, or animation rendering require different skills from their designers. Soon it became obvious that there was the need to share expertise and to exchange the components of an ECA system. SAIBA (Vilhjálmsson et al. 2007) is an international research initiative whose main aim is to define a standard framework for the generation of virtual agent behaviour. It defines a number of levels of abstraction, from the computation of the agent’s communicative intention, to behaviour planning and realization, as shown in Figure 49. 
Figure 44: SAIBA architecture The Intent Planner module decides the agent’s current goals, emotional state and beliefs, and encodes them into the Function Markup Language (FML) (Heylen et al. 2008). To convey the agent’s communicative intentions, the Behaviour Planner module schedules a number of communicative signals with the Behaviour Markup Language (BML). It specifies the verbal and non-verbal behaviours of ECAs [Vilhjálmsson et al. 2007]. Finally the task of the third element of the SAIBA framework, Behaviour Realizer, is to realize the behaviours scheduled by the Behaviour Planner. It receives input in the BML format and it generates the animation. A feedback system is needed in order to inform the modules of the SAIBA about the current state of the generated animation. This information is used, for example, by Intent Planner to re-plan the agent’s intentions when an interruption occurs. 77 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools There exist several implementations of the SAIBA standard. SmartBody [Thiébaux et al. 2008] is an example of the Behaviour Realizer. It takes as input BML code (including speech timing data and the world status updates), and it composes multiple behaviours and generates the character animation synchronized with audio. For this purpose it uses an extended version of BML, allowing one to define interruptions and predefined animations. SmartBody is based on the notion of animation controllers. The controllers are organized in a hierarchical structure. Ordinary controllers manage the separate channels, e.g. pose or gaze. Then the meta-controllers manipulate the behaviours of subordinate controllers allowing the synchronization of the different modalities to generate consistent output from the BML code. SmartBody can be used with the NVBG that corresponds to the Behaviour Planner in the SAIBA framework. It is a rule-based module that generates BML annotations for non-verbal behaviours from the communicative intent and speech text. SmartBody can be used with different characters, skeletons and different rendering engines. Heloir and Kipp (2009) extend the SAIBA architecture by a new intermediate layer called the animation layer. Their EMBR agent is a real-time character animation engine developed by the Embodied Agents Research Group that offers a high degree of animation control through the EMBRScript language. This language permits control over skeletal animations, morph target animations, shader effects (e.g. blushing) and other autonomous behaviours. Any animation in EMBRScript is defined as a set of key poses. Each key pose describes the state of the character at a specific point in time. Thus the animation layer gives access to animation parameters related to the motion generation procedures. It also gives to the ECA developer, the possibility to better control the process of the animation generation without constraining him to enter into the implementation details. Elckerlyc (van Welbergen et al. 2010) is a modular and extensible Behaviour Realizer following the SAIBA framework. It takes as input a specification of verbal and non-verbal behaviours encoded with extended BML and can eventually give feedback concerning the execution of a particular behaviour. Elckerlyc is able to re-schedule behaviours that are already queued with behaviours coming from a new BML block in real-time, while maintaining the synchronization of multimodal behaviours. 
It receives and processes a sequence of BML blocks continuously allowing the agent to respond to the unpredictability of the environment or of the conversational partner. Elckerlyc is also able to combine different approaches to animation generation to make agent motion more humanlike. It uses both procedural animation and physical simulation to calculate temporal and spatial information of motion. While the physical simulation controller provides physical realism of motion, procedural animation allows for the precise realization of the specific gestures. BMLRealizer (Arnason and Porsteinsson, 2008) created in the CADIA lab is another implementation of the Behaviour Realizer layer of the SAIBA framework. It is an open source animation toolkit for visualizing virtual characters in a 3D environment that is partially based on the SmartBody framework. As input it also uses BML; the output is generated with the use of the Panda3D rendering engine. RealActor (Cerekovic et al. 2009) is another BML Realizer developed recently. It is able to generate the animation containing verbal content that is complemented by a rich set of non-verbal behaviours. It uses the algorithm, based on neural networks, to estimate the duration of words. Consequently it can generate the correct lip movement without explicit information about the phonemes (or visemes). RealActor was integrated in various open-source 3D engines (e.g. Ogre, HORDE3D). The Greta architecture is SAIBA compliant (Niewiadomski et al., 2011). It allows for creating cross-media and multi-embodiment agents (see Figure 50). It proposes a hierarchical organization for the SAIBA Behaviour Realizer. It also introduces different levels of customization for the agent. Different instantiations of the agent can share Behaviour Planning and Realization; only the animation computation and rendering display may need to be tailored. That is, the Greta system is 78 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools able to display the same communicative intention with different media (AR, VR, web, etc.), representations (2D, 3D) and/or embodiments (robots, virtual and web Flash-based agents). Figure 45: Greta architecture Usefulness to the project The SAIBA framework can be very useful to the REVERIE project. Its modularity and decomposition into several steps allow us to integrate various modules at different levels of the SAIBA framework. Virtual agents can be driven by setting their emotions and communicative intentions using EmotionML and FML-like languages (e.g. APML-FML). Identically, when necessary, they can be driven by specifying their behaviours using BML. Integration within larger frameworks, such as SEMAINE, can be ensured by using either FML or BML languages. Several virtual agent platforms such as Elckerlyc, Greta or SmartBody, etc. follow this standard. Available tools from the literature and REVERIE partners EMBR Description: Free, real-time animation engine for embodied agents that offers a high degree of animation control via the EMBRScript language. Animations are described either by prerecorded animations or by sequences of key poses influencing parts of a character. A key pose may specify a sub-skeleton configuration, a shader parameter value, a set of kinematic constraints, or a combination of morph targets. http://embots.dfki.de/EMBR/ FACEWARE Description: FACEWARE is a commercial performance-driven animation technology for the film and games industry. 
Developed over 10 years of animation production experience, FACEWARE utilizes a marker-less video analysis technology and artist-driven performance transfer toolset to deliver ultra-high fidelity, highly efficient, truly believable facial animation in a fraction of the time of more traditional methods. This software has been used internally by the Image Metrics 79 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools production team to produce thousands of minutes of facial animation as a service and has been an integral tool on many award winning facial animation projects. http://www.image-metrics.com/Faceware-Software/Overview Interactive Face Animation – Comprehensive Environment (iFACE) Description: iFace is a free Face Multimedia Object framework – a framework for all the functionality and data related to facial actions. http://img.csit.carleton.ca/iface/ Dependencies on other technology: DirectX v9.0c; .NET framework. SmartBody Description: SmartBody is a modular, controller-based character animation system. SmartBody uses BML to describe a character’s performance, and composes multiple behaviours into a single coherent animation, synchronizing skeletal animation, spoken audio, and other control channels. http://sourceforge.net/projects/smartbody/ FaceFX Description: FaceFX is a cutting edge solution for creating realistic facial animation from audio files. http://facefx.com Dependencies on other technology: Used in game engine pipelines (possibly in Unity, etc.) faceshift Description: faceshift is a new technology for real-time markerless facial performance capture and animation. The software automatically produces highly detailed facial animations based on FACS expressions from depth cameras such as Microsoft’s Kinect. faceshift works seamlessly for fast facial expressions, head motions, and difficult environments. http://www.faceshift.com/faceshift.html CharToon Description: Tool for authoring and real-time rendering of 2D (cartoon-like) graphics models. It has a 3-tier architecture (Graphics, Control, Choreography) and is designed for conversational agents that feature Lip-sync, Emotions, Gestures and can accept GESTYLE markup for nonverbal communication. http://www.cwi.nl/projects/FASE/CharToon Dependencies on other technology: SVG Elckerlyc Description: Elckerlyc is a BML compliant Behaviour Realizer for generating multimodal verbal and non-verbal behaviour for virtual humans. It follows the SAIBA framework. It supports animation of real-time continuous interaction. http://elckerlyc.ewi.utwente.nl/ MPEG4 H-Anim Description: MPEG-4 (ISO-IEC standard) adopts a VRML based standard for virtual human representation (H-Anim), and provides an efficient way to animate virtual human bodies. MPEG-4 standardizes the definition of the shape and surface of a model and anatomic deformations. By transfer of animation parameters it is an efficient and flexible way to animate virtual humans. 80 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools http://h-anim.org/ 3.3. Autonomous Agents An intelligent avatar can appear in many different forms: as a graphical representation of a human in a virtual reality system; a Non-Player Character (NPC) in 3D computer games; an online sales-assistant; an online nurse encouraging health diets and exercise, etc. (Rich and Sidner, 2009; Larson, 2010). For avatars to engage with humans credibly in an emotional and expressive manner, four key issues need to be taken into account (Rich and Sidner, 2009): 1. 
Engagement: whether you have interaction between two autonomous avatars; or an autonomous avatar and an avatar with a human user; or two avatars with human users, it is very important that the initiation of contact, ongoing contact and the termination of contact appears lifelike, smooth and familiar (Sidner et al., 2005). 2. Emotion: Using appropriate gestures, stances, facial expressions, voice intonations, etc., an avatar needs to express emotional information in a way that human users can understand. On the other hand, intelligent and autonomous avatars need to be able to recognize and understand these same emotional characteristics of the human user in order to behave and respond correctly (Gratch et al., 2009). 3. Collaboration: For the coordination of activities to occur, a high level of communication is required between the avatar and human user. To appear credible, collaboration also relies on engagement and emotion (Grosz and Kraus, 1996). 4. Social Relationship: It is becoming more and more common that developers are creating avatars which have long-term relationships with their human users. These avatars can work with and help human users with long-term weight-loss, healthy-eating diets, education and learning, etc. To have a successful social relationship between intelligent avatars and human users, the other three factors of engagement, emotion and collaboration are necessary. Virtual agents have been endowed with human-like conversational and emotional capabilities. To some extent, they can communicate emotions through facial expressions and body movements. Several computational models have been proposed to allow the agents to display a large palette of emotions, going beyond the six prototypical expressions of emotions (Arya et al. 2009). Emotions such as relief, embarrassment, anxiety and regret, can be shown as sequences of multimodal signals (Niewiadomski et al. 2011). Being a social interactant, agents can control the communication of their emotional states. Agents can display complex emotions such as the superposition of emotions, the masking of one emotional state by another one by combining signals of the different emotional states on their face (Niewiadomski and Pelachaud 2010; Bui 2004; Mao et al. 2008). Models of display rules (Prendinger and Ishizuka 2005) ensure the agents decide when to show an emotion and to whom. Several perceptual studies have shown that human users perceive when an agent lies through asymmetric facial expressions (Rehm and André 2005). They can also distinguish the display of a polite smile vs. an embarrassed or a happy one by the agent (Ochs et al. 2012). In an interaction, agents are both speaker and listener. Models of turn-taking ensure a smooth interaction schema. Agents are active listeners. They display backchannels to tell how they view what their interactants are saying (often with very limited natural language understanding); and how engaged in the interaction they are (Sidner et al. 2004). Imitating specific interactants’ behaviour allows the agent to maintain engagement as well as to build rapport (Bevacqua et al. to appear; Huang et al. 2010). Social capabilities such as politeness and empathy have also been considered to some extent. Agents can adapt their facial movements (Niewiadomski and Pelachaud 2010) and their gestures (Rehm and André 2005) depending on the social relationship with their interactants. 
These models work in very specific controlled contexts. By simulating the emotions their interactants can potentially feel, agents can display empathy toward them (Ochs et al. 2008). However, such models suffer from strong limitations, such as responding to users' anger by showing anger. Studies have shown that users' stress and frustration decrease when agents show empathy (Prendinger and Ishizuka 2005; Beale and Creed 2009). Empathic agents are preferred; they have been found to be more agreeable and caring. Human users show more satisfaction and engagement with them (Brave et al. 2005) and their task performance increases as well (Partala and Surakka 2004). Agents showing appropriate emotions achieve higher perceived believability, and are perceived to be warmer and more competent (Demeure et al. to appear). To simulate human intelligence and perform credibly in a virtual online world, autonomous avatars must interact in a way that is familiar to human users, and the processes and techniques reported in the state of the art for enabling this are extremely varied: Search and optimization: This technique involves intelligently searching through every possible solution. However, this can become too slow, as the number of possible solutions can grow exponentially depending on the problem being solved. In this case heuristics are used, which can be described as intelligent guesses on choosing the correct path to take, also known as "pruning". Optimization is a form of search where the initial step includes a guess at the correct path and the search continues from that point. In the literature, many researchers have used search and optimization techniques for the development of autonomous agents, such as Shao and Terzopoulos, 2005; Chung et al. 2009; and Codognet, 2011. Logic: When logic is used as a problem-solving technique in AI, it can take the form of a set of statements or facts which can be true or false, as in propositional logic; or a set of descriptors which outline objects and their properties and relationships, as in first-order logic; or even a set of statements which can have a truth value between 0 and 1, as in fuzzy logic. There are many types of logic programming and problem-solving techniques which have been used in AI and with intelligent avatars, such as in Pokorny and Ramakrishnan, 2005; Kulakov et al. 2009; and Drescher and Thielscher, 2011. Probability: Probability theory is used in AI when there is not a complete set of information about the world or the events that will occur. One of the most popular probabilistic methods is Bayesian networks, but there are many others, such as HMMs, Kalman filters, decision theory, etc. Researchers who have used such techniques for autonomous agents include Alighanbari and How, 2006; Moe et al., 2008; and Arinbjarnar and Kudenko, 2010. Classifiers and statistical learning methods: Classification involves examining a new piece of data, input, stimulus, action, etc. and matching it to a previously seen or known class. Once this recognition occurs, a decision can be made about what action to take based on previous experience. The learning can be done using many different techniques, such as neural networks, support vector machines, nearest-neighbour algorithms, decision trees, etc. However, for every new problem a suitable classifier has to be chosen, as no individual classification method suits all problems (a toy sketch of the nearest-neighbour case is given below).
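A minimal sketch of the nearest-neighbour case mentioned above: an avatar picks a reaction by matching a new observation (here a hand-crafted two-dimensional feature vector) against labelled examples. The features, labels and reactions are invented for the illustration and do not correspond to any REVERIE module.

```python
"""Toy nearest-neighbour classifier: choose an avatar reaction from a new
observation by finding the closest labelled example. The feature vectors
(voice energy, smile intensity) and labels are invented for illustration."""

import math

# (voice_energy, smile_intensity) -> perceived user state
training_examples = [
    ((0.9, 0.1), "agitated"),
    ((0.2, 0.8), "content"),
    ((0.1, 0.1), "disengaged"),
    ((0.8, 0.9), "excited"),
]

reactions = {
    "agitated": "adopt a calm posture and lower the speech rate",
    "content": "smile and continue the current topic",
    "disengaged": "ask a question to re-engage the user",
    "excited": "mirror the enthusiasm with expressive gestures",
}

def classify(sample):
    """Return the label of the training example nearest to `sample`."""
    return min(training_examples,
               key=lambda ex: math.dist(sample, ex[0]))[1]

if __name__ == "__main__":
    observed = (0.85, 0.2)                # loud voice, little smiling
    state = classify(observed)
    print(state, "->", reactions[state])  # agitated -> adopt a calm posture ...
```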
Examples of classification for intelligent avatars include Jebara and Pentland, 2002; Jadbabaie et al. 2003; Brooks et al, 2007; Oros et al. 2008; Parker and Probst, 2010; and Boella et al. 2011. Usefulness to the project Autonomous agents are a critical part of the REVERIE project; they are required to engage in understandable social interactions with other users in an emotional and expressive manner. They must respond in real time and portray behaviour and responses to the virtual environment and other users in a credible manner. They must learn to adapt their responses and behaviours to differing environments, such as a learning environment where interaction is required, and a narrative space (storytelling) where less interaction will occur. The REVERIE avatars may also be controlled by their human users and when this occurs the avatars must learn their behaviours 82 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools through a process of repetition and reward. “A device can either behave intelligently as a result of automated or human-controlled directions, or a device literally can be intelligent - that it requires no external influence to direct its actions” (Larson, 2010) Several of the presented tools (e.g. the SEMAINE platform) allow human users to interact with virtual agents in real-time. This interaction relies on the analysis of non-verbal cues extracted from users and from the model of social behaviours of the agent. The flexibility of their architecture (cf. SEMAINE) offers us to concentrate on the module we aim to extend (e.g. the attentive model or the emotional one) while relying on the other parts of the architecture. Connection to the existing modules is done using standard languages (EmotionML, FML, BML) and specific ones (e.g. SEMAINEML). Available tools from the literature and REVERIE partners Greta Description: Greta is a free, real-time 3D embodied conversational agent with a 3D model of a woman compliant with the MPEG-4 animation standard. She is able to communicate using a rich palette of verbal and non-verbal behaviours. Greta can talk and simultaneously show facial expressions, gestures, gaze, and head movements. http://perso.telecom-paristech.fr/~pelachau/Greta/ NPCEditor Description: A package for creating dialogue responses to inputs for one or more characters. It contains a text classifier based on cross-language relevance models that selects a character's response based on the user's text input, as well as an authoring interface to input and relate questions and answers, and a simple dialogue manager to control aspects of output behaviour. http://vhtoolkit.ict.usc.edu/index.php/NPCEditor Non-verbal Behaviour Generator (NVBG) Description: The NVBG is a tool that automates the selection and timing of non-verbal behaviour for ECA (aka Virtual Humans). It uses a rule-based approach that generates behaviours given information about the agent's cognitive processes but also by inferring communicative functions from a surface text analysis. The rules within NVBG were crafted using psychological research on non-verbal behaviours as well as a study of human non-verbal behaviours, to specify which non-verbal behaviours should be generated at each given context. In general, it realizes a robust process that does not make any strong assumptions about the markup of communicative intent in the surface text. 
In the absence of such a markup, NVBG can extract information from the lexical, syntactic, and semantic structure of the surface text that can support the generation of believable non-verbal behaviours. http://vhtoolkit.ict.usc.edu/index.php/NVBG Semaine Description: Semaine is a modular, real-time architecture of Human-Agent interaction. Its technologies embed a visual and acoustic analysis, a dialog manager and a visual and acoustic synthesis. The system can detect the emotional states of the user through analyzing facial expression, head movement and voice quality. Four virtual agents with specific personality traits including different facial models, voice quality and behaviour sets have been defined. The Semaine project is an EU-FP7 1st call STREP project and aims to build a Sensitive Artificial Listener (SAL). SAL is a multimodal dialogue system which can: 1. Interact with humans with a 83 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools virtual character; 2. Sustain an interaction with a user for some time; 3. React appropriately to the user's non-verbal behaviour. Semaine is Open Source software. http://semaine.sourceforge.net/ Dependencies on other technology: ActiveMQ, JAVA, openSMILE 3.4. Audio and speech tools Audio and speech based interaction between man and machine has been an active research area in the field of HCI, of which speech (Rabiner, 1993) and speaker recognition (Beigi, 2011) have been at the forefront for many years. Research in automatic speech recognition has had to deal with robust representation, and the recognition process of the many parameters characterizing a highly variable signal such as speech. Speaker recognition, on the other hand, emphasizes on recognizing persons from the physical characteristics of the sound of their voice, as well as their manner of speaking, such as accent, pronunciation pattern, rhythm, etc. In both cases of speaker enrolment and speaker verification, a person’s speech undergoes a feature extraction process that transforms the raw signal into feature vectors. In the enrolment case, a speaker model is trained using these feature vectors; while in the recognition case, the extracted feature vectors of the unidentified speech are compared to the speaker models in the database, outputting a similarity score. Through advancements in recent years, several methods for representation and feature extraction have been proposed, with the use of Mel-Frequency Cepstral Coefficients (MFCC) (Darch et al. 2008; Davis and Mermelstein, 1980) being more prominently featured among other methods, including perceptual linear prediction coefficients (Hermansky, 1990), normalization via cepstral mean subtraction (Furui, 2001), relative spectral filtering (Hermansky and Morgan, 1994) and Vocal Tract Length Normalization (VTLN) (Eide and Gish, 1996). The predominant framework for speech recognition algorithms uses stochastic processing with HMMS (Gales and Young, 2007), while adaptation to variable conditions (such as different speaker, vocabulary, environment, etc.) has been addressed via maximum a posteriori probability estimation (Gauvain and Lee, 1994), maximum likelihood linear regression (Kim et al. 2010) and eigenvoices (Kuhn et al. 1998). Recent advances in audio and speech interaction however, have turned towards the analysis and recognition of human emotion in audio signals (Oudeyer, 2003; Chen, 2000) and the detection of human produced sound cues (such as laughs, cries, sighs, etc.) (Schröder et al. 
2006) to complement voice pitch and speech data in order to tackle emotion integration in intelligent HCI. Research on audio affect recognition is largely influenced by basic emotion theory, and most existing efforts aim to recognise a subset of basic emotions from the speech signal. Similar to speech recognition methods, most existing speech affect recognition approaches use acoustic features such as MFCCs, with many studies showing that pitch and energy contribute the most to affect recognition. Most methods in the related literature are able to discriminate between positive and negative affective states (Batliner et al. 2003; Kwon et al. 2003; Zhang et al. 2004; Steidl et al. 2005). Several recent efforts have been made towards automatic recognition of non-linguistic vocalizations such as laughter (Truong and van Leeuwen, 2007), coughs (Matos et al. 2006) and cries (Pal et al. 2006), which help improve the accuracy of affective state recognition. Others have tried to interpret speech signals in terms of application-specific affective states, such as deception (Hirschberg et al. 2005; Graciarena et al. 2006), certainty (Liscombe et al. 2005), stress (Kwon et al. 2003) and frustration (Ang et al. 2002). Other approaches in the field of audio interaction have investigated the concept of musical generation and interaction (Lyons et al. 2003). Usefulness to the project Audio and speech tools will provide REVERIE agents with the means to socially interact with users in a natural way. More specifically, REVERIE avatars should be able to comprehend the message the user is communicating, as well as understand the emotional condition of the user, and adapt their own behaviour accordingly (e.g., detecting a laughing sound will indicate a pleasant occasion). Available tools from the literature and REVERIE partners AcquireSpeech Description: AcquireSpeech is a tool that connects the sound input on your computer to a speech recognition server, while providing real-time monitoring, transcripts, and recording, as well as allowing for direct text input and playback of recorded speech samples. It has been designed with a focus on configurability and usability, allowing for different speech recognition systems and usage scenarios. http://vhtoolkit.ict.usc.edu/index.php/AcquireSpeech Dependencies on other technology: PocketSphinx openSMILE Description: The openSMILE feature extraction tool enables you to extract large audio feature spaces in real time. It combines features from Music Information Retrieval and Speech Processing. Speech & Music Interpretation by Large-space Extraction (SMILE) is written in C++ and is available as both a standalone command-line executable and a dynamic library. The main features of openSMILE are its capability for on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple configuration file. New components can be added to openSMILE via an easy binary plugin interface and a comprehensive API. http://sourceforge.net/projects/opensmile/ openEAR Description: openEAR is the Munich Open-Source Emotion and Affect Recognition Toolkit developed at the Technische Universität München (TUM). It provides efficient (audio) feature extraction algorithms implemented in C++, classifiers, and pre-trained models.
http://sourceforge.net/projects/openart/ YAAFE Description: YAAFE means “yet another audio features extractor“; a software designed for efficient computation of many audio features simultaneously. Audio features are usually based on same intermediate representations (FFT, CQT, envelope, etc.), YAAFE automatically organizes computation flow so that these intermediate representations are computed only once. Computations are performed block per block, so YAAFE can analyze arbitrarily long audio files. The YAAFE framework and most of its core feature library are released in source code under the GNU Lesser General Public License (LGPL) and is available online (http://www.tsi.telecom-paristech.fr/aao/en/software-and-database/). Other extraction software also exists. jAudio (McEnnis et al. 2005) is a java-based audio feature extractor library, whose results are written in XML format. Maaate is a C++ toolkit that has been developed to analyze audio in the compressed frequency domain, http://maaate.sourceforge.net/. FEAPI (Lerch et al. 2005) is a plugin API similar to VAMP. MPEG7 also provides Matlab and C code for feature extraction. http://yaafe.sourceforge.net/ DESAM Toolbox Description: The DESAM Toolbox, which draws its name from the collaborative project “Décomposition en Eléments Sonores et Applications Musicales” funded by the French ANR, is a set of Matlab functions dedicated to the estimation of widely used spectral models from, 85 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools potentially musical, audio signals. Although those models can be used in music information retrieval tasks, the core functions of the toolbox do not focus on any application. It is rather aimed at providing a range of state-of-the-art signal processing tools that decompose music files according to different signal models, giving rise to different “mid-level” representations. This toolbox is therefore aimed at the research community interested in the modeling of musical audio signals. The Matlab code is distributed under the GPL and is available online: http://www.tsi.telecom-paristech.fr/aao/en/software-and-database/ 3.5. Emotional Behaviours Computational models of emotional expressions are gaining a growing interest. The models of expressive behaviours are crucial for the believability of virtual characters (Aylett, 2004). Agents portraying emotional behaviours and different emotional strategies such as empathy are perceived as more trustworthy and friendly; users enjoy interacting with them more (Brave et al., 2005; Partala and Surakka, 2004; Prendinger et al., 2005, Ochs et al., 2008). Early computational models followed mainly the discrete emotion approaches that provide concrete predictions on several emotional expressions (Ruttkay et al, 2003). The idea of universality of the most common expressions of emotions was particularly sought after to enable the generation of “well recognizable” facial displays. However easy to categorize in terms of evoked emotions, the expressions based on discrete theory are still oversimplified. One method to enrich the emotional behaviour of a virtual character, while relying on discrete facial expressions, is to introduce blends. In the works of Bui (2004), Niewiadomski and Pelachaud (2007a, 2007b) and Mao et al. (2008), these blend expressions are modeled using fuzzy methods. Several models of emotional behaviour link separate facial actions with some emotional dimensions like valence. 
Interestingly most of them use the PAD model, which is a 3D model defining emotions in terms of pleasure (P), arousal (A) and dominance (D) (Mehrabian, 1980). Among others, Zhang et al. (2007) proposed an approach for the synthesis of facial expressions from PAD values. Another facial expression model based on the Russell and Mehrabian 3D model was proposed by Boukricha et al. (2009). A facial expressions control space is thus constructed with multivariate regressions, which enables the authors to associate a facial expression to each point in the space. A similar method was applied previously by Grammer and Oberzaucher (2006), whose work relies only on the two dimensions of pleasure and arousal. Their model can be used for the creation of facial expressions relying on the action units defined in the FACS (Ekman et al. 2002) and situated in the 2D space. Arya et al. (2009) propose a perceptually valid model for emotion blends. Fuzzy values in the 3D space are used to activate the agent's face. Recently, Stoiber et al. (2009) proposed an interface for the generation of facial expressions of a virtual character. The interface allows one to generate facial expressions of the character using the 2D custom control space. The underlying graphics model is based on the analysis of the deformation of a real human face. Some researchers were inspired by the Componential Process Model (CPM) (Scherer, 2001), which states that different cognitive evaluations of the environment lead to specific facial behaviours. Paleari and Lisetti (2006) and Malatesta et al. (2009) focus on the temporal relations between different facial actions predicted by the Sequential Evaluation Checks (SECs) of the CPM model. Lance and Marsella (2007, 2008) propose a model of gaze shifts towards an arbitrary target in emotional displays. The model presented by Niewiadomski et al. (2011) generates emotional expressions that may be composed of non-verbal behaviours displayed over different modalities, of a sequence of signals or of expressions within one modality that can change dynamically. Signal descriptions are gathered into two sets: the behaviour set and constraint set. Each emotional state has its own behaviour set, which contains signals that might be used by the virtual agent to display that emotion. 86 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Usefulness to the project Computational models of emotional expression provide virtual agents with a large repertoire of multimodal behaviours. Agents can express their emotional states. They can communicate their attitude to others agents or humans. They seem also much more lively and believable. The inclusion of emotional behaviour is a very important task within the REVERIE project. This task will develop a computational model to enable the virtual human or avatar to display a large range of emotional responses, as a reaction to external stimuli. The work will guide the autonomous agents towards a natural and credible pattern of contextual responses, such as a startle reflex, as well as emotion in context such as displaying sadness and joy in response to visual and aural prompts. Focus will typically be on multimodal emotional behaviour including facial expression, body movement and behaviour expressivity. 
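To make the dimensional approach concrete, the following minimal sketch maps a point in PAD space to the nearest of a few discrete emotion prototypes. The prototype coordinates are illustrative placeholders and are not taken from (Mehrabian, 1980) or from any of the models cited above.

```python
# Minimal sketch: mapping a pleasure-arousal-dominance (PAD) point to the
# closest discrete emotion prototype. The prototype coordinates below are
# illustrative assumptions, not values from the cited literature.
import numpy as np

PROTOTYPES = {            # (pleasure, arousal, dominance) in [-1, 1]
    "joy":     ( 0.8,  0.5,  0.4),
    "anger":   (-0.6,  0.6,  0.3),
    "sadness": (-0.6, -0.4, -0.3),
    "fear":    (-0.6,  0.6, -0.4),
    "relief":  ( 0.4, -0.3,  0.2),
}

def closest_emotion(pleasure, arousal, dominance):
    """Return the prototype label nearest to the given PAD point."""
    point = np.array([pleasure, arousal, dominance])
    return min(PROTOTYPES,
               key=lambda name: np.linalg.norm(point - np.array(PROTOTYPES[name])))

print(closest_emotion(0.7, 0.4, 0.3))   # -> "joy"
```

A behaviour planner could also use such a mapping in the opposite direction, selecting a behaviour set (as in Niewiadomski et al. 2011) according to the region of PAD space the agent currently occupies.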
Available tools from the literature and REVERIE partners EmotionML Description: Emotion Markup Language (EmotionML) 1.0, is a markup language designed to be usable in a broad variety of technological contexts while reflecting concepts from the affective sciences. EmotionML allows a technological component to represent and process data, and enables interoperability between different technological components processing the data. It provides a manual annotation of material involving emotionality, automatic recognition of emotions from sensors and generation of emotion-related system responses. The latter may involve reasoning about the emotional implications of events, emotional prosody in synthetic speech, facial expressions and gestures of embodied agents or robots, the choice of music and colours of lighting in a room, etc. http://www.w3.org/TR/emotionml/ SAIBA framework Description: Developed to ease the integration of autonomous agent technologies. Three main processes have been highlighted: Intent Planner, Behaviour Planner and Behaviour Realizer. Two representation languages have been designed to link these processes. FML (Heylen et al, 2008) represents high level information of what the agent aims to achieve: its intention, goals and plans. BML (Vilhjalmsson et al, 2007) describes non-verbal communicative behaviours at a symbolic level. http://wiki.mindmakers.org/projects:saiba:main/ EMA (Gratch & Marsella, 2004) Description: Modelling of the emotion regulation process and of the effects of emotions on the mental and affective state of the agent; a model of the adaptation process identifies the behaviour that a virtual agent should adopt to cope with high intensity emotions. There are different coping strategies. FatiMa (Dias et al, 2011) Description: This is an open-source generic model of emotion. It relies on appraisal theory. It can output discrete emotions from the OCC model as well as continuous ones as in the PAD representation. http://sourceforge.net/projects/fatima-modular/ ALMA (Gebhard, 2005) Description: This model adopts the discrete and continuous representation of emotions: it uses the 24 kinds of emotions from the OCC model. Every emotion has an associated value from the PAD representation. ALMA is freely available. 87 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools http://www.dfki.de/~gebhard/alma/index.html OSSE (Ochs et al, 2009) Description: This model embeds the discrete representation of emotions from the OCC model (joy, hope, disappointment, sadness, fear, relief, pride, admiration, shame and anger). It uses a continuous representation of their intensity. Events are described by triples <agent, action, patient>. Emotions triggered by this event are calculated for each agent. This computation depends on the values of these three elements and the preferences of the agents, OSSE is freely available. https://webia.lip6.fr/svn/OSSE 3.6. Virtual Worlds As for the real world, the bare essence of a virtual world consists of the availability of (1) a virtual landscape or terrain and (2) virtual characters or avatars. Evidently, these bare essences are hardly sufficient to create a plausible immersive virtual environment for obvious reasons that often find their equivalent in the real world. A short list of the most common denominators is provided below. 
Terrain: a complex scene may consist of hills, rocks, water, other
Terrain editor: to allow the user to [graphically] design the terrain
Avatar: usually created when a new user visits the virtual world for the first time
Avatar editor: to customize clothes, skin, hair, facial and body characteristics, other
Avatar animation: walking, flying, dancing, path finding, crowd control, other
Sky: a skybox is often used to encapsulate the complete virtual world
Sky [timeline] editor: to allow the user to [graphically] design the skybox
Scene: the terrain is packed with objects like houses, structures, vegetation, other
Scene/Object editor: to allow the user to [graphically] design or import new objects
Scripting: to automate actions or reactions, to animate objects
Script editor: to allow the user to [graphically] design or import new animations
Physics: basic physics engines will keep avatars from falling through the ground or from running through objects, but many other dedicated engines exist, for example
Water physics: to simulate the behaviour of water
Particle physics: to rupture glass or vegetation, to simulate explosions, other
Lighting: to cast shadows and reflections from one object or avatar to the next using spot lights, environment lights, other
Transparency, mirroring: such features may require special engines
Sound: to support music, talking (man-to-machine)
Video: to support in-world movie playback, multiscopic video, other
Security: authentication, authorization, privacy, ownership
Economy: currency, convertibility to real currency, fair trade, other
Then there is the financial and programming side of virtual worlds, which introduces a number of other factors to take into account.
Shaders [editor]: for HW accelerated graphics using OpenGL, DirectX, other
Networking: to communicate information between the different elements inside the world and in-between different worlds
Licensing: commercial, free, GPL, LGPL, BSD, MIT, CCL, other
Source code availability: open, closed
Platform: Windows, Linux, iOS, Android, OSX, Solaris, Playstation, Xbox, Wii, other
Programming language: C, C++, C#, Java, Flash, Delphi, Python, Ruby, JavaScript, other
Interoperability: import and export capabilities to/from other worlds
Architecture: client-server, peer-to-peer, cloud, other
Deployment: scalability, distributability, reliability, extensibility, maintainability, other
2D: text, fonts, cursors, panels, menus, other
Support: documentation, large and active development community, other
A reasonably good but still basic overview of existing game engines is already available on this page: http://content.gpwiki.org/index.php/Game_Engines. This list of game engines does not, however, include the large number of other virtual worlds that are mostly already online. These kinds of virtual worlds either use one of the game engines listed there or provide their own proprietary or open source implementation. They do not target the fast interaction that is of prime importance in gaming (as in first person shooter games), but rather the social interaction where people can meet up, talk, travel, dance, create, self-enhance, give presentations, make money, do business, earn a reputation, or do other things together. Evidently, the requirements for these kinds of virtual worlds are inherently different.
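As a minimal illustration of how the requirement dimensions listed above could be used to shortlist candidate worlds or engines for a given use case, the sketch below applies weighted scoring; every candidate name, score and weight is a hypothetical placeholder, not an assessment of a real engine.

```python
# Hypothetical weighted scoring of virtual world candidates against the
# requirement dimensions listed above (licensing, scripting, physics, ...).
# All names, scores and weights are illustrative placeholders.
def score(candidate, weights):
    """Weighted sum of per-requirement scores on a 0-5 scale."""
    return sum(weight * candidate.get(req, 0) for req, weight in weights.items())

# A social world weights physics low; a gaming world weights it high.
weights_social = {"licensing": 3, "scripting": 2, "physics": 1, "networking": 3}
weights_gaming = {"licensing": 1, "scripting": 2, "physics": 4, "networking": 2}

candidates = {
    "world_A": {"licensing": 5, "scripting": 3, "physics": 1, "networking": 4},
    "world_B": {"licensing": 2, "scripting": 4, "physics": 5, "networking": 3},
}

for name, features in candidates.items():
    print(name, score(features, weights_social), score(features, weights_gaming))
```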
Particle physics may for instance be a make-or-break requirement for a virtual world dedicated to gaming but may be completely irrelevant for a virtual world dedicated to socializing. The virtual worlds for socializing can arguably be further divided into Metaverses and Mirror Worlds. Well-known Metaverses are for example The Sims, IMVU, Second Life, Blue Mars, Kaneva, HiPiHi, and Active Worlds. A more comprehensive list of Metaverses is available on this page: http://arianeb.com/more3Dworlds.htm. Mirror Worlds, on the other hand, try to simulate the real world. Well-known Mirror Worlds are for example Google [Maps] StreetView and Microsoft [Bing] StreetSide, but there are also less well-known ones such as MapJack, EveryScape and Earthmine. A more comprehensive list of street views is available on this page: http://en.wikipedia.org/wiki/Competition_of_Google_Street_View.
Usefulness to the project
In line with the introduction on the referenced page for game engines, when picking a virtual world, attention should be paid to whether or not it satisfies the needs of the use case. Rather than listing all pros and cons of every virtual world, it is therefore more practical to list the financial, licensing, deployment, technical and other requirements of the partners and the use cases. This list can then be used to start the search for the best suited virtual worlds. In the best case, a single world can be selected for all partners and both use cases. For example, the intelligent agent platform Greta of ParisTech uses OGRE and OpenGL. If the consortium is willing to deeply integrate this technology into a virtual world, the choice of engines is already significantly narrowed to e.g. Axiom, Diamonin, YAKE, OGE, RealmForge, and YVision.
Available tools from the literature and REVERIE partners
Open Source Virtual World Server OpenSimulator
Description: OpenSimulator is an open source multi-platform, multi-user 3D application server. It can be used to create a virtual environment (or world) which can be accessed through a variety of clients, on multiple protocols. OpenSimulator allows virtual world developers to customize their worlds using the technologies they feel work best; the framework has been designed to be easily extensible. OpenSimulator is written in C#, running both on Windows over the .NET Framework and on Unix-like machines over the Mono framework. The source code is released under a BSD License, a commercially friendly license for embedding OpenSimulator in products. OpenSimulator can be regarded as the open source counterpart of Linden Labs' proprietary Second Life server. OpenSimulator was used as the virtual world environment in the ITEA2 Metaverse1 project to implement the PresenceScape conceptual demonstrator.
Second Life also supports virtual currency for in-world commercial activities. A number of external tools can be used to create basic avatar animations that can be imported into the world. Second Life was used as virtual world environment in the ITEA2 Metaverse1 project to implement the Mixed Reality conceptual demonstrator. http://www.secondlife.com Open Source Virtual World Client Hippo Description: The Hippo OpenSimulator client is a modified Second Life client, targeted at OpenSimulator users. The client is written in C++, running on Linux. It allows its users to navigate through and interact with objects and avatars in virtual environments via a Graphical User Interface (GUI). The Hippo OpenSimulator Viewer works seamlessly together with Linden Labs Virtual World Second Life. Hippo was used in the ITEA2 Metaverse1 project as a GUI and to implement stereoscopic (3D) video streaming in the Mixed Reality conceptual demonstrator. http://mjm-labs.com/viewer http://sourceforge.net/projects/opensim-viewer Dependencies on other technologies: Linux OS (or Windows OS using Cygwin) Open Source Virtual World client Metabolt Description: The Metabolt client allows its user to navigate through and interact with objects and avatars in virtual environments via a command line interface. The Metabolt Client works seamlessly together with OpenSimulator and Linden Labs Virtual World Second Life. Metabolt was used in the ITEA2 Metaverse1 project to implement the autonomous agents in the PresenceScape conceptual demonstrator. http://www.metabolt.net Linden Labs Virtual World Client Second Life Description: The Linden Labs Virtual World Client Second Life allows its users to navigate through and interact with objects and avatars in virtual environments via a GUI. The client works seamlessly together with OpenSimulator and Linden Labs Virtual World Second Life. The Second Life client was used in the ITEA2 Metaverse1 project as a GUI for the PresenceScape conceptual demonstrator for virtual camera control. http://www.secondlife.com 3.7. User-system interaction User-system interaction is a topic that is much older than the computer itself. Indeed, depending on the meaning of ‘system’, we can say that an analogue watch also has a user-system interface, more specifically the wristband of the watch, the look-and-feel of the watch, the hands of the watch revealing the current time and the dial buttons to mechanically charge the internal spring or change the hand positions. We can easily go further back in time to the Belgian “Pot van Olen” of the 16 th 90 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools century, the crossbow or catapult of the middle ages or the first Oldowan chopping tool of the Lower Palaeolithic period. For obvious reasons however, we will narrow the scope of the usersystem interaction to computer or computer-related systems. Within this narrowed scope, many books have been written that can arguably be categorized into two main areas. The first one is more user-system related where we can look at the plurality of options with which the user can interact with the system. These options can vary from the usual suspects like keyboard or mouse to the most sophisticated ones like direct brain-computer interfaces. The second area is more interaction related where a given user-system can be applied in different ways to interact with the system. 
The various Windows interface designs have for example evolved from the simple, not very user-friendly design of Windows 1.0 back in 1985 to the more user-friendly Metro design of Windows 8. Since it is at present not clear which interaction-related area or areas are of interest to the project, we will further narrow the scope of the user-system interaction to the user-system related area. The availability of a plurality of devices and related software that can be used by a user to interact with a system has grown significantly since the introduction of the screen as an output device, and the keyboard and mouse as input means. Apple's iPhone for instance kept the screen as an output device but replaced the keyboard and mouse input with just a few buttons, a camera, a microphone, a touch screen, motion sensors, and a GPS, all of which can be used by the user to interact with the system. The output means of the iPhone have been further complemented with a speaker and a vibration motor. Experiments have been done with mice carrying a small device directly connected to the mouse's brain on the one side and wirelessly connected to a water tap on the other side. Eventually, the mice learned to open the tap with their thoughts only. In the following part, a limited list of examples of available types of user-system devices is given. Arguably, the first wave of input devices for the general public consisted of electrical contact-based devices like the keyboard, the mouse, the trackball, the joystick, the gamepad, and the steering wheel. The second wave can be characterized as consisting of more sophisticated devices like GPS, RFIDs, touch pads, touch screens, motion sensors, the user's voice derived from the microphone and the user's pose and gestures derived from the camera. Devices like the iPhone, Wii and Kinect clearly belong to this category, as well as game cards and access badges based on RFID or similar technologies. Probably in the same category is face or facial feature detection, a field of technology that is already incorporated in many domestic photo or film cameras today. Variations on these devices exist, e.g. projections of input capabilities onto ordinary objects, after which video analysis is used to determine the intent of the user. Probably belonging in the same category are eye trackers in combination with face trackers that are able to deduce exactly where the user is looking. Devices in the third wave use even more sophisticated technologies that are sometimes already in use in specialized areas like the army or healthcare, and these are slowly finding their way to the general public. Examples of such devices are [wearable] brain caps using contactless electroencephalogram measurements for the gaming industry, attention trackers that can be used in education, direct nerve interfaces for control over prosthetic limbs, tongue interfaces for people with severe motor disabilities, and all kinds of bio and chemical sensors for lie detectors. Gradual technological advances in the hardware or software technologies used by these user-system interaction devices, as well as the combination of two or more such devices, constantly allow us to improve the control of the user over the system. For instance, advanced facial analysis, optionally combined with more advanced voice analysis, allows us to better understand the emotion, intent or interest of the interacting user.
3.7.1.
User-System Interaction for Virtual Characters Virtual characters are distinct by whether they are intended to be controlled by a human user (often referred to as an avatar), or being autonomous. Avatars can be controlled through invoking simple discrete sets of commands, like “walk forward” or “jump”, through any interface device (such as a keyboard or a mouse). Such control is typical in any modern video game. More sophisticated forms of input are made possible by introducing more diverse input devices, such as analog joysticks, microphones, cameras and other tracking devices. In the latter cases, the avatar is obliged to modify the rig according to the observed movement. Autonomous characters on the other hand rely on sophisticated AI techniques in order to assume control over their behaviour. Nevertheless, autonomous virtual humans (or agents) are usually required to interact with users in a natural and believable way. Communication is handled by a number of uni-modal or multimodal channels, specific to the input device used to communicate signals. Since Audio-based HCI was addressed in Section 3.4, the remainder of this Section will focus on Visual and Sensor based HCI techniques, as well as multimodal approaches, which fuse elements of the above methods for better results. Visual-based HCI is probably the most wide-spread area in the area of man-machine interaction, in which researchers have tried to address the different aspects of human response that can be visually recognized as signals. Such research topics include Facial Expression Analysis (de la Torre and Cohn, 2011), which aims at recognizing emotion information through facial expression display, Gesture recognition (Just and Marcel, 2009; Kirishima et al. 2005) has also provided auxiliary information about the user's emotional state, as well as complementing Body Movement Tracking (Gavrila, 1999; Aggarwal and Cai, 1999) for direct interaction in the context of control, described in the previous paragraph. Gaze tracking and estimation (Sibert and Jacob, 2000) is another indirect form of interaction, suited for recognizing the user’s focus of attention, as well as providing low level direct input (by using an eye-controlled mouse pointer, for example). A third modality for HCI is provided via physical sensors to communicate data between user and machine. Such hardware devices include many sophisticated technologies, such as motion tracking, haptic, pressure and taste/smell sensors. Motion tracking sensors usually consist of wearable clothes and joint sensors which allow computers to track human skeleton joint movements and reproduce the effect on virtual characters. Haptic and pressure sensors are more common in the robotics and virtual reality areas (Robles-De-La-Torre, 2006; Hayward et al. 2004; Iwata, 2003), in which machines are being made aware of contact. Smell and taste sensors also exist, although their applicability has been limited (Legin et al. 2005). Sensors may also however concern simpler and more common devices such as pen-based sensors, keyboard/mouse devices and joysticks. Penbased sensors are of specific interest to mobile devices and are more commonly related to handwriting and pen gesture recognition (Oviatt et al. 2000), while keyboards, mice and joysticks have been around for decades. Multi-modal HCI systems refer to the combination of the aforementioned uni-modal user inputs in order for one modality to assist the other in its shortcomings. 
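A minimal sketch of decision-level (late) fusion, one common way to combine such modalities, is given below: per-modality class posteriors over a shared label set are averaged with reliability weights. The labels, posterior values and weights are illustrative assumptions only, not part of any system described in this section.

```python
# Minimal sketch of decision-level (late) multimodal fusion: per-modality
# posteriors over the same label set are combined with reliability weights.
# Labels, posteriors and weights below are illustrative assumptions.
import numpy as np

LABELS = ["neutral", "happy", "angry"]

def fuse(posteriors, weights):
    """Weighted average of per-modality posterior vectors, renormalized."""
    fused = sum(w * np.asarray(p) for p, w in zip(posteriors, weights))
    return fused / fused.sum()

audio_posterior = [0.2, 0.6, 0.2]   # e.g. from a speech affect classifier
video_posterior = [0.1, 0.3, 0.6]   # e.g. from a facial expression classifier

fused = fuse([audio_posterior, video_posterior], weights=[0.4, 0.6])
print(LABELS[int(np.argmax(fused))], fused)
```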
Known multi-modal methods used in the literature fuse the visual and audio channels to improve recognition rates. For example, lip movement tracking has been shown to assist speech recognition, which in turn has been shown to assist command acquisition in gesture recognition. Applications of these multi-modal systems include smart video conferencing, intelligent homes, driver monitoring, intelligent games, ecommerce and aiding tools for disabled people. Usefulness to the project Understanding the user’s emotion, intent or interest is crucial for the development of an interactive cognitive automated system. In an e-learning setting, the user’s level of attention is most probably a valuable characteristic that may be used to maximize the focus of the user on the subject at hand. For autonomous agents to respond in a cognitive way to user interaction through voice, 92 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools gesture, facial feedback or other means, the user’s intent is evidently a crucial parameter. Not only the intent of a first user may be valuable input but rather the combined input of all users in a certain environment is probably crucial for the autonomous agent in order to be able to understand and adequately respond to group interactions. Group interaction may also be important for an intelligent camera system charged with real-time directing of the recordings of the group. Social translucency may also be improved by incorporating the obtained user information somehow in the rendering chain. Last but not least, the obtained information can be combined with a variety of stored and real-time contextual information. Simple examples of the above could be (1) the handshake of the autonomous agent with the user’s avatar on first encounter and (2) the [mutual] acknowledgment of a second user’s avatar approaching the autonomous agent and the first user’s avatar. REVERIE should provide multiple means of control interaction, with respect to user hardware setup. Avatars should be controllable through minimal input devices, such as the keyboard or mouse, as well as tracking devices such as Microsoft's Kinect sensor. Furthermore, natural social interaction between users and autonomous agents should be supported for increased realism and believability. In the list below, a list of useful user tracking tools and devices is provided. Available tools from the literature and REVERIE partners Watson Description: Watson is a real-time visual feedback recognition library for interactive interfaces that can recognize head gaze, head gestures, eye gaze and eye gestures using the images of a monocular or stereo camera. http://vhtoolkit.ict.usc.edu/index.php/Watson Open Source Natural Interaction Framework OpenNI Description: The OpenNI organization is an industry-led, not-for-profit organization formed to certify and promote the compatibility and interoperability of Natural Interaction (NI) devices, applications and middleware. As a first step towards this goal, the organization has made available an open source framework, the OpenNI framework, which provides an Application Programming Interface (API) for writing applications utilizing NI. This API covers communication with both low level devices (e.g. vision and audio sensors), as well as high-level middleware solutions (e.g. for visual tracking using computer vision). 
http://www.openni.org Dependencies on other technologies: PrimeSense NiTE middleware Open Source NI Middleware PrimeSense NiTE Description: The NI Middleware from PrimeSense is a lot like your brain; what yours does for you, NiTE does for computers and digital devices. It allows them to perceive the world in 3D so they can comprehend, translate and respond to your movements, without any wearable equipment or controls. Including computer vision algorithms, NITE identifies users and tracks their movements, and provides the framework API for implementing NI UI controls based on gestures. Hand Control allows you to control digital devices with your bare hands and as long as you're in control, NITE intelligently ignores what others are doing. Full Body Control lets you have a totally immersive, full-body video game experience; the kind that gets you moving. Being social, NITE middleware supports multiple users, and is designed for all types of action. http://www.primesense.com/Nite Dependencies on other technologies: PrimeSensor Module, Asus Xtion Firmware Seeing Machines Face Tracker Product 93 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Description: faceLAB provides full face and eye tracking capabilities. Its Automatic Initialization feature provides one-click subject calibration, generating data on (1) Eye movement; (2) Head position and rotation; (3) Eyelid aperture; (4) Lip and Eyebrow movement, and (5) Pupil size. faceAPI provides a suite of image-processing modules created specifically for tracking and understanding faces and facial features. These tracking modules are combined into a complete API toolkit that delivers a rich stream of information that can be incorporated into products or services. Seeing Machines faceAPI provides a comprehensive, integrated solution for developing products that leverage real-time face tracking. All image-processing for face tracking is handled internally, removing the need for any computer vision experience. http://www.seeingmachines.com iMotions Attention Tool Eye Tracker Software Description: iMotions Attention Tool is a robust eye tracking software platform for scientific and market research. iMotions technology is proven and has several patents pending on emotion measurements and reading pattern recognition. It allows merging eye tracking data from a diversity of models of eye trackers from Tobii, EyeTech and SensoMotoric Instruments. http://www.imotionsglobal.com Tobii Technology Eye Tracker Products Description: Tobii is the world’s leading vendor of eye tracking and eye control: a technology that makes it possible for computers to know exactly where users are looking. The Tobii eye trackers estimate the point of gaze with extreme accuracy using image sensor technology that finds the user’s eyes and calculates the point of gaze with mathematical algorithms. Tobii has a wide range of eye trackers (X1 Light, T60, T120, X60, and X120) for an equally wide range of applications (assistive technologies, human research, marketing, gaming, other). http://www.tobii.com SensoMotoric Instruments (SMI)’s Eye Tracker Products Description: SensoMotoric Instruments (SMI) is a world leader in dedicated computer vision applications, developing and marketing eye & gaze tracking systems and OEM solutions for a wide range of applications. Founded in 1991 as a spin-off from academic research, SMI was the first company to offer a commercial, vision-based 3D eye tracking solution. 
SMI products combine a maximum of performance and usability with the highest possible quality, resulting in high-value solutions for their customers. Their major fields of expertise are (1) Eye & gaze tracking systems in research and industry, (2) High speed image processing, and (3) Eye tracking and registration solutions in ophthalmology. SMI has a wide range of eye trackers (RED, RED250, RED500, IVIEW X, other) for an equally wide range of applications (assistive technologies, human research, marketing, gaming, other). http://www.smivision.com EyeTech Digital Systems Eye Tracker Products Description: EyeTech Digital Systems designs and develops eye tracking hardware and software since 1996. They provide both off-the-shelf eye tracking systems and a host of customized solutions. Their Quick Glance software enables cursor control using eye tracking and is 32/64bit compatible. It includes software for direct eye-tracking gaze data and third-party benefits such as heat maps, areas of interest, 3D heat maps, landscapes, and focus maps on static content, along with gaze plots on video content. http://www.eyetechds.com MiraMetrix Eye Tracker Product 94 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Description: The MiraMetrix S2 Eye Tracker is an easy to calibrate, tripod mounted, portable eye tracker that comes with a software API and viewer application. The API uses standard TCP/IP for communication and provides XML data output. The viewer application collects eye gaze and other data in real time and provides important analysis tools. It records a video of onscreen activity as subjects are interacting and having their eye movements tracked. A second monitor can be used to show what people are doing on-the-fly. http://mirametrix.com Alea Technologies Eye Tracker Product Description: The IntelliGaze IG-30 system is a commercial European eye tracking system that has been designed from the ground up with an Augmentative and Alternative Communication (AAC) application in mind. The open software architecture of the IntelliGaze system allows easy integration with most specialized communication packages as well as standard Windows applications. http://www.alea-technologies.de Open Source openEyes Eye Tracker Software Description: openEyes is an open-source open-hardware toolkit for low-cost real-time eye tracking. The openEyes toolkit includes algorithms to measure eye movements from digital videos, techniques to calibrate the eye-tracking systems, and example software to facilitate real-time eye-tracking application development. They make use of the Starburst algorithm, which is Matlab software that can be used to measure the user's point of gaze in video recorded from eye trackers that use dark-pupil IR or visible spectrum illumination. They provide the cvEyeTracker which is a real-time eye-tracking application using the Starburst algorithm written in C for use with inexpensive, off-the-shelf hardware. http://thirtysixthspan.com/openEyes Open Source Opengazer Eye Tracker Software Description: Opengazer is an open-source gaze tracker for ordinary webcams that estimates the direction of a user’s gaze. This information can then be passed to other applications. For example, used in conjunction with Dasher, Opengazer allows you to write with your eyes. Opengazer aims to be a low-cost software alternative to commercial hardware-based eye trackers. The latest version of Opengazer is very sensitive to head-motion variations. 
To rectify this problem the open source community is currently focusing on head tracking algorithms to correct head pose variations before inferring the gaze positions. A subproject of Opengazer involves the automatic detection of facial gestures to drive a switch-based program. Three gestures have been trained to generate three possible switch events: a left smile, right smile, and upwards eyebrow movement. All the software is written in C++ and Python. The Opengazer project is supported by Samsung and the Gatsby Foundation and by the European Commission in the context of the AEGIS project (Accessibility Everywhere: Groundwork, Infrastructure, Standards). http://www.inference.phy.cam.ac.uk/opengazer Open Source TrackEye Eye Tracker Software Description: TrackEye is a real-time tracking application of human eyes for Human Computer Interaction (HCI) using a webcam. The application features the following capabilities: (1) realtime face tracking with scale and rotation invariance, (2) tracking the eye areas individually, (3) tracking eye features, (4) eye gaze direction finding, and (5) remote controlling using eye movements. www.codeproject.com/KB/cpp/TrackEye.aspx Dependencies on other technologies: OpenCV Library v3.1 95 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Freer Logic (and Unique Logic and Technology) Attention Tracker Products Description: Freer Logic’s patent pending technology BodyWave, in the form of a sports armband, reads and reacts to brainwaves through the extremities of the body. BodyWave reads brain activity through the human body via a uniquely innovative arm band that houses brainwave sensors that attach to the arm or wrist. BodyWave monitors the brains physiological signals through the body. Dry sensors acquire brain signals and transfer them wirelessly via Bluetooth or WiFi to a mobile device or PC. When BodyWave is used with Freer Logic’s 3D computer simulations, it can teach stress control, increase attention, and facilitate peak mental performance. Freer Logic partners with Unique Logic and Technology provide a multitude of Play Attention applications that can be used for feedback technology, attention training, memory training, cognitive skill training, social skills training, motor skills training, behaviour shaping, and more. http://www.freerlogic.com, http://www.playattention.com 4. Related Tools to WP7: Composition and visualisation The aim of WP7 is to render avatars (use case 1) as well as visually highly realistic 3D representations of humans (use case 2) into a common virtual room. The latter case poses a major challenge. Besides refining already known methods, e.g. image based (single/multi-view with/without depth maps) and geometry based (polygon meshes with/without textures), a major part of this work package is to investigate new approaches. HHI intends to research a hybrid representation of humans for telecommunication scenarios (use case 2 in REVERIE). Image based rendering techniques will be combined with 3D geometry processing thereby mixing the benefits of both approaches. The goal is to achieve high quality human representations at low bit rates and low computational complexity. Such techniques allow real-time video conferencing in 3D with stereoscopic replay, different viewpoints etc. 4.1. Rendering of human characters Rendering of human characters started in the 1970's with the advent of video games. At that time, only very abstract representations, which hardly resembled their real counterparts, were used. 
After the first decade, the 2D figures in computer games became more human-like, with simple characteristic features like facial features, clothes and recognizable extremities. Another decade brought the transition from 2D to 3D and further increased the graphics resolution, which made facial expressions readable. With the new millennium, the richness in detail increased further, to a degree that realistically appearing renderings were achieved, at least when single frames are inspected. Today's renderings of human beings are quite close to photo-realistic appearance.
Figure 46: Depicts (from top to bottom): 1978's Basketball by Atari, 1985's Super Mario Brothers by Nintendo, 1996's Quake by id Software, 2004's Half Life 2 by Valve Software.
The major challenge nowadays lies in the reproduction of biodynamics to achieve natural movement as well as psychologically plausible behaviour. Although modern animations of human characters can be quite close to photo realism, there is the effect called "uncanny valley": human replicas that look and act almost, but not perfectly, like human beings cause a response of revulsion among human observers.
Figure 47: Depicts the game L.A. Noire, which features Depth Analysis's newly developed technology for the film and video game industries called MotionScan, which utilizes 32 cameras to record an actor's every wince, swallow, and blink, which is then transferred to in-game animation.
Inspired by the realistic rendering capabilities of computer animated human characters in video games, a trend is to also use such techniques in the film industry. The first photorealistic computer animated feature film was Final Fantasy: The Spirits Within, released in July 2001. On the one hand, the actors are perfectly designed and look very natural and realistic. On the other, the movements and dynamics are clearly recognizable as computer generated. One of the latest films featuring the most advanced human rendering techniques is The Adventures of Tintin, which was released in October 2011.
Figure 48: Final Fantasy: The Spirits Within (left) and The Adventures of Tintin (right).
From a scientific point of view, realistic rendering of human characters is no longer a pure graphics rendering problem, as today's graphics cards and libraries are capable of rendering very sophisticated 3D models. The realistic appearance depends on the model representing the human character. Since models of realistically appearing human characters are highly complex to achieve, it is no longer common to model humans by hand, but instead to generate them from 3D scans of real persons. For that reason, 3D reconstruction (Section 1.5), Motion Capture (Section 1.2) and Performance Capture (Section 1.3) are sufficient prerequisites to model and animate realistically appearing models of human characters. For the purpose of photo-realistic rendering, markerless methods are preferable, because the captured texture can be used directly for rendering (which might not be needed in every case, e.g. when retargeting the motion commands). That is why this section focuses on a few works which are particularly interesting from the rendering perspective. Due to the enormous complexity of the human body, human motion, and human biomechanics, realistic simulation of human beings remains largely an open problem in the scientific community. It is one of the "holy grails" of computer animation research.
For the moment it looks like 3D computer animation can be divided into two main directions: non-photorealistic and photorealistic rendering, the latter of which can be further subdivided into real and stylized photorealism. In order to achieve realistic rendering of human characters, the following aspects have to be addressed, besides rendering of a suitable geometry and texture:
Light: reflection (direct and subsurface), refraction and shadows, e.g. d'Eon and Irving (2011) and Alexander et al. (2010).
Surface structure: hair (e.g. Koster et al. (2004)), wrinkles, pimples, etc.
Dynamics: changes of color and shape because of movement.
Naturalness: e.g. breathing, pupils, eyelid movement.
Psychology: plausible behaviour.
For simple humanoid models it might be sufficient to design the model directly out of colored geometric primitives. But in order to create realistic models, it is common to base the work on 3D scans of humans. In his PhD thesis (Anguelov, 2005) the author presents a (now well known) chain of algorithms (also published as separate conference papers) to:
Recover an articulated skeleton model as well as a non-rigid deformation model from 3D range scans of a person in different poses (Anguelov et al. 2004a; Anguelov et al. 2004b) in order to interpolate between these poses.
Calculate a direct surface deformation model to adapt surface regions of a mean 3D template model depending on neighboring joint states, and additionally a PCA-based model to inter- and extrapolate between scans of differently shaped people in order to reflect the variability among human shapes (gender, height, weight, ...) (Anguelov et al. 2005).
Figure 49: Shapes of different people in different poses, synthesized from the SCAPE (Shape Completion and Animation of PEople) framework (Anguelov, 2005).
In (Hasler et al. 2009) the authors present a unified model that describes both human pose and body shape, which allows the accurate modeling of muscle deformations not only as a function of pose but also dependent on the physique of the subject. Coupled with the model's ability to generate arbitrary human body shapes, it greatly simplifies the generation of highly realistic character animations. A learning-based approach is trained on 550 full-body 3D laser scans taken of 114 different subjects15. Scan registration is performed using a non-rigid deformation technique. Then, a rotation-invariant encoding of the acquired exemplars permits the computation of a statistical model that simultaneously encodes pose and body shape. Finally, morphing or generating meshes according to several constraints simultaneously can be achieved by training semantically meaningful regressors.
15 Source code as well as data of 3D scans, registrations and the extracted PCA model are available at http://www.mpi-inf.mpg.de/resources/scandb/
Figure 50: The registration pipeline of (Hasler et al. 2009) (from left to right): the template model, the result after pose fitting, the result after non-rigid registration, the transfer of captured surface details, and the original scan annotated with manually selected landmarks.
Figure 51: Derived from a dataset of prototypical 3D scans of faces, the morphable face model contributes to two main steps in face manipulation: (1) deriving a 3D face model from a novel image, and (2) modifying shape and texture in a natural way.
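The statistical body and face models above share a simple core: a new shape is expressed as a mean shape plus a linear combination of principal components learned from registered scans. The sketch below illustrates only that core with random placeholder data (assuming NumPy); it is not the SCAPE, Hasler et al. or morphable face model implementation.

```python
# Minimal sketch of the linear (PCA-based) statistical shape model underlying
# SCAPE-style body models and morphable face models:
#   shape = mean + sum_i coeff_i * component_i
# The data here are random placeholders, not a trained model.
import numpy as np

rng = np.random.default_rng(0)
n_vertices, n_components = 5000, 10

mean_shape = rng.standard_normal((n_vertices, 3))                 # mean mesh (x, y, z per vertex)
components = rng.standard_normal((n_components, n_vertices, 3))   # stand-in for a PCA basis

def synthesize(coeffs):
    """Generate a new mesh from one coefficient per principal component."""
    coeffs = np.asarray(coeffs).reshape(n_components, 1, 1)
    return mean_shape + (coeffs * components).sum(axis=0)

new_shape = synthesize(0.5 * rng.standard_normal(n_components))
print(new_shape.shape)   # (5000, 3)
```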
Figure 52: Matching a morphable model to a single image (1) of a face results in a 3D shape (2) and a texture map estimate. The texture estimate can be improved by additional texture extraction (4). The 3D model is rendered back into the image after changing facial attributes, such as gaining (3) and losing weight (5), frowning (6), or being forced to smile (7). 99 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Since the human visual system is highly optimized to read and interpret facial expressions of people, lots of research is focused on realistic face expression and head pose rendering. A very successful approach is presented in (Blanz and Vetter, 1999). Starting from an example set of 3D face models, a morphable face model is derived by transforming the shape and texture of the examples into a vector space representation. New faces and expressions are modeled by forming linear combinations of the prototypes. Shape and texture constraints derived from the statistics of example faces are used to guide manual modeling or automated matching algorithms. This approach allows 3D face reconstructions from single images and their applications for photorealistic image manipulations. The authors demonstrate face manipulations according to complex parameters such as gender, fullness of a face or its distinctiveness (illustrated in Figures 56 and 57). A different approach is to mix geometry-based and image-based approaches. An augmented reality example worth mentioning is (Eisert and Hilsmann, 2011). The authors present a virtualmirror system for virtual try-on scenarios. A highly realistic visualization is realized by modifying only the relevant regions of the camera images while preserving the rest. The selected piece of cloth is tracked in consecutive frames and exchanged in the output images with a virtual piece of cloth which is adapted to the detected current lighting and deformation conditions. Figure 53: Hierarchical image-based cloth representation (left) and sample input camera images (upper row right), retextured results with modified color and logo (lower row right). In the case of free-view-point video, hybrid representations (image- and geometry-based) have been proven successful, providing user selected views onto an actor with highly realistic visualizations. In (Carranza et al. 2003) such a method is presented. Here, the actor’s silhouettes are extracted from synchronized video frames via background segmentation and then used to determine a sequence of poses for a 3D human body model. By employing multi-view texturing during rendering, time-dependent changes in the body surface are reproduced in high detail. The motion capture subsystem runs offline, is non-intrusive, yields robust motion parameter estimates, and can cope with a broad range of motion. The rendering subsystem runs at real-time frame rates using ubiquitous graphics hardware, yielding a highly naturalistic impression of the actor. The actor can be placed in virtual environments to create composite dynamic scenes. Free-viewpoint video allows the creation of camera fly-throughs or viewing the action interactively from arbitrary perspectives. Figure 54: Novel views onto an actor, generated with (Carranza et al. 2003) from multi-view footage via image-based texturing. 100 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Also noteworthy are the two scientific publications mentioned below in the tools-and-literature section, (Xu et al., 2011) and (Hilsmann and Eisert, 2012). 
Both approaches have in common, that they support the synthesis of motion sequences of an actor after having captured a multi-view multi-pose database of him/her. This approach constitutes the basis of HHIs hybrid rendering research for REVERIE. Usefulness to the project Rendering of human characters is obviously a main interest of the REVERIE project, because its goal is the development of a communication platform where its participants are projected into a common virtual room. For use case 1, simply rendered human representations are sufficient. The goal of use case 2 is to provide photo realistic 3D representations of the users (limited to three for the project). Available tools from the literature and REVERIE partners Video-based Characters - Creating New Human Performances from a Multi-view Video Database (Xu et al., 2011) Description: This method synthesizes plausible video sequences of humans according to userdefined body motions and viewpoints. A small database of multi-view video sequences is captured of an actor performing various basic motions. This database needs to be captured only once and serves as the input to the synthesis algorithm. A marker-less model-based performance capture approach is applied to the entire database to obtain pose and geometry of the actor in each database frame. To create novel video sequences of the actor from the database, a user animates a 3D human skeleton with novel motion and viewpoints. The technique then synthesizes a realistic video sequence of the actor performing the specified motion based only on the initial database. The first key component of this approach is a new efficient retrieval strategy to find appropriate spatio-temporally coherent database frames from which to synthesize target video frames. The second key component is a warping-based texture synthesis approach that uses the retrieved most-similar database frames to synthesize spatio-temporally coherent target video frames. For instance, this enables us to easily create video sequences of actors performing dangerous stunts without them being placed in harms way. It is shown through a variety of videos and a user study that realistic videos of people can be synthesized, even if the target motions and camera views are different from the database content. Figure 55: Animation of actor created from a multi-view video database. The motion was designed by an animator and the camera was tracked from the background with a commercial camera tracker. 101 Image-based Animation of Clothes (Hilsmann and Eisert, 2012) Description: A pose-dependent image-based rendering approach for the visualization of clothes with very high rendering quality. The image-based representation combines body-posedependent geometry and appearance. The geometric model accounts for low-resolution shape adaptation, e.g. animation and/or view interpolation, while small details (e.g. fine wrinkles), as FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools well as complex shading/reflection properties are accounted for through numerous images captured in an offline process. The images contain information on shading, texture distortion and silhouette at fine wrinkles. The image-based representations are estimated in advance from real samples of clothes captured in an offline process, thus shifting computational complexity into the training phase. For rendering, pose dependent geometry and appearance are interpolated and merged from the stored representations. 
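The pose-dependent interpolation step common to (Xu et al., 2011) and (Hilsmann and Eisert, 2012) can be caricatured as follows: given appearance samples captured at known poses, a target pose is rendered by blending the nearest samples with weights that decrease with pose distance. The sketch below illustrates only that retrieval-and-blend step with hypothetical data; the published methods additionally warp the retrieved frames rather than simply averaging them.

```python
# Generic sketch of pose-dependent appearance interpolation: retrieve the
# database samples whose capture poses are closest to the target pose and
# blend them with inverse-distance weights. Poses and images are placeholders.
import numpy as np

rng = np.random.default_rng(1)
db_poses = rng.uniform(size=(20, 15))            # 20 samples, 15-D pose descriptors
db_images = rng.uniform(size=(20, 64, 64, 3))    # matching appearance samples

def interpolate(target_pose, k=3, eps=1e-6):
    """Inverse-distance-weighted blend of the k nearest database samples."""
    dists = np.linalg.norm(db_poses - target_pose, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()
    return np.tensordot(weights, db_images[nearest], axes=1)

frame = interpolate(rng.uniform(size=15))
print(frame.shape)   # (64, 64, 3)
```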
Figure 56: Details of an arm bending sequence interpolating between the left and right most images. The second and third images are synthetically generated in-between poses. Note how the wrinkling behaviour is perceptually correct. Analyzing Facial Expressions for Virtual Conferencing (Eisert and Girod, 1998) Description: A method for the estimation of 3D motion from 2D image sequences showing head and shoulder scenes typical for video telephone and teleconferencing applications. A 3D model specifies the color and shape of the person in the video. Additionally, the model constrains the motion and deformation in the face to a set of facial expressions which are represented by the facial animation parameters defined by the MPEG-4 standard. Using this model, a description of both global and local 3D head motion as a function of the unknown facial parameters is obtained. Combining the 3D information with the optical flow constraint leads to a robust and linear algorithm that estimates the facial animation parameters from two successive frames with low computational complexity. To overcome the restriction of small object motion, which is common to optical flow based approaches, a multi-resolution framework is used. Experimental results on synthetic and real data confirm the applicability of the presented technique and show that image sequences of head and shoulder scenes can be encoded at bit-rates below 0.6 kbit/s. Figure 57: Left: original frames; 2nd column: wireframe of the animated head model; 3rd column: textured model rendered with estimated expression parameters; Right: facial expression applied to a different model. 102 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Geometry-Assisted Image-based Rendering for Facial Analysis and Synthesis (Eisert and Rurainsky, 2006) Description: An image-based method for the tracking and rendering of faces. The algorithm is used in an immersive video conferencing system where multiple participants are placed in a common virtual room. This requires viewpoint modification of dynamic objects. Since hair and uncovered areas are difficult to model by pure 3D geometry-based warping, image-based rendering techniques are added to the system. By interpolating novel views from a 3D image volume, natural looking results can be achieved. The image-based component is embedded into a geometry-based approach in order to limit the number of images that have to be stored initially for interpolation. Also, temporally changing facial features are warped using the approximate geometry information. Both geometry and image cube data are jointly exploited in facial expression analysis and synthesis. Figure 58: Different head positions - all generated from a monocular video sequence. Hair is correctly reproduced if the head is turned. 4.2. Scene recomposition with source separation The aim of this task is to provide 3D audio rendering and compositing tools. The main goal of this task will be to enable the simulation of real or enhanced virtual acoustic environments for multiple sound sources and possibly multiple user locations in the virtual scene. For this task, two main relevant technologies/components are of primary interest and include 1) the scene recomposition or remixing from existing real acoustic scenes, which would imply the 3D audio rendering of imperfect sources obtained from source separation; 2) Flexible 3D audio rendering components of natural sources. 
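As background for the separation methods surveyed in the remainder of this section (e.g. REPET), the sketch below applies a generic soft time-frequency mask to a mixture, given rough magnitude estimates of the background and foreground; it assumes SciPy is available and is not an implementation of REPET or of any other specific method cited here.

```python
# Generic soft time-frequency masking, the basic operation behind many of the
# separation methods discussed in this section. Background/foreground
# magnitude estimates are placeholders; this is not the REPET algorithm.
import numpy as np
from scipy.signal import stft, istft

def soft_mask_separate(mixture, bg_mag, fg_mag, fs=16000, nperseg=1024):
    """Split a mixture into background/foreground given rough magnitude estimates."""
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)
    mask = bg_mag / (bg_mag + fg_mag + 1e-12)        # soft mask in [0, 1]
    _, background = istft(mask * Z, fs=fs, nperseg=nperseg)
    _, foreground = istft((1.0 - mask) * Z, fs=fs, nperseg=nperseg)
    return background, foreground

rng = np.random.default_rng(2)
fs = 16000
x = rng.standard_normal(fs)                          # 1 s of placeholder audio
_, _, Z = stft(x, fs=fs, nperseg=1024)
mag = np.abs(Z)                                      # crude stand-in estimates
background, foreground = soft_mask_separate(x, 0.5 * mag, 0.5 * mag, fs=fs)
```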
A specific challenge is to bridge the gap between complex but high-quality acoustic room simulation methods, based on the use of pre-stored impulse responses of pre-defined rooms/halls, and the less accurate but efficient statistical or parametric room simulation methods.

An extensive literature exists on source separation for scene recomposition. The following four studies represent state-of-the-art work in this domain, and research on audio rendering and composition in REVERIE will build on these prior works:

Figure 59: Beat spectrogram

The separation of the lead vocals from the background accompaniment in audio recordings is a challenging task. Recently, an efficient method called REPET (REpeating Pattern Extraction Technique) has been proposed to extract the repeating background from the non-repeating foreground (Liutkus et al., 2012). While effective on individual sections of an audio document, REPET does not allow for variations in the background (e.g. verse vs. chorus) and is thus limited to short excerpts only. This limitation was overcome and REPET was generalized to permit the processing of complete musical tracks. The proposed algorithm tracks the period of the repeating structure and computes local estimates of the background pattern. For this it uses the beat spectrogram, a 2D representation of the sound that reveals the rhythmic variations over time. Separation is performed by soft time-frequency masking, based on the deviation between the current observation and the estimated background pattern. Evaluation on a dataset of 14 complete tracks shows that this method can perform at least as well as a recent competitive music/voice separation method, while being computationally efficient.

A two-stage blind source separation algorithm for robot audition was developed by Maazaoui et al. (2012). The first stage consists of a fixed beamforming pre-processing to reduce the reverberation and the environmental noise. The manifold of the sensor array was difficult to model due to the presence of the head of the robot, so pre-measured Head-Related Transfer Functions (HRTFs) were used to estimate the beamforming filters. Using the HRTFs to estimate the beamformers allows the effect of the head on the manifold of the microphone array to be captured. The second stage is a blind source separation algorithm based on a sparsity criterion, namely the minimization of the l1 norm of the sources. Different configurations of the algorithm are presented and promising results are shown; in particular, the fixed beamforming pre-processing improves the separation results.

When designing an audio processing system, the target tasks often influence the choice of a data representation or transformation. Low-level time-frequency representations such as the Short-Time Fourier Transform (STFT) are popular because they offer meaningful insight into sound properties at a low computational cost. Conversely, when higher-level semantics such as pitch, timbre or phoneme are sought, representations usually tend to enhance their discriminative characteristics at the expense of their invertibility; they become so-called mid-level representations. In Durrieu et al. (2011), a source/filter signal model which provides a mid-level representation is proposed. This representation makes the pitch content of the signal as well as some timbre information available, hence keeping as much information from the raw data as possible. This model is successfully used within a main melody extraction system and a lead instrument/accompaniment separation system. Both frameworks obtained top results at several international evaluation campaigns.
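A common element of the separation systems above is soft time-frequency masking of an STFT. The following Python sketch illustrates the principle; the running median used to estimate the repeating background is only a crude stand-in for REPET's period tracking via the beat spectrogram, and all parameter values are illustrative.

    import numpy as np
    from scipy.signal import stft, istft

    def soft_mask_separation(x, fs, nperseg=2048):
        # x: mono mixture signal, fs: sampling rate.
        # Returns (background, foreground) estimates via soft TF masking.
        f, t, X = stft(x, fs=fs, nperseg=nperseg)
        mag = np.abs(X)
        bg_mag = np.median(mag, axis=1, keepdims=True)   # crude repeating-background estimate
        bg_mag = np.minimum(bg_mag, mag)                 # background cannot exceed the mixture
        mask = bg_mag / (mag + 1e-12)                    # soft mask in [0, 1]
        _, background = istft(mask * X, fs=fs, nperseg=nperseg)
        _, foreground = istft((1.0 - mask) * X, fs=fs, nperseg=nperseg)
        return background, foreground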
Ozerov et al. (2011) considered the Informed Source Separation (ISS) problem where, given the sources and the mixtures, any kind of side-information can be computed during a so-called encoding stage. This side-information is then used to assist source separation, given the mixtures only, at the so-called decoding stage. State-of-the-art ISS approaches do not really consider ISS as a coding problem and rely on purely source-separation-inspired strategies, leading to performances that can at best reach those of oracle estimators. On the other hand, classical multichannel source coding strategies are not optimal either, since they do not benefit from the availability of the mixture. The authors introduce a general probabilistic framework called Coding-based ISS (CISS) that consists of quantizing the sources using a posterior source distribution of the kind usually used in probabilistic model-based source separation. CISS benefits from both source coding, due to the source quantization, and source separation, due to the use of the posterior distribution that depends on the mixture. Their experiments show that CISS based on a particular model considerably outperforms both the conventional ISS approach and the source coding approach based on the same model.

Usefulness to the project
Simulation of real or enhanced virtual acoustic environments constitutes an important objective of the project. Compatibility with the characteristics of the room acoustics plays a fundamental role in ensuring a good interaction between users and the REVERIE virtual world and hence a plausible virtual immersion. Scene recomposition is one of the two main approaches dedicated to this task. The following section gives an overview of the second approach, namely 3D audio rendering.

Available tools from the literature and REVERIE partners
The tools from the literature presented in the next section, developed by IT/TPT, are all available for the project.

4.3. 3D audio rendering of natural sources
There is an extensive literature in this area; a subset of particularly interesting studies, covering both statistically and physically based approaches, has been selected and included below.

The plane wave decomposition is an efficient analysis tool for multidimensional fields, particularly well suited to the description of sound fields, whether continuous or discrete, obtained by a microphone array. A beamforming algorithm to estimate the plane wave decomposition of the initial sound field was developed by Guillaume and Grenier (2007). The algorithm aims to derive a spatial filter which preserves only the sound field component coming from a single direction and rejects the others. The originality of the approach is that the criterion uses a continuous instead of a discrete set of incidence directions to derive the tap vector. A spatial filter bank is then used to perform a global analysis of sound fields. The efficiency of the approach and its robustness to sensor noise and position errors are demonstrated through simulations.
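For orientation, the sketch below implements a classical delay-and-sum beamformer steered towards a single plane-wave direction; it is a simple baseline for the kind of spatial filtering discussed above, not the optimized continuous-direction criterion of Guillaume and Grenier (2007). The array geometry, the signal layout and the sign convention of the steering direction are assumptions of this sketch.

    import numpy as np

    def delay_and_sum(frames, mic_positions, propagation_dir, fs, c=343.0):
        # frames          : (M, N) array, one row of N samples per microphone
        # mic_positions   : (M, 3) microphone coordinates in metres
        # propagation_dir : (3,) unit vector along which the plane wave travels
        # Returns the (N,) beamformed signal; delays are compensated as phase
        # shifts in the frequency domain, so fractional delays are handled.
        M, N = frames.shape
        delays = mic_positions @ propagation_dir / c              # per-mic delay (s)
        freqs = np.fft.rfftfreq(N, d=1.0 / fs)
        spectra = np.fft.rfft(frames, axis=1)
        steering = np.exp(2j * np.pi * np.outer(delays, freqs))   # undo the delays
        return np.fft.irfft((spectra * steering).mean(axis=0), n=N)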
Head-Related Impulse Response (HRIR) measurement systems are quite complex and require long acquisition times for an accurate sampling of the full 3D space. Therefore HRIR customization has become an important research topic. In HRIR customization, some parameters (generally anthropometric measurements) are obtained from new listeners and ad-hoc HRIRs can be retrieved from them. Another way to obtain new listeners' parameters is to measure a subset of the full 3D space of HRIRs and extrapolate them in order to obtain a full 3D database. Such a partial acquisition system should be rapid and accurate. Fontana et al. (2006) present a system which allows for rapid acquisition and equalization of HRIRs for a subset of the 3D grid. A technique to carry out HRIR customization based on the measured HRIRs is described.

Grenier and Guillaume (2006) described array processing to improve the quality of sound field analysis, which aims to extract spatial properties of a sound field. In this domain, spatial aliasing inevitably occurs due to the finite number of microphones used in the array. It is linked to the Fourier transform of the discrete analysis window, which consists of a mainlobe, fixing the resolution achievable by the spatial analysis, and sidelobes, which degrade the quality of the spatial analysis by introducing artifacts not present in the original sound field. A method to design an optimal analysis window with respect to a particular wave vector is presented, aiming to achieve the best possible localization in the wave vector domain. The efficiency of the approach is then demonstrated for several geometrical configurations of the microphone array, over the whole bandwidth of sound fields.

Moglie and Primiani (2011) describe the development and testing of a Finite-Difference Time-Domain (FDTD) code to simulate a whole Reverberation Chamber (RC). In order to reduce computational load, some approximations were introduced. The results were validated against experimental results measured in an RC; simulated and measured results were compared using the same statistical software. In addition, the computations easily provide results that cannot be obtained by measurement, such as those regarding the field distribution inside the cavity. The developed FDTD code is able to simulate the statistical properties of an RC as a function of its dimensions and stirrer geometry. Many numerical techniques have been proposed to simulate RCs, and every method requires very large computing resources if a full 3D simulation is performed. The developed FDTD code is able to simulate different geometries and movements of the stirrer(s), allowing the designer to obtain the best configuration using the simulator and saving time on experimental tests. Simulations complement experimental measurements when long measurement times or destructive tests would be required.
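To give a flavour of this family of time-domain simulations, the following minimal 2D FDTD sketch advances the scalar wave equation with the standard leapfrog update. It is a generic illustration only (grid size, source and boundary handling are arbitrary choices) and not the code of Moglie and Primiani (2011).

    import numpy as np

    def fdtd_2d(nx=200, ny=200, steps=400, c=343.0, dx=0.05, f_src=500.0):
        # Leapfrog update of the 2D scalar wave equation on a regular grid.
        # np.roll gives periodic boundaries, kept here only for brevity; a room
        # simulation would need proper reflecting/absorbing boundary conditions.
        dt = dx / (c * np.sqrt(2.0))          # CFL-stable time step in 2D
        coef = (c * dt / dx) ** 2
        p_prev = np.zeros((nx, ny))
        p = np.zeros((nx, ny))
        for n in range(steps):
            lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
                   np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)
            p_next = 2.0 * p - p_prev + coef * lap
            p_next[nx // 2, ny // 2] += np.sin(2 * np.pi * f_src * n * dt)  # point source
            p_prev, p = p, p_next
        return p                               # snapshot of the pressure field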
Sound rendering is analogous to graphics rendering when creating virtual auditory environments. In graphics, we can create images by calculating the distribution of light within a modelled environment; illumination methods such as ray tracing and radiosity are based on the physics of light propagation and reflection. Similarly, sound rendering is based on the physical laws of sound propagation and reflection. Lokki et al. (2002) clarify real-time sound rendering techniques by comparing them to visual image rendering. They also describe how to perform sound rendering based on knowledge of the sound source and listener locations, the radiation characteristics of the sound sources, the geometry of the 3D models and material absorption data, i.e. data congruent with that used for graphics rendering. In several instances, the authors use the Digital Interactive Virtual Acoustics auralization system, developed at the Helsinki University of Technology, as a practical example to illustrate a concept. In the context of sound rendering, the term auralization ('making audible') is the counterpart of visualization. Applications of sound rendering range from film effects, computer games and other multimedia content to enhancing experiences in Virtual Reality (VR).

Raghuvanshi et al. (2010) describe a method for real-time sound propagation that captures all wave effects, including diffraction and reverberation, for multiple moving sources and a moving listener in a complex, static 3D scene. It performs an offline wave-based numerical simulation over the scene and extracts perceptually salient information. To obtain a compact representation, the scene's acoustic response is broken into two phases, Early Reflections (ER) and Late Reverberation (LR), based on a threshold on the temporal density of arriving sound peaks. The LR representation is computed and stored once per room in the scene, while the ER accounts for more detailed spatial variation by recording multiple simulations over a uniform grid of source locations. ER data is then compactly stored at each source/receiver point pair as a set of peak delays/amplitudes and a residual frequency response sampled in octave bands. An efficient, real-time technique that uses this precomputed representation to perform binaural sound rendering based on frequency-domain convolutions is described. They also introduce a new technique to perform artifact-free spatial interpolation of the ER data. The system demonstrates realistic, wave-based acoustic effects in real time, including diffraction low-passing behind obstructions, hollow reverberation in empty rooms, sound diffusion in fully furnished rooms, and realistic LR.

Mehra et al. (2012) describe an efficient algorithm for a time-domain solution of the acoustic wave equation for the purpose of room acoustics. It is based on an adaptive rectangular decomposition of the scene and uses analytical solutions within the partitions, relying on a spatially invariant speed of sound. The technique is suitable for auralizations and sound field visualizations, even on coarse meshes approaching the Nyquist limit. It is demonstrated that by carefully mapping all components of the algorithm to the parallel processing capabilities of Graphics Processing Units (GPUs), a significant performance improvement is gained over the corresponding Central Processing Unit (CPU)-based solver while maintaining numerical accuracy. A substantial performance gain over a high-order finite-difference time-domain method is also observed. Using this technique, a 1 s simulation on a scene with an air volume of 7,500 m³ can be performed up to 1,650 Hz within 18 minutes, compared to around 5 hours for the corresponding CPU-based solver and up to three weeks for a high-order finite-difference time-domain solver on a desktop computer. To the best of the authors' knowledge, this is the fastest time-domain solver for modelling the room acoustics of large, complex-shaped 3D scenes that generates accurate results for both auralization and visualization.
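At the core of such auralization pipelines is the convolution of a dry source signal with (binaural) impulse responses. The sketch below shows this step using FFT-based convolution; the impulse responses, the normalisation and the function name are illustrative assumptions, not part of the cited systems.

    import numpy as np
    from scipy.signal import fftconvolve

    def auralize(dry_signal, ir_left, ir_right):
        # Convolve an anechoic source with left/right impulse responses
        # (e.g. a room response combined with HRIRs) to obtain a binaural
        # signal; fftconvolve performs the convolution in the frequency
        # domain, the usual choice for long room impulse responses.
        left = fftconvolve(dry_signal, ir_left)
        right = fftconvolve(dry_signal, ir_right)
        out = np.stack([left, right], axis=1)
        return out / (np.max(np.abs(out)) + 1e-12)   # normalise to unit peak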
Usefulness to the project
Like scene recomposition, 3D audio rendering is of real importance in the overall process of realistic rendering and 3D immersion.

Available tools from the literature and REVERIE partners
Romeo-HRTF
Description: Romeo-HRTF is an HRTF database. Here the usual concept of binaural HRTF is extended to the context of the audition of a humanoid robot whose head is equipped with an array of microphones. A first version of the HRTF database, based upon a dummy that simulates the robot (called Theo), is proposed. A second version of the database is based upon the head and torso of the prototype robot Romeo. The corresponding HRIRs are also provided. The heads of Theo and Romeo are equipped with 16 microphones, and the databases are recorded for 72 azimuth angles and 7 elevation angles. The typical use of this database is for the development of algorithms for robot audition.
http://www.tsi.telecom-paristech.fr/aao/en/2011/03/31/romeo-hrtf-a-multimicrophone-headrelated-transfer-function-database/

4.4. Composition and synchronization
Composition and synchronization deals with the live management of the data streams that have to be passed to the rendering facilities of the REVERIE platform.

A consistent final view of the user on the virtual room of REVERIE is the result of merging different data streams of object representations. REVERIE is characterized by its requirement to be able to model a scene made up of objects with vastly different representations. Examples of representations suited to REVERIE are image- or video-based representations that can be combined with depth maps, point-cloud representations, and polygon meshes that can be textured and optionally may have depth maps. In the virtual REVERIE world we may encounter avatars, highly realistic representations of humans, and objects that represent the walls, doors and windows of the virtual room and the home furnishings and props. All objects that act in the virtual world have to be positioned, and field-of-view-related object data of the appropriate level of detail have to be collected and transmitted to the appropriate renderers.

The representation of human characters can be animated, controlled by a limited set of parameters. This can be done, for instance, by deformation of the geometrical model through a rig applied to the polygonal mesh, or by control of the joints of a skeleton. As a result of this parameterization, the amount of data needed to advance to the next frame of an animation is reduced considerably. Examples of such parameterizations are the Facial Animation Parameters (FAP) defined by the MPEG-4 standard and H-Anim, a VRML-based standard for virtual human representation. More formats are mentioned in Section 3.1. An even higher level of control, affecting behaviour via the emotional state of the character, is possible using the markup languages mentioned in Section 3.2. The importance of these techniques lies in the reduction in bit rate needed for scene updates, which is essential for the efficiency and real-time characteristics required by the REVERIE framework.
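A minimal sketch of how such a parameterization keeps the per-frame data small is linear blend skinning, where the mesh is sent once and only a handful of bone transforms has to be streamed each frame. The data layout and names below are assumptions for illustration; REVERIE's actual animation formats (FAP, H-Anim, etc.) differ.

    import numpy as np

    def linear_blend_skinning(rest_vertices, weights, bone_transforms):
        # rest_vertices   : (V, 3) vertex positions of the rest-pose mesh (sent once)
        # weights         : (V, B) skinning weights, each row summing to one
        # bone_transforms : (B, 4, 4) homogeneous bone transforms for this frame,
        #                   the only data that needs to be streamed per frame
        V = rest_vertices.shape[0]
        homo = np.hstack([rest_vertices, np.ones((V, 1))])            # (V, 4)
        per_bone = np.einsum('bij,vj->bvi', bone_transforms, homo)    # (B, V, 4)
        deformed = np.einsum('vb,bvi->vi', weights, per_bone)         # blend per vertex
        return deformed[:, :3]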
Each of the types of data that specify the REVERIE virtual world, from the low-level geometry-based representations up to the level of parameters of the virtual human representation, will have a type-specific renderer associated with it. These renderers will process incoming data streams in parallel to produce the image data from which the final image will be composited. To make sure that the renderers produce output synchronously, the incoming network streams need synchronization control. Synchronization control needs to happen at the receiver side, and in some cases at the server side, because packets on the Internet can arrive late and out of order. In multimedia applications, proper temporal synchronization of media streams is essential for the user experience, and it is a well-developed research field. Task T5.6 provides network support for synchronization by adding proper timestamps and sequence numbers and by providing an architecture for exchanging control messages. The aim of T7.5 is to use this information to achieve synchronized rendering at the clients. First, we define the types of synchronization operations/goals that need to be performed at the renderer:

Intra-stream synchronization: proper re-ordering within the stream of a single media object. Lost, delayed and out-of-order packets need to be dealt with to maintain the original timing structure of the single media stream.

Inter-stream synchronization: maintaining the temporal relations between different streams, usually audio and video. In general, one media stream is chosen as the master and the other streams are forced to synchronize to its timeline. In the common case of A/V transmission the audio is often selected as the master, and when video frames are lost or damaged they are concealed by skipping, pausing or duplicating to maintain temporal synchronization. A comprehensive survey covering both the requirements for inter- and intra-media synchronization and the common synchronization approaches is presented by Blakowski and Steinmetz (1996). In video conferencing and more interactive distributed media applications, inter-destination synchronization also becomes relevant.

Inter-destination media synchronization (IDMS): aims for equal output timing of the same stream (presentation) at different receivers. It is beneficial for achieving fairness in competitive scenarios, such as a video-conferencing quiz. It can also be useful when users are watching the same video content while talking over an audio connection, so that the video content remains synchronized. The use of IDMS for a shared media application is described by Boronat, Mekuria, Montagud and Cesar (2012).

Inter-sender media synchronization: streams, or groups of streams, from different senders should also be rendered synchronously relative to the times at which they were generated at the different senders. This is needed to achieve fairness in some scenarios; for example, in a quiz where two users have to answer to a third participant, it would be unfair if this third participant rendered the streams from the first user earlier. Inter-sender synchronization therefore needs to be taken into account in REVERIE.
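Intra-stream synchronization is typically realised at the receiver with a jitter buffer that re-orders packets by sequence number and releases them according to their timestamps. The Python sketch below is a toy version of such a buffer; the class name, fields and the fixed buffering delay are assumptions for illustration.

    import heapq

    class JitterBuffer:
        # Re-orders packets by sequence number and releases them once their
        # presentation time (timestamp + buffering delay) has been reached.
        def __init__(self, delay=0.1):
            self.delay = delay        # buffering delay in seconds
            self.heap = []            # entries: (sequence_number, timestamp, payload)

        def push(self, seq, timestamp, payload):
            heapq.heappush(self.heap, (seq, timestamp, payload))

        def pop_ready(self, now, last_seq):
            # Return the packets that are due for play-out at wall-clock time `now`.
            ready = []
            while self.heap and self.heap[0][1] + self.delay <= now:
                seq, ts, payload = heapq.heappop(self.heap)
                if seq <= last_seq:
                    continue          # duplicate or too-late packet: discard
                ready.append((seq, ts, payload))
                last_seq = seq
            return ready, last_seq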
Synchronization and 3D Tele-Immersion in the REVERIE Context
Huang et al. (2011) summarized the different types of synchronization relevant for 3D tele-immersion. Inter-stream synchronization, which often refers to synchronization between audio and video streams, can in tele-immersion refer to multiple different video/audio streams (a bundle of streams). In that study, network-based support for synchronization was provided, whereas this task focuses on the client/receiver. The transmission properties of bundles of streams were investigated by Agarwal et al. (2010), which showed that when the number of streams in a bundle increases, the overall inter-stream skew (the difference between the earliest and latest arrival) increases. As these streams are similar and represent different angles of the same scene, it is important that they are synchronized. The basic buffering technique requires large buffering times, which is undesirable given the interactivity requirements of 3D tele-immersion. For these reasons, inter-stream synchronization is a major challenge in 3D tele-immersion, and in the REVERIE framework we have to develop algorithms to overcome this challenge and render multiple streams synchronously at the client without introducing large delays.

These types of synchronization, required in most multimedia applications, are relevant to REVERIE. T5.6 sets the first step and provides some network/sender-based support for synchronization. However, to achieve these goals at the receiver, the renderer needs to apply control techniques to the media streams. Such renderer adaptations are similar for the different types of media synchronization. A (not mutually exclusive) classification of control techniques, based on the work of Ishibashi and Tasaka (2000), is given below.

A basic technique of client-side synchronization is buffering: based on timestamps and sequence numbers, the effects of jitter and delay can be smoothed out. Preventive control techniques are applied when asynchrony (e.g. due to buffer underflow) is about to occur, but before it actually occurs: the play-out rate can be decreased, preventive pauses can be introduced, and the size of the buffers can be increased. Reactive control techniques are employed to recover after asynchrony has occurred: skips or pauses can be used to recover, or the play-out rate can be increased or decreased. Another technique is the use of a virtual time base that can be expanded or contracted according to the synchronization need. A further option is master/slave switching, i.e. switching the master role to the stream lagging behind may improve overall synchronization. Another simple and frequently used technique is discarding late events (dropping media with too much delay). Common control techniques can be used for both preventive and reactive control, such as play-out rate adaptation and data interpolation (estimation of missing frames from correctly received frames).
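To make these control categories concrete, the toy controller below maps buffer occupancy and the measured skew to the master stream onto one of the actions discussed above. All thresholds are illustrative assumptions; the 80 ms bound merely echoes the audio/video skew limit reported by Blakowski and Steinmetz (1996).

    def playout_control(buffer_ms, skew_ms,
                        low_water=40.0, high_water=200.0, max_skew=80.0):
        # buffer_ms: current buffer occupancy of the stream in milliseconds
        # skew_ms  : measured skew w.r.t. the master stream (positive = lagging)
        # Returns the name of the control action to apply.
        if buffer_ms < low_water:
            return "slow_down_playout"     # preventive: avoid buffer underflow
        if buffer_ms > high_water:
            return "speed_up_playout"      # preventive: avoid excessive delay
        if skew_ms > max_skew:
            return "skip_frames"           # reactive: catch up with the master
        if skew_ms < -max_skew:
            return "pause_or_duplicate"    # reactive: wait for the master
        return "normal_playout"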
Table 2: Client-based synchronization control techniques
Basic control: buffering techniques.
Preventive control: preventive skips of MDUs (eliminations or discarding) and/or preventive pauses of MDUs (repetitions, insertions or stops); change of the buffering waiting time of the MDUs.
Reactive control: reactive skips (eliminations or discarding) and/or reactive pauses (repetitions, insertions or stops); play-out duration extensions or reductions (play-out rate adjustments); use of a virtual time with contractions or expansions (VTR); master/slave scheme switching; late event discarding (event-based).
Common control: adjustment of the play-out rate; data interpolation.

Combinations of client-based techniques and server/network-based techniques are referred to as synchronization algorithms and are further described in Ishibashi and Tasaka (2000).

Quality of Experience for synchronization: the study by Blakowski and Steinmetz (1996) presented the perceptual aspects and requirements for media synchronization. For example, for audio/video synchronization a maximum skew of approximately 80 ms is allowed, with audio ahead of video being easier to notice. The values presented in that study are generally taken into account in the development of synchronization for multimedia systems. To test the effect of inter-destination synchronization schemes in networked games with virtual avatars and videoconferencing, Ishibashi, Nagasaka and Noriyuki (2006) conducted experiments which showed that a difference in latency between the questioners and the examiner of more than 300 ms leads to perceived unfairness between participants. The simulated game was a three-way conference in which two questioners tried to answer simple questions posed by a third participant, the examiner; the person who raised their hand first was given the turn to answer. As can be expected, when the link between the examiner and a questioner introduced more latency, unfairness was introduced. Hosoya, Ishibashi, Sugawara and Psannis (2008) then conducted similar tests while applying group synchronization (inter-destination media synchronization) and showed that it improves fairness.

Usefulness to the project
In REVERIE, the scene representation of the virtual 3D environment is characterized by different types of objects (realistic representations of humans, avatars, and the virtual objects of the environment in which these humanoids act), different types of object representations (image- or video-based, point clouds, textured polygon meshes, etc.) and different temporal characteristics of the entities (dynamic, static). To be able to present a consistent view of this complex scenery, the visuals of all these types of objects (and any audio involved) have to be managed and synchronized. The humanoids that act in the scenery may navigate independently, controlled by individual users of the REVERIE system. Their position and gaze influence the view presented to the user 'behind' that humanoid, which should be consistent with the view that other participants have. Composition and synchronization are therefore essential for the user experience: the feeling of being immersed in the dream world of REVERIE.
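As a final illustration for this section, the sketch below shows the core step of a master-driven inter-destination synchronization scheme: receivers report their play-out delays and the slowest one defines the common reference. The function and field names are hypothetical and the scheme is deliberately simplified.

    def idms_reference_playout(reported_delays_ms):
        # Each receiver reports its current end-to-end play-out delay (ms); the
        # largest delay becomes the common reference, and every other receiver
        # is told how much extra buffering to add so that the same frame is
        # presented at roughly the same wall-clock time everywhere.
        reference = max(reported_delays_ms.values())
        return {rid: reference - d for rid, d in reported_delays_ms.items()}

    # Example: receiver 'B' is the slowest, so 'A' and 'C' buffer a little longer.
    print(idms_reference_playout({"A": 120.0, "B": 180.0, "C": 150.0}))
    # -> {'A': 60.0, 'B': 0.0, 'C': 30.0}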
Available tools from the literature and REVERIE partners
SMIL 3.0
Description: The Synchronized Multimedia Integration Language (SMIL) is a language for choreographing multimedia presentations that combine audio, video, text and graphics in real time.
Dependencies on other technology: XML

Ambulant
Description: The AMBULANT Open SMIL Player is an open-source media player with support for SMIL 3.0. AMBULANT is a player upon which higher-level system solutions for authoring and content integration can be built, or within which new or extended support for networking and media transport components can be added. The AMBULANT player may also be used as a complete, multi-platform media player for applications that do not need support for closed, proprietary media formats. It is focused on distributed synchronization; modifications are possible in order to let it act as a signalling component.
Dependencies on other technology: SMIL

Videolat (Video latency)
Description: Videolat is a standalone tool that generates bar codes representing the current time. It serves to measure rendering times so that the latencies of components can be analyzed.
http://sourceforge.net/projects/videolat/

Open Source Multimedia Framework GStreamer
Description: GStreamer is a framework for constructing graphs of media-handling components. The applications it supports range from simple Ogg/Vorbis playback and audio/video streaming to complex audio (mixing) and video (non-linear editing) processing. Applications can take advantage of advances in codec and filter technology transparently. Developers can add new codecs and filters by writing a simple plug-in with a clean, generic interface. GStreamer is released under the LGPL; the 0.10 series is API and ABI stable. GStreamer was used in the FP7 TA2 project for 2D video composition.
http://gstreamer.freedesktop.org

GPAC
Description: GPAC is an open-source multimedia framework for research and academic purposes. The project covers different aspects of multimedia, with a focus on presentation technologies (graphics, animation and interactivity).

MPEG-4
Description: MPEG-4 is a standard defining the compression of digital audio-visual data. The use of MPEG-4 in REVERIE will primarily be for the compression of AV data. MPEG-4 has VRML support for 3D rendering, object-oriented composite files (including audio, video and VRML objects) and support for various types of interactivity. The standard includes the concept of "profiles" and "levels", allowing a specific set of capabilities to be defined in a manner appropriate for a subset of applications. Apart from the efficient coding, the ability to encode mixed media data (video, audio, speech etc.) and the ability to interact with the audio-visual scene generated at the receiver may be of interest for the REVERIE framework.

In REVERIE we will develop a media client that can synchronize networked streams of different media content, comprising both modelled data and full video/audio clips. Achieving synchronization between these different media types is the first challenge in T7.5. Moreover, inter-stream skew becomes larger when the number of streams increases, increasing the need for inter-stream synchronization; developing suitable inter-stream synchronization based on both network- and client-side techniques is the second challenge. Furthermore, techniques for inter-destination and inter-sender synchronization are useful to improve the Quality of Experience and provide consistency and fairness. The deployment of such synchronization in REVERIE is desirable. Overall, the Quality of Experience of interactive 3D multi-stream video is not well understood and can be studied based on the renderer developed in task 7.5.
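A minimal GStreamer example (Python bindings for the 1.0 API, shown here purely for illustration; the 0.10 series mentioned above has a slightly different API) indicates how such a framework keeps parallel audio and video branches synchronized: each sink schedules its buffers against the shared pipeline clock using the buffer timestamps. The test sources and sinks used below are generic elements, not REVERIE components.

    import gi
    gi.require_version('Gst', '1.0')
    from gi.repository import Gst

    Gst.init(None)

    # Two parallel live branches in one pipeline; the sinks render their
    # buffers against the shared pipeline clock, which provides the basic
    # intra- and inter-stream synchronization.
    pipeline = Gst.parse_launch(
        "videotestsrc is-live=true ! videoconvert ! autovideosink "
        "audiotestsrc is-live=true ! audioconvert ! autoaudiosink"
    )

    pipeline.set_state(Gst.State.PLAYING)
    bus = pipeline.get_bus()
    bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                           Gst.MessageType.ERROR | Gst.MessageType.EOS)  # run until error/EOS
    pipeline.set_state(Gst.State.NULL)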
4.5. Stereoscopic and autostereoscopic display
Visualization on stereoscopic and autostereoscopic displays is crucial for the immersion of participants into the REVERIE system. Seamless integration of such 3D presentation into the users' real environment, and interaction between elements of both, is largely unresolved. 3D displays have serious limitations regarding depth range, perception, and negative impact on the user. This interplay between technology, perception, and the envisaged application scenarios creates challenging research questions to be solved in the REVERIE project.

Usefulness to the project
Extensions of the following algorithms will ensure that pleasant tele-immersion is achieved through 3D presentation that satisfies bounds on perceptual comfort and integrates the real environment with the displayed scenery in an optimal and interactive way.

Available tools from the literature and REVERIE partners
Algorithms for nonlinear disparity mapping and rendering for stereoscopic 3D (Lang et al. 2010; Oskam et al. 2011)
Description: These algorithms allow automatic adaptation of stereo 3D content to a particular display size and user settings. Disturbing errors such as window violations can be corrected. Corresponding rendering software will be developed.

Algorithms for optimum depth range adaptation on autostereoscopic displays (Smolic et al. 2011; Zwicker et al. 2007; Smolic et al. 2008b; Farre et al. 2011)
Description: The usable depth range on autostereoscopic displays is even more limited than on glasses-based systems. Content has to be adapted to avoid aliasing and blur. Display properties will be evaluated on a theoretical basis and optimum rendering software will be developed.

5. References
Agarwal, P., Rivas, R., Wu, W., Arefin, A., Huang, Z. and Nahrstedt, K., 2011. SAS kernel: Streaming as a service kernel for correlated multi-streaming. Proceedings of the 21st International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV '11), pp. 81—86.
Agarwal, P., Toledano, R.R., Wu, W., Nahrstedt, K. and Arefin, A., 2010. Bundle of streams: Concept and evaluation in distributed interactive multimedia environments. IEEE International Symposium on Multimedia (IEEE ISM'10), pp. 25—32.
Agarwal, S., Snavely, N., Seitz, S. and Szeliski, R., 2009a. Bundle adjustment in the large. Computer Vision (ICCV), pp. 29—42.
Agarwal, S., Snavely, N., Simon, I., Seitz, S. and Szeliski, R., 2009b. Building Rome in a day. Computer Vision (ICCV), pp. 72—79.
Aggarwal, J.K. and Cai, Q., 1999. Human motion analysis: A review. Computer Vision and Image Understanding, 73(3), pp. 428—440.
Aggarwal, J.K. and Ryoo, M.S., 2011. Human activity analysis: A review. ACM Computing Surveys, 42(3), pp. 90—102.
Alexander, O., Rogers, M., Lambeth, W., Chiang, J., Ma, W., Wang, C. and Debevec, P., 2010. The digital Emily project: Achieving a photoreal digital actor. IEEE Computer Graphics and Applications, 30, pp. 20—31.
Alighanbari, M. and How, J.P., 2006. An unbiased Kalman consensus algorithm. American Control Conference, pp. 3519—3524.
Alonso, M., Richard, G. and David, B., 2007. Accurate tempo estimation based on harmonic + noise decomposition. EURASIP Journal on Applied Signal Processing, (1), p. 161.
Ang, J., Dhillon, R., Krupski, A., Shriberg, E., and Stolcke, A., 2002. Prosody-based automatic detection of annoyance and frustration in human-computer dialog. Proceedings of the 7th Int’l Conference on Spoken Language Processing (ICSLP). Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Pang, H. and Davis, J. 2004a. The correlated correspondence algorithm for unsupervised registration of non-rigid surfaces. 18th Neural Information Processing Systems Conference (NIPS), 17, pp. 33—40. Anguelov, D., Koller, D., Pang, H., Srinivasan, P. and Thrun, S., 2004b. Recovering articulated object models from 3D range data. 20th Uncertainty in Artificial Intelligence (UAI) Conference, pp. 18—26. Anguelov, D., 2005. Learning models of shape from 3D range data. Ph. D. Stanford University. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J. and Davis, J., 2005. SCAPE: Shape completion and animation of people. ACM Transactions on Graphics (TOG), 24(3), pp. 408—416. Arinbjarnar, M., and Kudenko, D., 2010. Bayesian networks: Real-time applicable decision mechanisms for intelligent agents in interactive drama. 2010 IEEE Symposium on Computational Intelligence and Games (CIG), pp. 427—434. Arnason, B. and Porsteinsson, A., 2008. The CADIA BML realizer. http://cadia.ru.is/projects/bmlr/. Arya, A., DiPaola, S. and Parush, A., 2009. Perceptually valid facial expressions for character-based applications. International Journal of Computer Games Technology, pp. 1—14. 113 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Aylett, R. S., 2004. Agents and affect: Why embodied agents need affective systems. Methods and Applications of Artificial Intelligence, pp. 496—504. Bajramovic, F. and Denzler, J., 2008. Global uncertainty-based selection of relative poses for multicamera calibrations. British Machine Vision Conference (BMVC), 2, pp. 745—754. Ballard, D., 1981. Generalizing the Hough transform to detect arbitrary patterns. Pattern Recognition, 13(2), pp. 111-122. Bari, M.F., Haque, M.R., Ahmed, R., Boutaba, R. and Mathieu, B., 2011. Persistent naming for P2P web hosting. IEEE International Conference on Peer-to-Peer Computing (P2P), pp. 270—279. Barinova, O., Lempitsky, V. and Kohli, P., 2010. On the detection of multiple object instances using Hough transforms. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2233— 2240. Bartlett, M.S., Littlewort, G.C., Frank, M.G., Lainscsek, C., Fasel, I. and Movellan, J.R., 2006. Automatic recognition of facial actions in spontaneous expressions. Journal of Multimedia, 1(6), pp. 22—35. Batliner, A., Fischer, K., Hubera, R., Spilkera, J. and Noth, E. 2003. How to find trouble in communication. Speech Communication, 40, pp. 117—143. Bay, H., Tuytelaars, T. and Van Gool, L., 2006. SURF: Speeded up robust features. Computer Vision (ICCV), pp. 404—417. Beale, R. and Creed, C., 2009. Affective interaction: How emotional agents affect users. International Journal of Human-Computer Studies, 67(9), pp. 755—776. Beigi, H., 2011. Fundamentals of speaker recognition. New York: Springer. Benbasat, A.Y. and Paradiso, J.A., 2001. An inertial measurement framework for gesture recognition and applications. International Gesture Workshop on Gesture and Sign Languages in HumanComputer Interaction, pp. 77-90. Bevacqua, E., de Sevin, E., Hyniewska, S. J. and Pelachaud, C., To Appear. A listener model: Introducing personality traits. Journal on Multimodal User Interfaces, Special Issue: Interacting ECAs. Blakowski, G., Steinmetz, R. 1996. 
A media synchronization Survey: Reference, Model, Specification and Case Studies. IEEE Journal on selected areas in communication Vol. 16 No. 6 pp. 5-35 Blanz, V. and Vetter, T., 1999. A morphable model for the synthesis of 3D faces. 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187—194. Blunsden, S., Fisher, R.B. and Andrade, E.L., 2006. Recognition of coordinated multi agent activities, the individual vs. the group. Workshop on Computer Vision Based Analysis in Sport Environments (CVBASE), pp. 61—70. Boella, G., Tosatto, S.C., Garcez, A.A., Genovese, V., Ienco, D., and van der Torre, L., 2011. Neural symbolic architecture for normative agents. 10th International Conference on Autonomous Agents and Multiagent Systems, 3, pp. 1203—1204. Boronat, F., Mekuria, R., Montagud, M., Cesar, P. 2012. Distributed media synchronization for shared video watching: issues, challenges, and examples. Social Media Retrieval, Springer Computer Communications and Networks series. Boukricha, H., Wachsmuth, I., Hofstaetter, A. and Grammer, K., 2009. Pleasure-arousal-dominance driven facial expression simulation. 3rd International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 119—125. 114 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Boykov, Y. and Kolmogorov, V., 2003. Computing geodesics and minimal surfaces via graph cuts. International Conference on Computer Vision (ICCV), pp. 26—33. Brandstein, M. and Ward, D., 2001. Microphone Arrays: Signal Processing Techniques and Applications. Springer-Verlag, Berlin. Braathen, B., Bartlett, M.S., Littlewort, G., Smith, E. and Movellan, J.R., 2002. An approach to automatic recognition of spontaneous facial actions. 5th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 360—365. Brave, S., Nass, C. and Hutchinson, K., 2005. Computers that care: Investigating the effects of orientation of emotion exhibited by an embodied computer agent. International Journal of HumanComputer Studies, 62, pp. 161—178. Brooks, C.H., Fang, Y., Joshi, K., Okai, P., and Zhou, X. , 2007. Citepack: An autonomous agent for discovering and integrating research sources. AAAI Workshop on Information Integration on the Web. Bui, T.D., 2004. Creating emotions and facial expressions for embodied agents. Ph. D. University of Twente. Bulterman, D.C.A. and Rutledge, L.W., 2008. SMIL 3.0: Flexible Multimedia for Web, Mobile Devices and Daisy Talking Books. Springer. Burley, B. and Lacewell, D., (2008). Ptex: per-face texture mapping for production rendering. Computer Graphics Forum, 27(4), pp. 1155– 1164. Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U. and Narayanan, S., 2004. Analysis of emotion recognition using facial expressions, speech and multimodal information. 6th International Conference on Multimodal Interfaces (ICMI), pp. 205— 211. Cappé, O., 1994. Elimination of musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Transactions on Speech and Audio Processing, 2(2), pp. 345—349. Carranza, J., Theobalt, C., Magnor, M.A. and Seidel, H., 2003. Free-viewpoint video of human actors. ACM Transactions on Graphics (TOG), 22(3), pp. 569—577. Catanese, S., Ferrara, E., Fiumara, G., and Pagano, F., (2011). A framework for designing 3d virtual environments. Proceedings of the 4th International ICST Conference on Intelligent Technologies for Interactive Entertainment, ACM. Cerekovic, A., Pejsa, T. and Pandzic, I., 2009. 
RealActor: Character animation and multimodal behaviour realization system. 9th International Conference on Intelligent Virtual Agents (IVA), pp. 486—487. Chai, J. and Hodgin, J.K., 2005. Performance animation from low-dimensional control signals. ACM Transactions on Graphics, 24(3), pp. 686—696. Chen, L.S., 2000. Joint processing of audio-visual information for the recognition of emotional expressions in human-computer interaction. PhD Thesis, UIUC. Chum, O. and Matas, J., 2008. Optimal randomized RANSAC. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8), pp. 1472—1482. Chung, T.H., Kress, M. and Royset, J.O., 2009. Probabilistic search optimization and mission assignment for heterogeneous autonomous agents. IEEE International Conference on Robotics and Automation (ICRA), pp. 939—949. Cignoni, P., Montani, C., Rocchini, C., and Scopigno, R., (1998). A general method for preserving 115 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools attribute values on simplified meshes. Proceedings of the Conference on Visualization ‘98. Computer Society Press, pp. 59–66. Codognet, P., 2011. A simple language for describing autonomous agent behaviours. 7th International Conference on Autonomic and Autonomous Systems (ICAS), pp. 105—110. Cohen, J., Olano, M., and Manocha, D., (1998). Appearance-preserving simplification. Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, ACM, pp. 115–122. Comon, P. and Jutten, C., 2010. Handbook of Blind Source Separation, Independent Component Analysis and Applications. Academic Press, Elsevier. Crandall, D., Owens, A., Snavely, N. and Huttenlocher, D., 2011. Discrete-continuous optimization for large-scale structure from motion. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3001—3008. Craven, P.G. and Gerzon, M.A., 1975. Coincident microphone simulation covering three dimensional space and yielding various directional outputs. US. Pat. 4042779. Cui, Y., Li, B. and Nahrstedt, K., 2004. oStream: Asynchronous streaming multi-cast in applicationlayer overlay networks. IEEE Journal on Selected Areas in Communications, 22(1), pp. 91—106. Curless, B. and Levoy, M., 1996. A volumetric method for building complex models from range images. 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 303—312. d’Eon, E. and Irving, G., 2011. A quantized-diffusion model for rendering translucent materials. ACM Transactions on Graphics (TOG), 30(4), p. 56. Darch, J., Milner, B. and Vaseghi, S., 2008. Analysis and prediction of acoustic speech features from mel-frequency cepstral coefficients in distributed speech recognition architectures. Journal of the Acoustic Society of America, 124, pp. 3989—4000. Davis, S. and Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4), pp. 357—366. De La Torre, F. and Cohn, J.F., 2011. Facial expression analysis. Guide to Visual Analysis of Humans: Looking at People, Springer. Demeure, V., Niewiadomski, R. and Pelachaud, C., To Appear. How believability of virtual agent is related to warmth, competence, personification and embodiment? MIT Presence. Dias, J., Mascarenhase, S. and Paiva, A., 2011. FAtiMA modular: Towards an agent architecture with a generic appraisal framework. Standards in Emotion Modelling (SEM). Drescher, C. and Thielscher, M., 2011. 
ALPprolog – A new logic programming method for dynamic domains. Theory and Practice of Logic Programming, 11(4—5), pp. 451—468. Durrieu, J.L., Richard, G., David, B. and F’evotte, C., 2010. Source/filter model for unsupervised main melody extraction from polyphonic audio signals. IEEE Transactions on Audio, Speech and Language Processing, 18(3), pp. 564—575. Durrieu, J.L., David, B. and Richard, G., 2011. A musically motivated mid-level representation for pitch estimation and musical audio source separation. IEEE Journal on Selected Topics in Signal Processing, 5(6), pp. 1180—1191. Eide, E. and Gish, H., 1996. A parametric approach to vocal tract length normalization. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1, pp. 346—348. Eisert, P. and Girod, B., 1998. Analyzing facial expressions for virtual conferencing. IEEE Computer Graphics and Applications, 18(5), pp. 70—78. 116 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Eisert, P., 2002. Model-based camera calibration using analysis by synthesis techniques. 7th International Workshop on Vision, Modeling and Visualization (VMV), p. 307. Eisert, P. and Hilsmann, A., 2011. Realistic virtual try-on of clothes using real-time augmented reality methods. IEEE ComSoc MMTV E-Letter, 6(8), pp. 37—48. Eisert, P. and Rurainsky, J., 2006. Geometry-assisted image-based rendering for facial analysis and synthesis. Elsevier Signal Processing: Image Communication, 21(6), pp. 493—505. Ekman, P., Friesan, W. V. and Hager, J.C., 2002. Facial Action Coding System (FACS). Consulting Psychologists Press, Stanford University, Palo Alto. Ephraim, Y. and Malah, D., 1984. Speech enhancement using a MMSE short-time spectral amplitude estimator, IEEE ASSP-32, pp. 1109—1121. Ephraim, Y. and Malah, D., 1985. Speech enhancement using a MMSE error log-spectral amplitude estimator, IEEE ASSP-33, pp. 443—445. Ephraim, Y. and Malah, D., 1998. Noisy speech enhancement using discrete cosine transform. Speech Communication, 24, 249—257. Farre, M., Wang, O., Lang, M., Stefanoski, N., Hornung, A. and Smolic, A., 2011. Automatic content creation for multiview autostereoscopic displays using image domain warping. IEEE International Conference on Multimedia and Expo (ICME), pp. 1—6. Fechteler, P. and Eisert, P., 2011. Recovering articulated pose of 3D point clouds. 8th European Conference on Visual Media Production (CVMP), London, UK. Fehn, C., Kauff, P., de Beeck, M.O., Ernst, F., Ijsselsteijn, W., Pollefeys, M., Van Gool, L., Ofek, E. and Sexton, I., 2002. An evolutionary and optimized approach on 3D-TV. IBC, 2, pp. 357—365. Fischler, M. and Bolles, R., 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), pp. 381—395. Fontana, S., Grenier, Y. and Farina, A., 2006. A system for head related impulse responses rapid measurement and direct customization. 120th Convention Audio Engineering Society (AES), pp. 1— 20. Frahm, J., 2010. Fast robust large-scale mapping from video and internet photo collections. ISPRS Journal of Photogrammetry and Remote Sensing, 60(6), pp. 538—549. Franco, J.S. and Boyer, E., 2009. Efficient polyhedral modeling from silhouettes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3), pp. 414—427. Furui, S., 2001. Digital Speech Processing, Synthesis and Recognition, 2nd Edition, New York: Marcel Dekker. Furukawa, Y. and Ponce, J., 2009a. 
Carved visual hulls for image-based modeling. International Journal of Computer Vision, 81, p. 5367. Furukawa, Y. and Ponce, J., 2009b. Dense 3D motion capture for human faces. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1674—1681. Gales, M. and Young, S., 2007. The application of Hidden Markov Models in speech recognition. Foundatinos and Trends in Signal Processing, 1, pp. 195—304. Garrett-Glaser, J., 2011. Diary of an x264 Developer. http://x264dev.multimedia.cx/. 117 FP7-ICT-287723 - REVERIE D3.1-Report On Available Cutting-Edge Tools Gauvain, J.L. and Lee, C.H., 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2), pp. 291—298. Gavrila, D.M., 1999. The visual analysis of human movement: A survey. Computer Vision and Understanding, 73(1), pp. 82—98. Gebhard, P., 2005. ALMA – A layered model of affect. 4th International Joint Conference on Autonomous Agents and Multi-agent Systems (AAMAS), pp. 29—36. Gillet, O. and Richard, G., 2008. Transcription and separation of drum signals from polyphonic music. IEEE Transactions on Audio, Speech and Language Processing, 16(3), pp. 529—540. Godsill, S. and Rayner, P., 1998. Digitial audio restoration – A statistical model-based approach. Applications of Digital Signal Processing to Audio and Acoustics, pp. 133—194. Graciarena, M., Shriberg, E., Stolcke, A., Enos, F., Hirschberg, J. and Kajarekar, S., 2006. Combining prosodic, lexical and cepstral systems for deceptive speech detection. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1, pp. 1033—1036. Grammer, K. and Oberzaucher, E., 2006. The reconstruction of facial expressions in embodied systems: New approaches to an old problem. ZIF Mitteilungen, 2, pp. 14—31. Gratch, J. and Marsella, S., 2004. A domain independent framework for modeling emotion. Journal of Cognitive Systems Research, 5(4), pp. 269—306. Gratch, J., Marsella, S. and Petta, P., 2009. Modeling the cognitive antecedents and consequences of emotion. Cognitive Systems Research, 10(1), pp. 1—5. Grenier, Y. and Guillaume, M., 2006. Sound field analysis based on generalized prolate spheroidal wave sequences. 120th Convention of the Audio Engineering Society (AES), pp. 1—7. Grosz, B.J. and Kraus, S., 1996. Collaborative plans for complex group action. Artificial Intelligence, 86(2), pp. 269—357. Guillaume, M. and Grenier, Y., 2007. Sound field analysis based on analytical beamforming. EURASIP Journal on Advances in Signal Processing, (1), p. 189. Ha, S., Bai, Y. and Liu, C.K., 2011. Human motion reconstruction from force sensors. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 129—138. Hashimoto. Y., Ishibashi, Y. 2006, Influences of network latency on interactivity in networked rockpaper-scissors. Proceedings of 5th ACM SIGCOMM workshop on Network and system support for games (NetGames '06) Art No. 23 Hasler, N., Stoll, C., Sunkel, M., Rosenhahn, B. and Seidel, H., 2009. A statistical model of human pose and body shape. 30th Conference of the European Association for Computer Graphics (EUROGRAPHICS), 28(2), pp. 337—346. Hayward, V., Astley, O.R., Cruz-Hernandez, M., Grant, D. and Robles-De-La-Torre, G., 2004. Haptic interfaces and devices. Sensor Review, 24(1), pp. 16—29. Heloir, A., and Kipp, M., 2009. EMBR - a real time animation engine for interactive embodied agents. 9th International Conference on Intelligent Virtual Agents (IVA), pp. 