Performance Evaluation of an OpenCL-based Visual SLAM Application on an Embedded Device
Olise-Emeka Charles Okpala
Master of Science
Computer Science
School of Informatics
University of Edinburgh
2014
Abstract
The use of GPUs as accelerators for general purpose applications is standard practice in
high performance computing. Their highly parallel architecture and very high memory
bandwidth are two key reasons for their wide adoption. Embedded devices like mobile
phones and tablets now come equipped with GPUs as well as other special purpose
accelerators. As with their desktop counterparts, mobile GPUs have started to attract
attention as platforms for executing computationally intensive applications. Executing
such applications has traditionally been impractical on mobile devices due to their low
processing power occasioned by energy and thermal considerations.
Due to the difference in their target operating environments, mobile and desktop
GPUs have very different architectures. Differences include the number of processing cores and clock speed. While desktop GPUs typically possess thousands of high frequency cores, mobile devices have significantly fewer compute cores operating at much lower frequencies.
These and other limitations of embedded GPUs necessitate the evaluation of candidate computationally intensive applications on mobile devices themselves. Research has focused on image processing and augmented reality applications. This is motivated by
today’s use of mobile devices for media consumption.
One area attracting research interest is real time 3D scene reconstruction using
mobile phones. This project used the OpenCL programming standard to evaluate the
performance of a representative application (KinectFusion) on the ARM Mali GPU.
Optimisation opportunities investigated include better utilisation of the device’s memory hierarchy and use of vector instructions to exploit the SIMD architecture of the
GPU. Also considered were means of using the CPU and GPU cooperatively.
Optimisation efforts were hindered by the way computations are expressed in the algorithm. Where possible, restructuring the implementation helped to overcome these challenges. The final version of the project achieved speedups of 1.5 and 3.3 over the initial implementation and a CPU-only version respectively.
Acknowledgements
I would like to express my gratitude to my supervisor, Mike O'Boyle, for his guidance
and assistance in the course of carrying out this project.
Special thanks to my family and friends for all the encouragement. I am indebted
to Nigeria LNG Limited for graciously sponsoring my MSc programme.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Olise-Emeka Charles Okpala)
Table of Contents
1 Introduction
  1.1 Project Contributions
  1.2 Outline

2 Background
  2.1 Review of Heterogeneous Computing
  2.2 OpenCL Framework
      2.2.1 Platform Model
      2.2.2 Execution Model
      2.2.3 Memory Model
      2.2.4 Programming Model
  2.3 The Mali T600 GPU Series
  2.4 Visual SLAM and KinectFusion
  2.5 Related Work
      2.5.1 Visual Computing Acceleration
      2.5.2 OpenCL-based Embedded Vision Systems

3 Application Analysis
  3.1 Algorithm Overview
      3.1.1 Dependence Analysis
      3.1.2 Task Assignment
  3.2 Kernels Development and Profiling

4 Program Optimisations
  4.1 General Optimisations
      4.1.1 Reduction of OpenCL Runtime Overheads
      4.1.2 Memory Optimisations
  4.2 Kernel Optimisations
      4.2.1 Unit Conversion Kernel
      4.2.2 Bilateral Filter Kernel
      4.2.3 Volume Integration Kernel
      4.2.4 Optimisation of the Track and Reduce Kernels
  4.3 Execution Optimisations

5 Overall Evaluation
  5.1 Energy Considerations

6 Conclusions

Bibliography
Chapter 1
Introduction
The last few years have seen a remarkable increase in the level of sophistication present
in mobile processors. Advancements in hardware design in the desktop/server computer space have been successfully translated to embedded systems like mobile phones
and tablets. Today’s embedded devices feature multi core CPUs, Graphics Processing
Units (GPUs) and high speed DRAM. The result is that in addition to traditional applications in telephony, smart phones are now being used to both generate and consume
media rich content. Applications include social networking, spreadsheets, music/video
playback and mobile gaming. As has been successfully done in the desktop/server
computing space, developers are beginning to utilise the heterogeneous resources of
mobile devices (notably CPU and GPU) in a coordinated manner to execute computationally intensive applications. However, two reasons make this challenging.
Firstly, embedded devices have lacked support for general purpose heterogeneous programming languages like CUDA [2], requiring solutions to be recast as graphics programs. This has limited the range of general purpose applications developed. The
recent development of the Open Computing Language (OpenCL) and its adoption by
mobile device manufacturers mean that true general purpose heterogeneous computing
is now possible. An even more challenging issue pertains to how embedded devices
are designed. Specifically, power constraints limit the number of compute cores available. For example, while thousands of cores are common in desktop GPUs, embedded
devices typically have no more than eight cores. Another consequence of the power
limitation is that clock frequencies are much lower in embedded systems. Additionally, mobile GPUs are usually integrated with the CPU on a single chip and share the
same memory. They therefore lack the dedicated high bandwidth memory that makes
desktop GPUs suitable for throughput computing [33, 34].
These constraints mean that direct ports of desktop implementations of computationally intensive algorithms are not guaranteed to work on embedded devices since
mobile architectures are usually not able to take advantage of optimisations that work
on desktop systems [34]. Several researchers have evaluated the performance of image
processing and basic computer vision algorithms on mobile devices.
Driven in part by advancements in mobile camera technology, one class of vision
applications that is attracting interest is the use of a mobile device for generating geometrically accurate 3D models of a scene in real time as it is being scanned by the
camera. Examples include Google's Project Tango [30] and EPSRC PAMELA [10]. These projects have the potential to provide more immersive augmented reality experiences, better navigation assistance for the disabled, and so on.
This project assesses the performance of a representative 3D reconstruction algorithm on embedded devices and explores opportunities for application optimisation.
The algorithm chosen is KinectFusion [25], and an OpenCL-based implementation was developed for the ARM Mali T604 mobile GPU.
1.1 Project Contributions
Previous OpenCL-based evaluation work on mobile and embedded devices has focused on algorithms for processing a single static image. This project extends these prior research efforts by considering an application that processes a stream of images under real-time constraints. Such applications place much greater demands on the
limited computing resources available on mobile devices.
The KinectFusion application and the algorithms it uses were developed specifically for execution on massively parallel GPUs. The limited processing power of
embedded devices motivates the consideration of different reformulations of the algorithm. This project presents a method of interleaving different iterations of the algorithm to use both the CPU and GPU concurrently.
1.2 Outline
Chapter 2 provides background information relevant to this project, including a discussion of related work.

Chapter 3 analyses the KinectFusion algorithm and provides a motivation for the OpenCL implementation. Issues considered include parallelism identification and task partitioning between the CPU and GPU. The results obtained from profiling the initial implementation are discussed.

Chapter 4 presents the optimisations applied. Code, memory and execution optimisations are evaluated.

Chapter 5 conducts an evaluation of the final version of the algorithm.

Chapter 6 concludes this dissertation. A review of the profitability of the optimisations is presented as well as recommendations for future work.
Chapter 2
Background
This chapter discusses concepts relevant to this project. A brief overview of heterogeneous computing is presented. This is followed by coverage of the OpenCL standard
and its implementation on the Mali GPU. An evaluation of related research concludes
this chapter.
2.1 Review of Heterogeneous Computing
Heterogeneous computing as a problem solving approach has existed in various forms.
An early use was in supercomputing environments composed of different classes of
machines connected via a network. In this context, heterogeneous computing is defined as 'the well-orchestrated and coordinated effective use of a suite of diverse
high-performance machines (including parallel machines) to provide superspeed processing for computationally demanding tasks with diverse computing needs’ [17]. The
objective is to efficiently solve problems possessing different types of embedded parallelism by using the machines most suitable for each kind of parallelism.
The first manifestation of heterogeneity in desktop computers was the use of specialised floating point coprocessors to provide or accelerate arithmetic computations.
Other dedicated accelerators include IO processors and graphics processing units (GPUs).
Using these specialised processors allowed the CPU’s performance on general purpose
code to improve.
Processor manufacturers have traditionally increased performance by exploiting
advances in manufacturing technology to increase CPU clock frequencies. Power utilisation has been kept low by reducing the chip's operating voltage. Eventually, it became impossible to reduce the voltage further and still obtain reliable operation (as it would no
longer be possible to reliably distinguish between 1 and 0 voltage levels). Thus, further increase in operating frequency would result in greater power utilisation with the
attendant thermal management challenges [12]. The solution adopted has been to rely
on many low frequency CPU cores to deliver improved performance and to exploit the
heterogeneous nature of today’s machines.
The GPU, while initially designed exclusively for rendering images, has always possessed characteristics that make it attractive for general purpose computing. These
attributes include its highly parallel nature (thousands of cores) and very high speed
graphics memory. Two factors that prevented the use of GPUs for general purpose parallel processing were the lack of support for floating point arithmetic and the fact that
they could only be programmed using graphics paradigms. The inclusion of floating
point support and development of languages like CUDA [2] meant that GPUs could be
used for general purpose computing [26]. Several researchers have demonstrated the
performance benefits that result from using general purpose GPU (GPGPU) computing
[28, 8]. In many contexts, the term heterogeneous computing is today used to refer specifically to the CPU + GPU combination.
In addition to the GPU, other specialised hardware accelerators like DSPs and FPGAs exist in modern computer systems and have been considered for use in general
purpose parallel programming [22].
Proper exploitation of the increased computing capabilities requires the use of well-defined programming languages and standards. The next section presents OpenCL, a
framework for explicitly programming heterogeneous computing systems.
2.2 OpenCL Framework
OpenCL [23] is a standard for cross-platform heterogeneous systems development.
Applications written in OpenCL can run on general purpose processors like CPUs as
well as specialized processors like GPUs, FPGAs and DSPs. The OpenCL framework
is composed of a programming language, a runtime system and an API. The specification
defines a core set of APIs that all implementers must support in addition to providing
facility for vendor specific extensions. OpenCL programs are written using a restricted
version of the C language, augmented to support parallel programming. Parallel execution is expressed using kernels which are similar to functions in the C language. A
consequence of OpenCL’s goal of application portability is that hardware abstractions
are low level. A programmer is required to dynamically determine the types and capabilities of accelerators available in a given system and select the ones most suitable for
the tasks. The kernels are then compiled for the specific device families discovered.
The usual way of achieving this is to compile the kernels online (at application startup).
Offline compilation is also possible, with binaries being scheduled for execution on the
devices. However, this would restrict the application to working on specific vendors’
implementations of OpenCL.
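As an illustration of the online compilation flow just described, the host-side sketch below discovers a GPU device, builds a program from source at application start-up and extracts a kernel object. The helper name is hypothetical and error checking and resource release are omitted.

#include <CL/cl.h>

/* Minimal sketch of online kernel compilation; kernel_src and kernel_name
   are placeholders supplied by the caller. */
cl_kernel build_kernel_online(const char *kernel_src, const char *kernel_name)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    /* Compile the kernel source at application start-up ("online"). */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    return clCreateKernel(prog, kernel_name, NULL);
}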
The OpenCL specification uses a set of abstract models to describe the architecture
of heterogeneous devices and application execution. These models are as follows:
2.2.1 Platform Model
This is an abstract description of a heterogeneous system at the hardware level. This
model is made up of a single host and one or more devices. The host is responsible
for coordinating program execution. It interfaces with the environment external to the
OpenCL program to perform tasks like IO and interacting with users. The host is a
general purpose processor like a CPU. The devices provide acceleration for OpenCL
code. A device is hierarchically divided into independent compute units that are themselves made up of processing elements. OpenCL kernels are executed on the processing elements. Examples of OpenCL devices include GPUs, DSPs and CPUs (either whole processors or individual cores in a multi-core chip).
A concrete implementation of this model contains devices from a single vendor and
maps the model to vendor-specific hardware architecture. Fig 2.1 presents a description
of this model.
Figure 2.1: OpenCL platform model [23]
2.2.2 Execution Model
OpenCL applications are composed of two distinct parts: kernels that execute on devices and a host program. The host program is responsible for setting up the environment as well as initiating execution of kernels. The execution model defines the
interaction between the host and the devices as well as how kernels are executed on the
devices. The responsibilities of the host program are described in terms of the context
and command queues:
A context provides the definition of the environment within which kernels execute.
It is set up and managed by the host program. In addition to the kernels, a context
comprises the following:
• A list of devices on which kernels will be executed. All the devices in a given
context must be from the same platform, i.e. the same hardware vendor. A context can be created for a specific class of devices (e.g. GPU) or for all devices in
a system.
• Program Objects encapsulating source code and executables that define the kernels. As previously mentioned, OpenCL kernels are usually built as part of application start up. This is done by the host using the compiler provided by the
platform for which the context has been created.
• Memory Objects which are data structures accessible to devices and transformed
by instances of kernel execution. The host program creates these memory objects
to serve as either inputs or outputs to the kernels. There are two kinds of memory
objects in OpenCL - Buffers and Images. Depending on how memory objects are
defined and the specifics of the runtime, explicit data movement may be required
between host and device memory.
Host-device interaction occurs via command queues. The host program submits
commands for execution on a device. Note that a command queue can be attached to
only one device. The functions of OpenCL commands are: scheduling kernel execution; managing transfer of memory objects to and from devices; and constraining the order in which
kernels are executed.
Commands placed in a queue are executed in FIFO order by default. Queues can be
configured to behave in out of order mode. In this case, the programmer is responsible
for explicitly managing dependencies among commands in the queue using events.
2.2.2.1 Semantics of OpenCL Parallel Execution
Each kernel enqueue command results in the creation of independent parallel threads
of execution by the OpenCL runtime. Each instance of the kernel is referred to as a
work item. The number of work items created is determined by the size of an integer
index space known as an NDRange specified by the programmer. A maximum of
3 dimensions are supported for the index space. One work item is created for each
point in this index space. Work items are clustered into equally sized work groups.
Work groups are the unit of concurrent execution in OpenCL and provide the only
means of synchronising the activities of work items. All work items in the same work
group execute concurrently on the same compute unit of an OpenCL device and share
resources of the device.
It should be noted that the OpenCL specification does not provide guarantees about
parallel execution of work items within a work group. An implementation may schedule a work group in smaller batches or serialise the work items so long as the semantics
of concurrent execution are preserved [23, 24]. However, the existence of more than
one compute unit in a device means that work groups can be run in parallel.
2.2.3 Memory Model
This model specifies the components and organisation of memory in heterogeneous
systems. It also defines the consistency model supported by OpenCL. As previously
mentioned, OpenCL contains two kinds of memory objects: Buffers and Images.
Buffers are contiguous blocks of memory to which any kind of data structure can
be mapped and are accessible via pointers. They are comparable to arrays in the C
language. Images are specialised data types holding 1, 2 or 3 dimensional graphics
images designed to take advantage of the texture hardware of GPUs. Thus, Images
are opaque objects with respect to the programmer and can only be manipulated via
OpenCL calls. Access to specific parts of an Image is not allowed. Images are useful in situations requiring automatic edge handling and interpolation, as the texture hardware provides these services in an optimised manner.
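A minimal kernel sketch of this opaque access pattern is shown below; the sampler's addressing mode provides the automatic edge handling mentioned above, and a linear filter mode (with floating point coordinates) would provide interpolation. The kernel and argument names are illustrative only.

const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                      CLK_ADDRESS_CLAMP_TO_EDGE |
                      CLK_FILTER_NEAREST;

__kernel void copy_image(__read_only image2d_t src,
                         __write_only image2d_t dst)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    /* Images are opaque: access only through the read/write built-ins. */
    float4 px = read_imagef(src, smp, pos);
    write_imagef(dst, pos, px);
}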
OpenCL memory is divided into two parts: Host Memory, directly available to the host processor and managed outside of OpenCL using regular OS facilities, and Device Memory, accessible to kernels running on OpenCL devices. Device memory is
partitioned into four address spaces as follows:
Global Memory
This region is available to both the host and devices. All work items
from all available devices have read-write access to objects in this address space. This
memory region has the highest access latency. Caching of reads and writes to this
region may be supported depending on the capabilities of the device.
Constant Memory
This is a subset of global memory used to store read only items
accessed by all work items simultaneously. Objects in this region are placed by the
host processor. Depending on the capabilities of the device and OpenCL runtime, the
compiler generates specialised instructions to optimise access to items in this address
space. An example optimisation might be issuing only one load instruction just before
the kernel is executed.
Local Memory
This address space is shared by items in a work group. It is unique to
each device and depending on the capabilities of the device, will be implemented using
on chip memory. It usually provides access latencies much less than global memory.
Private Memory
This address space is unique to individual work items and is the
default scope of kernel variables (excluding pointers). It is normally implemented
using registers and thus may provide the lowest access latency.
Fig. 2.2 gives a visual illustration of how these address spaces are mapped between
a host and a single device.

Figure 2.2: Memory regions available to OpenCL kernels [24]
OpenCL defines a relaxed consistency model for memory objects. Ordering guarantees depend on the address space being accessed. For local memory shared by items
in a work group, consistency is guaranteed only at synchronisation points within the
kernel (using OpenCL barriers). Global memory consistency is not defined between work groups in the same kernel. The runtime only guarantees synchronisation between commands in the queue (i.e. at the end of one kernel and the start of the next kernel).
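The following kernel sketch (illustrative names only) shows how these address spaces appear in kernel code: data is staged from global into work-group local memory, and a barrier provides the synchronisation point at which local memory is guaranteed to be consistent.

__kernel void stage_and_sum(__global const float *in,
                            __global float *out,
                            __local float *tile)   /* one element per work item */
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    float x = in[gid];                 /* private variable (default scope) */

    tile[lid] = x;                     /* write to work-group local memory */
    barrier(CLK_LOCAL_MEM_FENCE);      /* local memory is consistent only here */

    /* A neighbouring work item's value is now safely visible. */
    int next = (lid + 1) % (int)get_local_size(0);
    out[gid] = tile[lid] + tile[next];
}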
2.2.4 Programming Model
This model defines how parallel algorithms are formulated using OpenCL. Two programming styles are supported by the standard. They are:
Data Parallelism
where the problem is decomposed using data to be processed and
independent partitions are handled in parallel. This style of programming directly
aligns with OpenCL’s execution model with each point in the NDRange independently
processing data elements. Depending on the structure of the kernel, computation may
proceed in Single Instruction Multiple Data (SIMD) or Single Program Multiple Data
(SPMD) fashion. Different levels of data parallelism are supported. Work items in a
work group may process data items in parallel. A single work item may use explicit
SIMD instructions to manipulate multiple items at a time.
Task Parallelism
which involves partitioning a problem into functional modules that
can be executed in parallel. The straightforward way of expressing task parallelism in
OpenCL is the use of an out-of-order command queue. In this case, multiple kernels
can be in execution at the same time on different compute devices.
It is possible to combine both forms of parallelism in a single application. For
example, tasks in a task-parallel algorithm can contain data parallel instructions.
These models ensure portability of OpenCL programs as all implementations are
required to conform to them. While code portability is guaranteed, performance is not
as it depends on architecture/device family specific optimisations [9, 11].
In order to address the wide difference in architecture and capabilities of devices,
the OpenCL specification defines two levels of conformance required of implementations: the full and embedded profiles. The embedded profile relaxes guidelines about floating point arithmetic, availability of data structures like 3D Images, mathematical accuracy of functions, etc. This profile is meant for embedded devices
with more stringent memory constraints [24]. The current version of the standard is
OpenCL 2.0.
2.3 The Mali T600 GPU Series
The Mali T600 series [4] is the first of a family of GPUs manufactured by ARM targeting the high-end embedded device market, designed for both graphics and general
compute applications. It uses a unified shader architecture - all computational cores
are identical and can perform all kinds of programmable shader tasks [15]. Depending
on the use case, between 1 and 8 cores exist on a chip. Each core has a dedicated L1
cache. A centralised memory management unit coordinates access to main memory
through a single L2 cache shared among the cores. The chip has a task management
unit that distributes workload among the cores. Fig 2.3 shows the block diagram of the
T604 used in this project.
Figure 2.3: The ARM Mali T604 GPU System Architecture [4]
Each core is composed of a tri-pipe execution component along with supporting
fixed function hardware units. The parts of the tri-pipe are:
• One or more Arithmetic pipelines for computation. This pipeline has a SIMD
design and works on 128 bit vector registers. Up to two arithmetic instructions
can be executed per clock cycle.
• A single Texture pipeline that handles all memory accesses related to processing
textures with a throughput of one instruction per clock cycle.
• A single Load/Store pipeline that takes care of all other memory accesses. This
is a vector based pipeline capable of loading four 32 bit words in one cycle.
The cores have a VLIW design. Fig 2.4 shows the architecture of a core.
Figure 2.4: Architecture of the Mali T600 Shader Core [15]
With respect to the OpenCL standard, the platform model is implemented with
each shader core corresponding to a compute unit and the tri-pipes corresponding to
processing elements. Work items execute as hardware threads on the cores. Up to 256
threads can be executed concurrently on each core [14]. The actual value depends on
the number of registers required by the kernel. OpenCL kernels typically only use the
Arithmetic and Load/Store pipelines. The Texture pipeline is only used for memory
access related to Image processing and for barrier operations. The hardware scheduler
assigns work groups to cores in batches. All the work groups in a batch execute on the
same core in round robin fashion. At every clock cycle, a thread is chosen to execute
one instruction. Threads in a work group are selected in order of increasing identifiers.
In an index space with more than 1 dimension, the lower dimensions are incremented as
the inner indices in the sequence. For example, a 2D index space would be sequenced
as [1,0], [2,0], ..., [1,1], [2,1], ... Threads in the next adjacent work group are then
scheduled in the same fashion. All threads have their own program counter and do not
execute in lockstep like warp/wavefront based architectures [5].
The Mali GPU is built as part of a System on Chip (SoC). Thus, it has a host-unified
memory model. This implies that both the GPU and the host processor share the same
physical memory. OpenCL local and global memory spaces are implemented using
RAM backed by L1 and L2 caches.
The Mali T600 GPUs conform to the OpenCL 1.1 full profile.
2.4 Visual SLAM and KinectFusion
Simultaneous Localisation and Mapping (SLAM) is a term used to describe the process
by which a robot acquires a spatial map of a previously unexplored environment while
keeping track of its position in the environment. A global model of the environment
is repeatedly updated using cues obtained as the robot moves around. Visual SLAM
refers to solutions that work using measurements obtained from camera sensors. The
SLAM problem is challenging due to the statistically dependent nature of measurement
noise. Inaccuracies in earlier estimates accumulate and their effect becomes amplified over time. This issue, as well as the dynamic nature of the environment, makes
solutions to the SLAM problem very computationally intensive [31].
KinectFusion [25, 16] is a visual SLAM algorithm for producing geometrically
accurate 3D models of a physical scene using commodity depth cameras (Kinect). It
leverages techniques from computer vision and robotics. A single global model of the
scene is maintained. The model used is a 3D volumetric representation based on [7]
where each point (voxel) stores a Truncated Signed Distance Function (TSDF) value
representing the current estimate of the distance of the point relative to the surface. The
global model is updated using new depth data obtained from the camera. New views
of the scene result in a more accurate model in the classic Bayesian manner. At each
point in time, data from the depth camera are used to predict the current pose of the
camera. TSDF values for points that lie within the current camera frustum are updated.
Camera pose estimation is carried out using the Iterative Closest Point (ICP) algorithm
[27].
Example applications of KinectFusion include augmented reality systems, 3D printing and physics simulation. Fig 2.5 shows an example scene reconstruction.
Figure 2.5: Example 3D scene reconstruction using KinectFusion [16]
Processing of the depth data occurs in several stages. Each stage contains a high degree of parallelism. Detailed analysis of these stages will be presented in section 3.1.
KinectFusion as well as the algorithms it leverages have been designed with execution
on the GPU in mind. Fig 2.6 shows the processing flow of the system.
Figure 2.6: KinectFusion Execution Workflow [25]
2.5 Related Work
The recent increase in the amount of processing power available on mobile and embedded devices means that they have been actively considered as tools for visual computing. The works relevant to this project are presented below.
Parallel Tracking and Mapping (PTAM) [18] is a keyframe based SLAM algorithm
designed to work with monocular RGB cameras and used in Augmented Reality applications. The concepts introduced by this algorithm serve as a foundation for the
KinectFusion algorithm. PTAM provides real time tracking of the camera’s position
by repeatedly running two procedures in parallel: Tracking and Mapping. Its effectiveness in generating room scale maps and efficient camera tracking have been demonstrated on desktop computers. Klein and Murray [19] investigated the applicability of
the algorithm on mobile phones using the Apple iPhone 3G. Their implementation was
designed to work on the phone's CPU. Owing to the single threaded nature of the device's processor, the two aspects of the algorithm were performed in an alternating manner
- a foreground thread handled camera tracking while map optimisation was performed
in the background. The process of optimising the map was performed only when no
new frame was available for camera tracking.
Due to its dependence on a large number of data points per frame, the PTAM algorithm is computationally intensive and relies on the high clock rate of the processor to
deliver real time results. The limited processing power available on the iPhone compared with a desktop machine meant that a straightforward implementation would not
perform effectively on the mobile phone. This problem was tackled by a reformulation of the algorithm. The mapping procedure, known as bundle adjustment, works by
constructing a 4 level image pyramid to which all possible landmarks are added. This
may result in data duplication. The mobile phone implementation uses only a subset of
the available data and a different pyramid construction procedure. Two other modifications to the mapping process are made. Firstly, measurements of any given map point
observed by multiple keyframes are ranked in order of usefulness and only the most
useful ones are retained. This has the effect of creating a less dense map in contrast
with the desktop implementation that actively seeks to construct the densest possible
map. The other modification involves erasing redundant keyframes once a predefined
keyframe count threshold is exceeded.
The tracking procedure was also reformulated. The modification involved omitting
the stage that ensures tracking accuracy through large feature measurements. This
omission was occasioned by the limited bandwidth of the phone’s CPU and the sparse
map used. The end result of these adjustments is that while the PTAM algorithm
worked on the phone, it was significantly less accurate compared with the desktop
version.
2.5.1 Visual Computing Acceleration
The limited processing power of mobile CPUs has led researchers to consider the
use of other special purpose processors present on the devices as accelerators. Several
possibilities have been investigated.
An early approach involved the use of Field Programmable Gate Arrays (FPGAs).
An FPGA is an Integrated Circuit whose function can be altered by changing the interconnectivity of its components. This reconfiguration is achieved by downloading
a bitstream that specifies the functionality to be implemented. Maclean [22] evaluated their suitability for vision applications. FPGAs make it possible to exploit the
parallelism inherent in vision algorithms. However, their lack of support for floating
point operations and even more importantly, the very low level hardware programming
required to configure them greatly reduce their applicability.
A slightly different approach was pursued by Seung et al [20]. Their work combined a software implementation with hardware level acceleration. This was done
in the context of augmented reality applications on an Intel Atom powered Mobile
Internet Device. The application runs on the CPU and offloads the computationally
intensive parts to custom built hardware accelerators. The work focused on accelerating image recognition and matching algorithms. The accelerators contained elements
like Static RAM memory, custom computation pipeline and control unit. Experimental
results show that depending on the size of input images, the use of custom accelerators
resulted in a 14 times speedup with respect to a heavily optimised CPU-only implementation. However, the use of purpose-built hardware to accelerate parts of an application
prevents this method from being generally applicable. A better approach is to have a
single device capable of accelerating all applications.
The GPU on mobile devices serves this purpose. Advantages include floating point
support, programmable compute cores and parallel architecture. Kwang-Ting and Yi-Chu [6] implemented a face recognition algorithm on a smartphone. The use of the embedded GPU as an accelerator resulted in both speedup and improved energy efficiency compared
with a CPU only implementation. Additionally, it was concluded that for small workloads, mobile GPUs provide greater energy efficiency than their desktop counterparts.
The loss of efficiency after a threshold was attributed to the limited cache sizes of mobile GPUs. Singhal et al [29] evaluated several image processing algorithms (speeded-up robust feature detection, non-photorealistic rendering and stereo matching) on an
embedded GPU. The results presented showed successful acceleration compared with
a CPU only implementation. Despite being demonstrably suitable for general purpose processing, the fact that mobile GPUs could only be programmed using graphics
standards like the OpenGL ES API limited their use as not all algorithms could be
expressed in terms of graphics primitives. Fortunately, several embedded device manufacturers have begun providing support for OpenCL for general purpose programming.
2.5.2 OpenCL-based Embedded Vision Systems
Wang et al [35] reported the first use of OpenCL for implementing computer vision applications on a real mobile device. Previous work by Leskela et al [21] used OpenGL to
emulate the OpenCL embedded profile on mobile devices. Wang et al used the object
removal algorithm as a case study. All the computations in the algorithm were implemented as OpenCL kernels and executed on the GPU, with the CPU only serving as
application coordinator. The primary optimisation employed was the use of GPU local
memory as a software managed cache for frequently accessed data. The authors presented results of their implementation’s performance for various problem sizes. While
no comparison was made with a base implementation, the authors concluded that executing the class of computer vision problems typified by object removal was feasible
on mobile devices.
A follow up work by Wang et al [34] evaluated the performance of the Scale-Invariant Feature Transform (SIFT) algorithm on a smartphone. The implementation
was done on the Qualcomm Snapdragon S4 chipset that supports OpenCL Embedded
Profile acceleration for both the CPU and GPU. Profiling results were used to guide efficient partitioning of tasks between the two processors. Optimisations applied include
appropriate data structure selection (OpenCL Images to take advantage of implicit vectorisation) and elimination of branches in kernel code. Additionally, prior knowledge
about the data distribution was used to limit the search space of the application. Local memory was also used to reduce the effect of memory latency. Results presented
demonstrated superior performance of this heterogeneous approach compared with a
CPU only implementation.
A common theme of these reports is that while harnessing the heterogeneous capabilities of embedded devices makes more computing power available, there is usually
the need to restructure vision algorithms to obtain acceptable results. While speedups
are obtained relative to a mobile CPU only implementation, performance still falls
below that of desktop systems.
Observe that while mobile and embedded devices contain more accelerators than
just GPUs, at this time only the GPU is supported by OpenCL implementations.
Chapter 3
Application Analysis
This chapter presents an overview of the KinectFusion algorithm focusing on how its
structure maps to the OpenCL programming model. This is followed by a discussion
of the results of profiling the application.
3.1 Algorithm Overview
KinectFusion operates by repeatedly refining a model of the scene being reconstructed
using new depth data. Fig 3.1 gives a high level view of the algorithm. The steps in
the process are described in detail below:

Figure 3.1: Execution flow of the KinectFusion application
Initialisation
At application startup, the required memory buffers are allocated. Most of the buffers
are allocated by the host using OS memory management commands. OpenCL is only
used to allocate space for the data structure used to represent the global model of the
scene. This volume is allocated as contiguous linear memory on the GPU.
Frame Acquisition
Each iteration of the algorithm begins with a new depth frame being collected from the
Kinect camera. A series of preprocessing actions are carried out on this raw depth map
as follows:
1. Given the difference in units between the live depth frame and that used by
the application (millimetres and metres respectively), each pixel in the frame
is scaled by dividing by a constant factor of 1000. This is a classic case of an
embarrassingly parallel process and directly maps to the SIMD capabilities of
the GPU.
2. The depth data received from the Kinect camera is noisy and would reduce reconstruction accuracy if used directly. A noise correcting bilateral filter is thus
applied to each frame before further processing. This is basically a convolution
using a Gaussian mask that computes each output pixel by aggregating the contributions of input pixels that lie within the radius of the filter centred on the
current input position. Given that computation of each point in the output frame
is independent, a straightforward implementation is to assign each output point
to a work item. The overlap of input pixels used by consecutive threads creates
opportunities for data sharing and vectorisation.
3. A three level depth pyramid is constructed using the filtered depth image as its base, with each level computed from the preceding level. Each value in the
output depth map is the result of block averaging and sampling (at half resolution) from the values in the input depth map. Processing of each output value is
independent and can be executed in parallel.
4. Each value in the filtered depth frame is converted from image coordinates to the
Kinect sensor’s reference frame to generate a vertex map. This is computed as a
product of the depth measurement, inverted camera calibration matrix and pixel
coordinate for each point in the depth map. All computations are independent
and can be computed in parallel. A normal map is generated from the resulting
Vertex map. Each vector in the Normal map is computed using neighbouring
points in the Vertex map. Once again, all output computations are independent
and proceed in parallel. The computation of vertex and normal maps is repeated
for each level in the pyramid using the corresponding depth map for each level.
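In the notation commonly used for KinectFusion [25], the computation in step 4 above can be summarised as follows (a sketch of the standard formulation, not the project's exact code): for a pixel u with filtered depth D(u) and camera calibration matrix K,

\[
\mathbf{v}(u) = D(u)\,K^{-1}\,[\,u^{\top},\,1\,]^{\top},
\qquad
\mathbf{n}(u) \propto \big(\mathbf{v}(u_{x+1,y}) - \mathbf{v}(u)\big) \times \big(\mathbf{v}(u_{x,y+1}) - \mathbf{v}(u)\big),
\]

with each normal subsequently normalised to unit length.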
Camera Tracking
This step uses the just computed vertex and normal maps to estimate the current pose
of the camera relative to the acquired depth map. The pose is represented as a 4 × 4
matrix. Pose estimation involves the use of the ICP algorithm to determine the relative
transformation that most closely aligns points in the current frame with those of the
previous frame. The process involves two steps executed in order for each pyramid
level. The steps are:
1. Track: A process known as perspective data association is used to find correspondences between points in the current frame and the previous frame. It uses a
combination of the previous camera pose and raycasted pose (if any). Each point
in the previous vertex map is converted to camera coordinates and perspective
projected to image coordinates. The resulting point is used to access the current
vertex and normal maps. The output of this step is all points that lie within a
threshold (specified using Euclidean distance and angle).
2. Reduce: The output points of the track step are summed up using tree based reduction. This computation is carried out in parallel using barrier synchronisation
between reduction stages to ensure the right values are used in the summations.
These two steps are repeated for a prespecified number of iterations per pyramid level.
The linear system resulting from each iteration is solved using a Cholesky decomposition. Both track and reduce steps are inherently parallel and directly fit GPU computation while solving the linear system of equations maps to the CPU given the recursive
nature of the task.
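For reference, the quantity assembled by the track and reduce steps is, in the standard KinectFusion formulation [25], a point-to-plane error over the corresponding points found by data association (sketched here in generic notation rather than the project's variable names):

\[
E(\mathbf{T}) = \sum_{u} \Big( \big(\mathbf{T}\,\mathbf{v}(u) - \mathbf{v}_{\mathrm{prev}}(\hat{u})\big) \cdot \mathbf{n}_{\mathrm{prev}}(\hat{u}) \Big)^2
\]

Linearising E about the current pose estimate yields a small per-pixel contribution to a symmetric 6 x 6 system; the reduce step sums these contributions and the CPU solves the system with the Cholesky decomposition mentioned above.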
Volumetric Integration
The result of the camera tracking stage determines if the current pose differs significantly from the previous pose (by a pre-specified threshold). If it does, the new frame
is fused into the global volumetric representation of the scene. This is accomplished
by making a sweep of each point in the volume and updating the value of the SDF at
each point that lies in the current depth frame. In order to determine if a point lies in
the current frame, it is converted to a vertex in global 3D coordinates using the current
camera matrix. Processing of each point in the volume is independent and can be executed in parallel. Given the number of points that need to be visited (for example, a volume of side length 256 has 256^3 = 16,777,216 voxels), this stage is only feasible on the
GPU.
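The per-voxel update performed in this sweep is, in the formulation of [25], a running weighted average of the truncated signed distance values (again sketched in generic notation):

\[
F_k(\mathbf{p}) = \frac{W_{k-1}(\mathbf{p})\,F_{k-1}(\mathbf{p}) + w_k(\mathbf{p})\,f_k(\mathbf{p})}{W_{k-1}(\mathbf{p}) + w_k(\mathbf{p})},
\qquad
W_k(\mathbf{p}) = W_{k-1}(\mathbf{p}) + w_k(\mathbf{p}),
\]

where f_k is the TSDF value computed from the new depth frame and w_k its weight. Because each voxel is updated independently, the update maps naturally onto one work item per voxel.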
Raycasting
Raycasting is a standard process in computer graphics used in creating 3D perspectives
in 2D maps. In the case of KinectFusion, it is used as the first step in rendering the
scene being reconstructed. It extracts the surface embodied in the global volume. This
is done by walking a ray starting from each point in the output image and traversing
the volume until a surface is encountered. A surface is identified as the point along
the ray where the SDF stored at that location in the volume changes sign. Each point
identified is converted to global 3D coordinates (a vertex) and a normal for rendering.
The output of raycasting is also used as an additional input for determining the pose
of the camera in the next iteration. Walking individual rays from each starting point in
the output image is independent and can be carried out in parallel.
In conclusion, the KinectFusion algorithm is composed of several stages connected
in a rigid pipeline. Each stage of the pipeline is highly data parallel and directly maps
to the OpenCL programming model.
3.1.1 Dependence Analysis
The purpose of this analysis is to identify the amount of task parallelism contained in
the algorithm. That is, the extent to which operations in the algorithm can be overlapped. Within each iteration, computation proceeds in a pipelined manner with each
stage depending on the output of the previous stage. This hard dependence makes it impossible to interleave the steps in each iteration. Thus, an in order OpenCL queue has
to be used to ensure this dependence is satisfied. More importantly, the data structures
used to represent the scene and the camera pose preserve state across loop iterations.
This introduces a cross iteration dependence (of distance 1) that greatly limits the extent to which iterations can be overlapped. The only activities that do not depend on
state of the previous iteration are those related to frame acquisition and preprocessing
(retrieving new depth data, performing unit conversion, applying the bilateral filter and half sampling to generate the pyramids). It is thus concluded that the algorithm
has a very low degree of task parallelism.
3.1.2 Task Assignment
As previously stated, most parts of the KinectFusion algorithm are highly data parallel
and are suitable for execution on the GPU. The only actual computation performed on
the CPU is determining the solution to the linear system produced as part of camera
tracking. Therefore, apart from orchestrating kernel execution and data movement,
the CPU is idle for the most part. This is fine on desktop systems with highly parallel GPUs. However, on embedded devices, this may not represent the best use of resources. This is because the CPU and GPU have comparable core counts and the CPU has a higher clock frequency than the GPU. Tests were done to compare the performance of several functions on the CPU and GPU. The CPU versions were written as serial C functions because the OpenCL implementation from ARM does not support acceleration using the CPU. Table 3.1 shows the result of this experiment. The GPU gave the better performance for all the functions. Implementing the tasks as OpenCL kernels thus represented the most efficient solution. The issue of using the CPU and GPU concurrently is discussed in detail in section 4.3.

Table 3.1: Comparison of execution times on the CPU and GPU. All values shown are in milliseconds.

Function            CPU Execution Time    GPU Execution Time
Unit Conversion     0.76                  0.23
Bilateral Filter    426                   5.69
Generate Pyramid    2.99                  0.19
Depth to Vertex     1.25                  0.26
3.2 Kernels Development and Profiling
The implementation phase used a basic version obtained by direct manual translation
of a CUDA implementation developed by Reitmayr [13]. In order to determine the
most profitable direction of investigation, this initial implementation was instrumented
to identify the hotspots in the application - kernels that contribute the most to total
execution time. Due to the absence of OpenCL profiling tools for the Mali system,
results were generated using library functionality available in the OpenCL API. This
was done by attaching an event object to each enqueued OpenCL command.
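A sketch of this event-based timing is shown below. The command queue is assumed to have been created with the CL_QUEUE_PROFILING_ENABLE property; the helper name and work sizes are illustrative only.

#include <CL/cl.h>

/* Returns the device-side execution time of one kernel launch in milliseconds. */
static double time_kernel(cl_command_queue queue, cl_kernel kernel,
                          const size_t global_size[2], const size_t local_size[2])
{
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);
    return (double)(end - start) * 1.0e-6;  /* timestamps are in nanoseconds */
}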
Fig 3.2 shows what percentage of application execution time is accounted for by
each kernel. The results take into account the number of times each kernel was invoked.
Thus, while a single invocation of the bilateral filter kernel takes longer than the track
kernel, the latter is invoked many more times in the course of program execution and
optimising it would be more beneficial. Six kernels account for 97 % of execution time.
Disregarding the kernels required for rendering as they are not part of the workflow of
the algorithm, the top three kernels are integration, tracking and reduction kernels.
Chapter 3. Application Analysis
24
These kernels were investigated the most in the course of this project.
In addition to execution time, data about the maximum number of concurrently
executing hardware threads per GPU core were collected for each kernel. This was
done using the clGetKernelWorkGroupInfo OpenCL call per kernel to request the value
for the CL_KERNEL_WORK_GROUP_SIZE attribute. As previously mentioned, each
Mali core supports up to 256 concurrent threads. To ensure optimum performance, at
least 128 threads must be run concurrently per core [5]. Table 3.2 shows that while
most kernels possessed the optimum degree of concurrency, three kernels (Integration,
rendering and raycast) could only be launched with 64 concurrent threads. The explanation for this is that the relative complexity of these kernels results in high register usage.
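The query used to collect these figures looks roughly as follows (a sketch; the kernel and device objects are assumed to have been created earlier):

size_t max_wg = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);
/* max_wg is the largest work-group size the runtime will accept for this
   kernel on this device; the values in Table 3.2 were obtained this way. */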
Figure 3.2: Contribution of the various kernels to the total application execution time
In summary, the KinectFusion algorithm has enough data parallelism to justify
an explicitly parallel implementation using OpenCL. The main steps of the algorithm
directly correspond to OpenCL kernels. Profiling an initial implementation of the algorithm has revealed the kernels that contribute the most to overall execution time and
that would benefit most from optimisations. The next chapter presents the result of
optimising the KinectFusion algorithm.
Table 3.2: Maximum number of concurrent work items per kernel.

Kernel                   Max Workgroup Size
Render Input             64
Integration              64
Raycasting               64
Reset                    256
Bilateral Filter         128
Track                    128
Reduce                   128
Vertex to Normal         128
Render Light             256
Render Depth Map         256
Render Track Result      256
Depth to Vertex          256
Millimetres to Metres    256
Generate Pyramid         256
Chapter 4
Program Optimisations
The optimisations presented in this chapter were developed on an Arndale development
board¹ with the following characteristics:
• Samsung Exynos 5250 chip with integrated dual core Cortex-A15 CPU with
clock speed of 1.7 GHz and quad core Mali T604 GPU with clock speed of 533
MHz.
• 2GB Main Memory
• Ubuntu 12.04.2 Linux, 3.11.0-arndale+ kernel
The depth data used for testing the KinectFusion algorithm were obtained by playing
back previously recorded Kinect depth images. Generating and replaying the data were
done using the OpenNI library interface². Each input frame was 640 × 480 pixels in
size. For the purpose of this project, the frames were processed at half the resolution.
Thus, the frames were sampled to generate 320 × 240 sized images.
4.1
General Optimisations
This section discusses best practices applied that are not specific to the KinectFusion
application.
¹ http://www.arndaleboard.org/wiki/index.php/Main_Page
² https://github.com/OpenNI/OpenNI
4.1.1 Reduction of OpenCL Runtime Overheads
Each complete execution of the KinectFusion algorithm results in a large number of
OpenCL kernel invocations. The exact number of invocations is determined by the
number of input depth frames received (for example, processing 100 frames resulted
in 8,000 kernel invocations). Kernels are created from OpenCL programs using the
clCreateKernel function and disposed of using the clReleaseKernel function. A direct
implementation of the system would result in these two functions being called before
and after each kernel invocation respectively. The overhead incurred will be significant.
As part of program optimisation, all the kernels were created once at program initialisation using the clCreateKernelsInProgram function. The kernels were then placed
in an associative map (using the kernel name as the key) to be consulted each time a
kernel is invoked. The code to release the kernels was moved to an exit function. For a
small increase in memory footprint (the 22 kernels and helper functions in the program occupied 88 KB of main memory), total execution time was reduced by over 20%.
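A sketch of this caching scheme is given below; the simple lookup table stands in for the associative map described above, and the names (build_kernel_cache, get_kernel) are illustrative rather than the project's actual identifiers. Error handling is omitted.

#include <string.h>
#include <CL/cl.h>

#define MAX_KERNELS 32   /* must be at least the number of kernels in the program */

static cl_kernel kernels[MAX_KERNELS];
static char      names[MAX_KERNELS][64];
static cl_uint   kernel_count;

/* Called once at program initialisation. */
void build_kernel_cache(cl_program program)
{
    clCreateKernelsInProgram(program, MAX_KERNELS, kernels, &kernel_count);
    for (cl_uint i = 0; i < kernel_count; ++i)
        clGetKernelInfo(kernels[i], CL_KERNEL_FUNCTION_NAME,
                        sizeof(names[i]), names[i], NULL);
}

/* Consulted on every invocation instead of calling clCreateKernel each time. */
cl_kernel get_kernel(const char *name)
{
    for (cl_uint i = 0; i < kernel_count; ++i)
        if (strcmp(names[i], name) == 0)
            return kernels[i];
    return NULL;
}

/* At program exit, each cached kernel is released with clReleaseKernel. */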
4.1.2 Memory Optimisations
The OpenCL specification was originally developed for architectures having separate
host and device memory systems. Thus, the default behaviour is for the host to allocate
space for buffers and transfer them to the devices for processing. The output data are
then copied back to the host at the end of kernel execution. For memory bound kernels
that do not involve a lot of computations, this copying is expensive. The standard
provides an alternative way for devices to directly access host memory. This is through
OpenCL mapping and unmapping operations. However, the buffer has to be allocated
by the OpenCL runtime. This zero copy approach is typically not used on systems
with discrete devices (GPUs) because the reduced bandwidth resulting from accessing
memory via the PCI bus usually outweighs the benefits. However, the host-unified
memory architecture of Mali makes it a useful optimisation to explore.
Note that even when the host and devices share the same physical memory, the
default behaviour of OpenCL is for data to be copied before the devices can access
them. This is clearly redundant and wasteful. This project used the mapping/unmapping approach to eliminate data copying. The buffers were created using the
CL_MEM_ALLOC_HOST_PTR flag, as it yields the best performance on the Mali
[5]. In situations where copying could not be avoided, steps were taken to reduce the
adverse effect of copying as much as possible. One optimisation applied to kernels
that wrote to more than one output buffer was to overlap the writes. This was done by
using non-blocking buffer write commands. In order to ensure memory consistency,
the CPU still had to wait for the memory transfers to complete before launching the
next kernel. Due to the pipelined nature of the algorithm, the output of one stage serves
as the input of the next stage. There is no need to read back the output of one stage
only to write it again as input for the next kernel. This optimisation was applied to the
application. In addition to the CL_MEM_ALLOC_HOST_PTR flag, two other flags
exist for buffer creation. They are:
• CL_MEM_USE_HOST_PTR, which indicates that the OpenCL runtime should
store the buffer in memory referenced by the pointer used in creating it.
• CL_MEM_COPY_HOST_PTR, which indicates that the runtime should allocate memory and copy the contents of the pointer used in creating the buffer.
The specification does not mandate how implementations enforce these flags. Thus, rather counter-intuitively, the CL_MEM_COPY_HOST_PTR flag was found to give better performance on the Mali GPU.
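The zero-copy path described in this section looks roughly like the sketch below (illustrative names, no error handling): the buffer is allocated by the runtime with CL_MEM_ALLOC_HOST_PTR and the host accesses it through map/unmap rather than explicit read/write copies.

#include <CL/cl.h>

cl_mem make_zero_copy_buffer(cl_context ctx, cl_command_queue queue, size_t bytes)
{
    /* Let the runtime allocate the storage so map/unmap can avoid copies. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                bytes, NULL, NULL);

    /* Map for host writes, e.g. filling in a new depth frame ...          */
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, bytes, 0, NULL, NULL, NULL);
    /* ... populate p on the CPU here ...                                   */

    /* ... then unmap before any kernel touches the buffer. On a host-
       unified device such as the Mali no physical copy takes place.        */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;
}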
4.2 Kernel Optimisations
This section reports on optimising the KinectFusion kernel code with respect to the Mali GPU. At a high level, optimising kernel code involves using vectors for both arithmetic and memory operations; using appropriate data types; eliminating redundant computations; and taking advantage of the speed of built-in functions [1, 3, 5]. Additionally, using the right work group size (recommended to be a power of 2; if no preference exists, allowing the runtime to decide the optimum size is preferred [5]), avoiding barrier operations where possible, and maintaining a high ratio of arithmetic to memory operations all improve kernel efficiency.
Due to the SIMD nature of the shader cores on the Mali, use of explicit vector
instructions would result in the highest speedup. Based on analysis of the algorithms
implemented, the kernels in the application belong to one of the following three categories:
• Kernels that could be immediately vectorised. There was only one such kernel
and it did not contribute much to the total execution time.
• Kernels that could be reformulated to support use of vector instructions
• Kernels that contained little or no opportunities for vectorisation. Additionally,
due to the nature of the problems they solve, reformulation is not possible.
4.2.1 Unit Conversion Kernel
This is a very simple kernel that produces each output element by scaling the corresponding element in the input buffer. A scalar implementation creates a work item for
each element in the output depth frame. As previously stated, the project was carried
out with depth frames half the size of the input image. Each element in the output
frame is obtained from the element in the input frame with twice its coordinates. For
example, output item at (1,3) is computed from input item at (2,6).
The scalar implementation of the unit conversion is shown in listing 4.1. Each
work item first determines its position in the execution space. It uses this value to
access both the output and input depth frames. It should be pointed out that in addition
to being computationally inefficient, it also wastes memory accesses. This is because
only half of the items in each cache line will be used. The vector implementation loads
2n items from the input and writes n items to the output frame. The values allowed for
n in OpenCL are 2, 4 or 8. Listing 4.2 shows the implementation with n = 4. It uses
the convenience ‘even’ access specifier to refer to the correct index in the horizontal
dimension. The stride variable does the same for the vertical dimension. The Mali
GPU can operate on vectors of 128 bits in a single cycle. Larger vectors are processed
in multiple cycles. There is therefore a trade off in this kernel between the number of
items processed in a vector operation and the number of cycles required to carry out
the computation.
Fig 4.1 shows execution times for the possible combinations of load and store vector width as well as the scalar implementation. The read 8/store 4 and read 16/store 8 configurations provide the best performance.
This was the only kernel in the application with immediately vectorisable statements. Despite the speedup resulting from applying this optimisation, this kernel does
not contribute much to the overall execution time of the application.
Listing 4.1: Scalar implementation of the unit conversion kernel
__kernel void mm2metersKernel1( __global float * depth,
                                const uint2 depthSize,
                                const __global ushort * in,
                                const uint2 inSize )
{
    uint2 pixel = (uint2)(get_global_id(0), get_global_id(1));
    depth[pixel.x + depthSize.x * pixel.y] =
        in[pixel.x * 2 + inSize.x * pixel.y * 2] / 1000.0f;
}
Listing 4.2: Vector implementation of the unit conversion kernel. This version loads
eight items and stores the four items at positions 0, 2, 4, 6.
__kernel void mm2metersKernel1( __global float * depth,
                                const uint2 depthSize,
                                const __global ushort * in,
                                const uint2 inSize )
{
    uint2 pixel = (uint2)(get_global_id(0), get_global_id(1));
    const uint stride = depthSize.x / 4;
    ushort8 inVal = vload8(pixel.x + 2 * stride * pixel.y, in);
    vstore4(convert_float4(inVal.even) / 1000.0f,
            pixel.x + stride * pixel.y, depth);
}
Figure 4.1: Execution times of various configurations of the vectorised unit conversion kernel and the scalar implementation. The time units are in milliseconds.
4.2.2 Bilateral Filter Kernel
Bilateral filtering is a standard image enhancement process. Each pixel in the output
image is obtained by computing the weighted average of the intensity values of pixels
within a neighbourhood of the corresponding point in the input image [32]. A Gaussian
filter of radius 2 was used in this project and all optimisations applied took this filter
size into consideration. For every point in the input depth frame with a non zero value,
the output is computed using values located in a 5 × 5 square centered at that point. For
each point lying within the square having a non zero value, its contribution to the sum
is computed by multiplying the relevant Gaussian weights with the result of evaluating
a univariate Gaussian distribution having a predefined variance and mean equal to the
depth value at the center of the square. Observe that as with all convolutions, care
must be taken when processing pixels that lie on the edge of the image. This is to
avoid reading beyond the image boundary.
The original implementation of the bilateral filter kernel uses only scalar operations. Each point in the depth image is assigned to a work item. The kernel has a
doubly nested loop that averages the values located in the 5 × 5 square centered on
the pixel assigned to the work item. The square is traversed across the columns to ensure coalesced access to global memory. Given that this implementation targeted the
NVIDIA architecture with its warp-based thread scheduling, a first step towards optimising it was to evaluate whether coalesced memory access yields performance benefits on the Mali architecture (which uses a different thread execution method). The Gaussian weights are read by all the work items and remain constant throughout the application's execution. It is thus reasonable to place them in constant memory. Although the OpenCL specification does not mandate how this memory region is handled, some architectures like AMD [12] optimise accesses to it. No documentation exists about how constant memory is handled by the Mali runtime. Experiments were done to evaluate the effect of using it.
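As a sketch of what was compared (the kernel name is an assumption), the only change needed is the address space qualifier on the filter argument; the body is unchanged from the scalar kernel:

    __kernel void bilateral_constant(__global float *out,
                                     const __global float *in,
                                     __constant float *gaussian,   /* baseline: const __global float * */
                                     const float e_d,
                                     const int r)
    {
        /* body identical to the scalar bilateral filter kernel */
    }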
Fig 4.2 shows the result of executing the bilateral filter kernel for all possible combinations of constant/global memory and coalesced/non-coalesced access. The results show that coalescing memory access reduces performance. Additionally, using constant memory does not offer any optimisation. The best configuration (global memory for the Gaussian filter and non-coalesced access to the input image) had an average execution time of 4.673 ms, which represents a 6.6 % and 11.9 % improvement over the default and worst configurations, respectively. This configuration was used as the
baseline for measuring the effectiveness of subsequent optimisations investigated.
The need to detect and handle access to border pixels of the input depth image
prompted the use of the OpenCL Image data type in the kernel. This is because of the automatic edge clamping it provides. It also represented a chance to evaluate the effectiveness of using the Texture pipeline of the Mali GPU. While the automatic edge clamping simplified the kernel code, use of the Image data type resulted in an almost 4 % reduction in performance. This was in addition to the overhead involved in converting host buffers to OpenCL Images. This performance degradation motivated the evaluation of vectorisation as a means of optimising this kernel.
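For reference, a hedged sketch of the Image-based variant that was evaluated is shown below (the kernel and sampler names are assumptions, and the filter body is elided); the clamping sampler removes the explicit border checks:

    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE |
                          CLK_FILTER_NEAREST;

    __kernel void bilateral_image(__write_only image2d_t out,
                                  __read_only  image2d_t in)
    {
        const int2 pos = (int2)(get_global_id(0), get_global_id(1));
        /* reads outside the image are clamped to the nearest edge pixel */
        const float centre = read_imagef(in, smp, pos).x;
        /* ... weighted average over the 5 x 5 neighbourhood, no bounds checks needed ... */
        write_imagef(out, pos, (float4)(centre, 0.0f, 0.0f, 0.0f));
    }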
Due to the presence of branch statements in the kernel required to ensure that only
values greater than zero are processed, vectorising the kernel is not straightforward.
Figure 4.2: Execution times of various configurations of memory access pattern and memory region. All times are in milliseconds.
However, the fact that coalesced access to the input array does not result in improved
performance makes it possible to unroll the inner loop. The implication is that each
work item now traverses its given square along the rows. Loop unrolling requires that
the values required for the different iterations being merged be available at the same
time. This is accomplished using vector loads. Since there are five iterations of the
inner loop and OpenCL supports vector loads of 2, 3, 4, 8 and 16 elements, several
unroll factors are possible. They are:
• Unroll factor of 2, which requires two loop iterations and loads two elements at a time in each iteration. A scalar load is needed at the end to handle the last element.
• Full unrolling with two vector loads: first load 3 elements, then 2 elements.
• A single vector load of 4 elements and a scalar load to handle the last element.
• A single vector load of 8 or 16 elements, utilising only the first 5 elements.
In all cases, each point in the depth image is assigned to a single work item. Note
that this optimisation can only be applied by work items not responsible for border
pixels. The pixels on the boundary have to be processed using the scalar approach.
With image size of 320 × 240 and filter radius of 2, (320 − 2 × 2) × (240 − 2 × 2) =
316 × 236 points will be computed using the vectorised memory loads. Observe that
once loaded, accessing the individual items of the vector using the swizzle operator
incurs no additional hardware cost [14].
Fig 4.3 shows how effective each version of unrolling the inner loop is compared
with the best serial version (global non-coalesced memory access). As expected, the
reduction in the number of memory loads and the elimination of loop branching instructions obtained from loop unrolling resulted in improved performance of the kernels. The one exception was the version where the loop was unrolled by a factor of 2, which resulted in performance degradation. The highest speedup was obtained
by issuing one vector load of 4 elements and a single scalar load per row of the input
image. The items loaded were all 32-bit floating point numbers. This kernel thus loads 128 bits at a time, which matches the optimal vector size on the Mali [5]. An interesting aspect of the bilateral filter kernel is the amount of data sharing between work
items. Within the 5 × 5 square, two consecutive items share four elements per row.
This sharing is why the versions that load 8 and 16 items do not result in significant
performance reduction despite the fact that only a small fraction of items read from
memory is used in the computation of the output. Each Mali L1 cache line can hold
sixteen 4 byte floating point numbers. Thus, each line loaded by the first work item
in a work group may be reused by up to three adjacent threads. Listing 4.3 shows the
code for the version that loads four items at a time.
Listing 4.3: Bilateral filter kernel with inner loop fully unrolled. Four items are loaded at
a time.
__kernel void bilateral_filterKernel( __global float *out,
                                      const __global float *in,
                                      const __global float *gaussian,
                                      const float e_d,
                                      const int r)
{
    const uint2 pos  = (uint2)(get_global_id(0), get_global_id(1));
    const uint2 size = (uint2)(get_global_size(0), get_global_size(1));
    const float center = in[pos.x + size.x * pos.y];

    if (center == 0)
    {
        out[pos.x + size.x * pos.y] = 0;
        return;
    }

    float sum = 0.0f;
    float t = 0.0f;
    const float denom = 2 * e_d * e_d;

    if (pos.x >= 2 && pos.x < size.x - 2 && pos.y >= 2 && pos.y < size.y - 2) // bounds check
    {
        for (int i = -r; i <= r; i++)
        {
            const int offset = (i + pos.y) * size.x + pos.x - r;
            float4 row = vload4(0, in + offset);
            if (row.s0 > 0)
            {
                const float mod = sq(row.s0 - center);
                const float factor = gaussian[i + r] * gaussian[0] * exp(-mod / denom);
                t += factor * row.s0;
                sum += factor;
            }
            if (row.s1 > 0)
            {
                float factor = gaussian[i + r] * gaussian[1] * exp(-sq(row.s1 - center) / denom);
                t += factor * row.s1;
                sum += factor;
            }
            if (row.s2 > 0)
            {
                float factor = gaussian[i + r] * gaussian[2] * exp(-sq(row.s2 - center) / denom);
                t += factor * row.s2;
                sum += factor;
            }
            if (row.s3 > 0)
            {
                float factor = gaussian[i + r] * gaussian[3] * exp(-sq(row.s3 - center) / denom);
                t += factor * row.s3;
                sum += factor;
            }
            // Handle the last item
            const float pix = in[offset + 4];
            if (pix > 0)
            {
                const float mod = sq(pix - center);
                const float factor = gaussian[i + r] * gaussian[4] * exp(-mod / denom);
                t += factor * pix;
                sum += factor;
            }
        }
    }
    else // may read out of bounds, process as scalar
    {
        for (int i = -r; i <= r; ++i)
        {
            for (int j = -r; j <= r; ++j)
            {
                const uint2 curPos = (uint2)(clamp(pos.x + j, 0u, size.x - 1),
                                             clamp(pos.y + i, 0u, size.y - 1));
                const float curPix = in[curPos.x + curPos.y * size.x];
                if (curPix > 0)
                {
                    const float mod = sq(curPix - center);
                    const float factor = gaussian[i + r] * gaussian[j + r] * exp(-mod / denom);
                    t += factor * curPix;
                    sum += factor;
                }
            }
        }
    }
    out[pos.x + size.x * pos.y] = t / sum;
}
The next version explicitly exploits the sharing of elements among adjacent work
items. Observe that all values needed by four consecutive work items in each input
row lie in 8 contiguous locations in memory. This is illustrated in fig 4.4. The next
version combines inner loop unrolling with use of explicit vector arithmetic. It does
this by exploiting the sharing of elements observed in the previous version. Specifically, a single load of 8 elements is sufficient for computing the values required by
4 consecutive output elements. This observation makes it possible to have one work
item compute the values of four adjacent output elements. This technique is known as
thread coarsening. During each iteration, the 8 items required are loaded using a single
vector operation. Each element in the vector is then tested in turn. The ones having
values greater than zero are used to update the running sums for all output elements that require them.

Figure 4.3: Speedup of the various versions of the inner-loop-unrolled bilateral filter kernel relative to the best performing serial version

To support the use of explicit vector arithmetic, a 4-wide vector mask
is used to implement the sliding of the filter across the row of 8 items. The values in
the mask are turned on only for the output elements that depend on the current input
element being processed. At the end of the iterations, output elements are updated
only if the corresponding input values are non zero. The formulation of this version
potentially involves redundant computations depending on the distribution of values in
the input depth frame. However, since vector operations take one cycle to process all
items in a vector, this overhead is small compared with the reduction in memory operations. Five reads from the input depth image are now sufficient to compute four output elements, as against up to 100 accesses required by the scalar implementation. Note that the width of the border has now increased from 2 to 4 in the horizontal dimension. Thus, this version processes (320 − 2 × 4) × (240 − 2 × 2) = 312 × 236 points using the vectorised path.
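The fragment below is a hedged sketch of this masked accumulation step, not the thesis code; the names rowBuf, centres, gRow, t and sum, and the surrounding setup, are assumptions. For the row at vertical offset i, one vload8 supplies the 8 depths shared by the 4 outputs, and a per-output mask zeroes the contributions of inputs that fall outside a given output's 5-wide window:

    /* gRow is assumed to hold gaussian[i + r] for the current row */
    float rowBuf[8];
    vstore8(vload8(0, in + rowOffset), 0, rowBuf);   /* 8 inputs shared by 4 outputs */
    for (int j = 0; j < 8; j++)
    {
        const float pix = rowBuf[j];
        if (pix <= 0.0f)                             /* skip invalid depth values */
            continue;
        /* lane o is 1.0f only when input j lies inside output o's window */
        const float4 mask = (float4)(j <= 4 ? 1.0f : 0.0f,
                                     (j >= 1 && j <= 5) ? 1.0f : 0.0f,
                                     (j >= 2 && j <= 6) ? 1.0f : 0.0f,
                                     j >= 3 ? 1.0f : 0.0f);
        /* per-output spatial weight: offset of input j within each window */
        const float4 gw = (float4)(gaussian[clamp(j,     0, 4)],
                                   gaussian[clamp(j - 1, 0, 4)],
                                   gaussian[clamp(j - 2, 0, 4)],
                                   gaussian[clamp(j - 3, 0, 4)]);
        const float4 d = pix - centres;              /* centres: the 4 centre depths */
        const float4 factor = gRow * gw * exp(-(d * d) / denom);
        t   += mask * factor * pix;                  /* masked update of 4 running sums */
        sum += mask * factor;
    }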
This version took 2.74 ms to complete, representing a 30 % improvement over the
previous implementation.
As a final optimisation, tests were done to determine the most appropriate work
group size for the bilateral filter kernel. Work group size selection affects memory
usage efficiency [12, 14]. Fig 4.5 shows the result for several work group sizes for the
thread coarsened version. While there are some obviously inefficient configurations,
performance was fairly constant for the different work group sizes. An interesting
observation is that experimentally deciding the best work group size (4 × 16) resulted
in slightly better performance than that obtained by allowing the runtime to select the
optimum work group size. A size of 4 × 16 corresponds to only one work group being active per core.

Figure 4.4: Illustration of how, with a filter radius of 2, four consecutive elements can reuse a single vector load of 8 items

Figure 4.5: Effect of work group size on the execution time (in milliseconds) of the bilateral filter kernel
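A host-side sketch of the two launch configurations that were compared is shown below (the queue and kernel handles, and the exact global size, are assumed names and values); passing NULL for the local size delegates the choice to the runtime:

    size_t global[2] = {80, 240};   /* one work item per 4 output pixels (coarsened kernel) */
    size_t local[2]  = {4, 16};     /* experimentally best for this kernel */

    /* explicit 4 x 16 work groups */
    clEnqueueNDRangeKernel(queue, bilateralKernel, 2, NULL, global, local, 0, NULL, NULL);

    /* runtime-selected work group size */
    clEnqueueNDRangeKernel(queue, bilateralKernel, 2, NULL, global, NULL, 0, NULL, NULL);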
4.2.3 Volume Integration Kernel
As previously stated, this kernel requires every point in the 3-dimensional volume to
be visited and the value stored at each point updated if the point lies in the current
depth frame. Each point stores a running average (signed distance function SDF) that
represents the current estimate of the distance to the sensor centre.
To ensure scalability and maintain real time reconstruction rates, the KinectFusion
algorithm partitions the work in two dimensions instead of the 3D decomposition natural for the volume. Each thread then progressively visits each point along the third
dimension. The goal of this formulation is to ensure coalesced memory access. Given the observation that coalescing access to memory does not improve performance on the Mali, the first step in the attempt to optimise this kernel was to reformulate it to use a 3D iteration space: one work item per point in the volume and thus no looping. Performance reduced sharply: reconstruction was inaccurate and occurred at very low rates. This can be attributed to the large overhead required to create and schedule the hardware threads. However, the kernel could now be launched with 256 concurrent work items as against the 64 in the original version. This seemed to indicate that unrolling the loop would improve performance. However, the memory access pattern of the kernel and the thread scheduling policy of the Mali made unrolling impossible. This is because memory for the volume is laid out linearly and values used by work items in successive iterations are separated by a distance of N².
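The following is a hedged sketch of the 2D decomposition just described (the kernel name and signature are assumptions, and the actual SDF update is elided); each work item walks the volume along z, and consecutive work items touch consecutive addresses within a slice:

    __kernel void integrateSketch(__global short2 *volume,   /* (SDF, weight) pairs */
                                  const uint3 volDim)
    {
        const uint x = get_global_id(0);
        const uint y = get_global_id(1);
        for (uint z = 0; z < volDim.z; z++)
        {
            /* linear layout: successive z values are volDim.x * volDim.y elements apart */
            const uint idx = x + y * volDim.x + z * volDim.x * volDim.y;
            short2 sdf = volume[idx];
            /* ... project the point into the depth frame and, if visible,
                   update the running SDF average and its weight ... */
            volume[idx] = sdf;
        }
    }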
The kernel requires SPMD-style programming and contains no opportunities for vectorisation. Additionally, the absence of data sharing among work items prevented
the application of thread coarsening as an optimisation.
The next optimisation opportunity investigated was workgroup size selection. The
goal is to determine the work group dimension that would make the best use of the
L1 cache which is unique to each compute device. On each loop iteration, each work
item may read one floating point value from the filtered depth map. Each point in the volume is first converted into image coordinates. The lookup in the depth map is performed only if the resulting point lies within the depth map (the point is aligned with the current pose of the camera). The value read from the depth map is used to decide if the SDF value at that point in the global volume should be updated. The SDF value is made up of two short integers (accessed as a vector of size 2) representing the running average of the SDF and the weight assigned to it, respectively. Thus, in each iteration, every work item reads at most one 4-byte float from the depth map and two 2-byte shorts from the volume. Since the L1 cache lines on the Mali are 64 bytes long, each can hold 16 such items. To minimise the number of loads issued, cache lines should be shared as much as possible. The thread scheduling policy on the Mali suggests that L1 cache sharing would be increased by work group sizes that are multiples of 16 items in the horizontal index dimension. Experiments were conducted
to validate this hypothesis.
Fig 4.6 shows the result of executing this kernel using the possible work group
dimensions. Also shown is the result of leaving the work group size unspecified and
allowing the runtime to pick the optimum configuration. With the exception of a few
outliers, most of the configurations resulted in similar performance. Once again, it was possible to experimentally determine a better work group size than the one chosen by the runtime. Additionally, the results do not support the initial hypothesis that work group
sizes of multiples of 16 in the horizontal dimension would be optimum. The best
configuration was the 2 × 2 sized work group.
Figure 4.6: Execution times in milliseconds of the integration kernel for different work
group sizes
A 256 × 256 × 256 volume was used in this project. Thus, each work item executes
the loop body 256 times and may both load from and store to memory in each iteration.
The number of operations executed by each thread in each iteration depends on the
value stored at the current point being processed. There is a possibility of threads
diverging: executing different loop iterations. This divergence may lead to irregular
memory access causing performance degradation. To determine if thread divergence
was a problem, all threads were forced to execute a barrier before exiting each loop
iteration. This has the effect of forcing all threads to always execute the same loop
iteration. Performance did not change, leading to the conclusion that thread divergence
is not a problem with this kernel.
As part of determining if the SDF at a point in the volume should be updated, the
kernel uses the square root function to compute the normalised distance of the current
point from the origin. OpenCL defines three versions of this function with different
speed and precision: sqrt, native_sqrt and half_sqrt. The native_sqrt variant is supposed to give the best performance on the Mali [5]. Fig 4.7 shows the result of executing the kernel with these three versions. In all three cases, precision was enough to generate correct results. However, use of native_sqrt actually resulted in reduced performance.
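For reference, the three variants differ only in the precision guarantees the standard attaches to them; a minimal illustration (d2 stands for the squared distance already computed by the kernel):

    const float full_p   = sqrt(d2);        /* correctly rounded */
    const float half_p   = half_sqrt(d2);   /* reduced precision */
    const float native_p = native_sqrt(d2); /* implementation-defined precision */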
Figure 4.7: Execution times for different versions of the square root function used in the
Integration kernel
4.2.4 Optimisation of the Track and Reduce Kernels
These two kernels cooperatively determine the current camera pose. The track kernel
implements the ICP algorithm to identify points of correspondence between the current
depth frame and the previous frame. It uses the camera pose computed in the previous
iteration and the current vertex and normal maps. It also uses the raycasted vertex
and normal maps from the previous iteration if they exist. The output of tracking is
summed in parallel by the reduce kernel. Each work item in the track kernel groups
three consecutive elements in the vertex and normal maps and treats them as a single
input item. These points are loaded and stored together as vectors. Due to the number of conditional statements required to handle error conditions, this loading and storing
represents the only means of applying vector operations in this kernel.
A good option would have been to use the Image data type and access each group of three elements as a single CL_RGB object. However, OpenCL does not support the CL_RGB format for floating point numbers, which is the data type used to store both the vertex and normal maps. The alternative would have been to use the CL_RGBA format, but that would have resulted in an increased memory footprint. Additionally, the overhead of converting to and from the Image data type would have been significant.
Kernel code optimisations applied include reducing the amount of private memory
used by the work items by avoiding parameter duplication. For example, the vertex
and normal maps both had the same size. Hence, one size parameter was sufficient.
The output of the kernel is a composite type made up of a floating point error value, an
integer result flag and an array of 6 floating point numbers. The total size of this structure is 32 bytes, which is an integer multiple of the recommended 16 byte alignment size on the Mali [5]. Care was taken to ensure both the host and OpenCL compilers used the best alignment to represent this structure, thus improving memory loads; a simple approach was to declare the array before the scalar types, which improved performance by almost 3 %.
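A hedged sketch of such a layout is shown below; the field names are assumptions, but the sizes match the description above, and placing the six-float array first follows the reordering just mentioned:

    typedef struct
    {
        float J[6];    /* Jacobian row (24 bytes) */
        float error;   /* point-to-plane error */
        int   result;  /* status flag; total size 32 bytes */
    } TrackData;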
The largest improvement on the original version was obtained by modifying how
the output is stored. The original version required three stores to save the output.
The code snippet in Listing 4.4 shows how the output is stored. Both halves of the array are first written using vector operations that store three items at once. This is followed by the saving of the output. The vector stores into the array are clearly redundant and can be eliminated. Listing 4.5 shows the improved code to store the output. Speedup results from the fact that the instructions that set the elements of the array access only registers (so long as the compiler has not spilled a variable to memory). This simple modification
resulted in a 20 % speedup of the kernel.
Listing 4.4: Code snippet showing redundant stores in the initial version of the track
kernel
vstore3(referenceNormal, 0, (float *) row.J);                          // write the first three elements of the J array
vstore3(cross(projectedVertex, referenceNormal), 1, (float *) row.J);  // write the remaining three elements
output[pixel.x + outputSize.x * pixel.y] = row;                        // save the output
Listing 4.5: Improved version of the track kernel with redundant stores eliminated
row.J[0] = referenceNormal.x;
row.J[1] = referenceNormal.y;
row.J[2] = referenceNormal.z;
const float3 res = cross(projectedVertex, referenceNormal);
row.J[3] = res.x;
row.J[4] = res.y;
row.J[5] = res.z;
output[pixel.x + outputSize.x * pixel.y] = row;   // save the output
The reduce kernel contained the least opportunities for optimisation as it had to be
executed in a specific manner and with fixed work group size to ensure correctness.
The output of the tracking phase is shared among the work items in the different work
groups. Each item computes the partial reduction (using addition) of one or more rows
and stores the partial sums in private memory. The partial sums are then copied over to
local memory which is unique to each work group. Each work item in the work group
then executes a barrier to ensure memory consistency. The first 32 items in each work group then aggregate the partial sums and copy the final output to global memory. As
part of the local reduction, each work item does 22 multiply and accumulate operations
in each loop iteration. As an optimisation attempt, these operations were replaced with
faster but less precise hardware MAD (multiply-add) and FMA (fused multiply-add)
instructions. However, in both cases, the loss of precision resulted in incorrect results.
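For illustration only, the local-memory/barrier structure described above follows the standard work-group reduction pattern sketched below; this is a generic example, not the thesis reduce kernel, which folds multi-element rows rather than single floats:

    __kernel void wgSum(__global const float *in,
                        __global float *out,
                        __local  float *scratch)
    {
        const uint lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];        /* each item publishes its partial sum */
        barrier(CLK_LOCAL_MEM_FENCE);               /* ensure all partial sums are visible */
        for (uint s = get_local_size(0) / 2; s > 0; s >>= 1)
        {
            if (lid < s)
                scratch[lid] += scratch[lid + s];   /* tree reduction in local memory */
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];      /* one result per work group */
    }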
Despite the limited opportunities for optimisation, speedup was accomplished by
reusing the output buffer created by the track kernel as the input of the reduce kernel,
thus eliminating some buffer management overhead. The pair of track and reduce kernels is called a fixed number of times per pyramid level per iteration of the algorithm.
The image dimensions used in the tests and the three pyramid levels resulted in these
kernels being invoked 19,278 times. Thus, the reduction in overhead is significant.
4.3 Execution Optimisations
Section 3.1.2 highlighted the fact that the CPU remains idle during most stages in
each iteration of the algorithm. A straightforward way of utilising the CPU and GPU
concurrently is to begin processing the initial stages of the next iteration once the
buffers become safe for reuse. This is analogous to how processors hide the latency of
memory operations by scheduling other instructions in their shadow. The earliest point
where the next iteration can be safely started is during raycasting. All operations up to
but not including tracking can be carried out. The next operation that lasts long enough
to accommodate scheduling of computation concurrently is the input rendering kernel.
To ensure sustained performance, the GPU should be able to start working as soon
as it completes the previous iteration without needing to wait for the CPU. This requires
finely balancing the CPU execution times of the initial stages of the next iteration. This
approach to utilising both processors is challenging to implement because it requires use of the OpenCL callback mechanism, which is not very robust (for example, the behaviour of enqueueing blocking commands from a callback is undefined). Additionally, the host code has to
maintain state information about how far along in the next iteration the program has
gone to avoid work duplication.
A different approach is to have the CPU and GPU work on the same task concurrently. One piece of work that can be naturally partitioned is the integration kernel.
The volume can be split along the z dimension and assigned to the processors. The
higher degree of parallelism of the GPU dictates that it should do a larger fraction of
the volume. To improve performance, the integration kernel should be split into two.
The first kernel handles the coordinate conversion for the points on the first slice (z =
0) of the volume. The second kernel uses this result and the position and camera deltas
to determine the correct values for the current point in the volume. The output of the
first kernel is written to CPU-accessible memory and also kept in the buffer for use by the GPU in the second kernel. A copy of the input depth map also has to be made so both processors can use it simultaneously. This formulation avoids the need to track state and does not depend on the fragile OpenCL callback mechanism.
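A hedged host-side sketch of this co-operative split is given below; all names (integrateKernel, zSlicesArgIndex, integrateSliceOnCpu, volumeHostPtr, depthCopy and the volume dimensions) are assumptions, and the GPU kernel is assumed to take the number of z slices it should process as an argument:

    const unsigned zSplit = (volDimZ * 3) / 4;    /* GPU takes the larger share of slices */
    clSetKernelArg(integrateKernel, zSlicesArgIndex, sizeof(zSplit), &zSplit);

    size_t global[2] = {volDimX, volDimY};
    clEnqueueNDRangeKernel(queue, integrateKernel, 2, NULL, global, NULL, 0, NULL, NULL);

    #pragma omp parallel for                      /* CPU integrates its slices concurrently */
    for (unsigned z = zSplit; z < volDimZ; ++z)
        integrateSliceOnCpu(volumeHostPtr, depthCopy, z);

    clFinish(queue);                              /* both halves complete before raycasting */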
Chapter 5
Overall Evaluation
The previous chapter discussed the optimisations investigated and presented the effect
of applying them to individual kernels. The overall effect of the optimisations on the
application is discussed in this chapter. Due to issues with rendering observed in the
use of the most optimised version of the bilateral filter kernel, it was not used in the
comparison. The version that unrolls the inner loop by loading a vector of length 4 was
used.
To quantitatively establish the speedup obtained, the final implementation was
compared with the initial unoptimised OpenCL version as well as with a CPU only
implementation. The CPU implementation used OpenMP to facilitate concurrent processing. To capture the overall effect of computations as well as memory accesses,
wall clock time was used in the evaluation. The results are shown in fig 5.1. Using
total elapsed time as a metric, both OpenCL versions perform better than the CPU only
version. The optimisations applied resulted in speedups of 1.5 and 3.3 compared with
the original OpenCL implementation and the CPU version respectively.
In comparison with the CPU only version, the OpenCL versions were better able to
exploit the high degree of data parallelism contained in the algorithm. Reconstruction
algorithms are evaluated in terms of frames processed per second (frame rate). The
KinectFusion algorithm has a frame rate of 30 fps on capable desktop GPUs. However,
this value could not be matched on the Mali GPU despite the optimisations applied.
As previously stated, the existence of conditional statements and the resulting SPMD
nature of most kernels greatly reduced the opportunities available for vectorisation.
Attempts at restructuring the computations to enable use of vector operations were not
always successful.
Figure 5.1: Comparison of execution times (in seconds) of three versions of the KinectFusion algorithm

The effect of conditional statements was most noticeable in the kernels that performed camera tracking and integration of the depth frame into the global volume. The
existence of a program counter per hardware thread helped offset the loss of efficiency
resulting from inability to use vector instructions by eliminating the work duplication
that would have occurred in warp/wavefront based systems. The full caching of data
used by the integration kernel also reduced the effect of possible thread divergence resulting from input data distribution. Nevertheless, the reduced core count on the Mali
ultimately dominated the application’s performance.
As with prior work on mobile visual computing, this project investigated alternative formulations of the algorithm. Focus was primarily on how several iterations can
be overlapped. It may also have been worthwhile to consider possibly less precise but
faster data representations. Another direction that may have yielded positive results is the use of an alternative layout of data items in memory. For example, the 3D data structure used to represent the global scene is laid out linearly in contiguous locations in
memory. A different memory allocation scheme (possibly striping) could have resulted
in improved performance.
5.1 Energy Considerations
While actual power consumption measurements were not obtained, it is reasonable to
expect that the optimisations applied would result in a more efficient solution. This is
because of the reduction in the number of memory accesses and clock cycles needed
to process all input depth frames. While it is true that applying either of the proposed
methods of cooperatively utilising the CPU and GPU will result in an increase in peak
power, the consequent speedup will ensure an overall improvement in energy efficiency.
Chapter 6
Conclusions
This project investigated the extent to which the heterogeneous computing facilities
present in current embedded systems make them capable of executing computationally
intensive visual SLAM applications. The KinectFusion algorithm was used as a case
study and its behaviour was evaluated on an ARM CPU + GPU system.
Several optimisation opportunities were investigated with varying degrees of success. Vectorisation, either via explicit SIMD instructions or through vector memory
access, yielded the greatest benefits. The nature of the algorithm, however, greatly limited the extent to which vectorisation could be applied. Also considered were means
of utilising both the CPU and GPU concurrently. The original algorithm uses the GPU
exclusively for most of the computations.
Experimental results show that exploiting the heterogeneous resources and the architecture of the development system resulted in speedup. The final implementation
was about 1.5 times faster than the original unoptimised version. It was also about 3.3 times
faster than a CPU only implementation. However, the reconstruction rate achieved was
still much below that obtained on a desktop GPU.
A direct extension to this project would be to take more advantage of the unified
memory architecture of the Mali GPU by eliminating all forms of data copying. This
would involve a reimplementation of the buffer management routines. Alternative
methods of using the CPU and GPU concurrently should also be explored.
Bibliography
[1] AMD Accelerated Parallel Processing OpenCL Programming Guide. http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf. Accessed on August 1, 2014.

[2] NVIDIA CUDA. https://developer.nvidia.com/cuda-zone. Accessed on April 4, 2014.

[3] NVIDIA OpenCL Best Practices Guide. http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf. Accessed on August 1, 2014.

[4] ARM. The Mali-T604 product page. http://www.arm.com/products/multimedia/mali-graphics-plus-gpu-compute/mali-t604.php. Accessed on April 4, 2014.

[5] ARM. Mali-T600 Series GPU OpenCL Developer Guide, Version 2.0. http://infocenter.arm.com/help/topic/com.arm.doc.dui0538f/DUI0538F_mali_t600_opencl_dg.pdf. Accessed on April 4, 2014.

[6] Kwang-Ting Cheng and Yi-Chu Wang. Using mobile GPU for general-purpose computing - a case study of face recognition on smartphones. In VLSI Design, Automation and Test (VLSI-DAT), 2011 International Symposium on, pages 1–4, April 2011.

[7] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 303–312, New York, NY, USA, 1996. ACM.

[8] Daniel Castaño Díez, Hannes Mueller, and Achilleas S. Frangakis. Implementation and performance evaluation of reconstruction algorithms on graphics processors. Journal of Structural Biology, 157(1):288–295, 2007. Software tools for macromolecular microscopy.

[9] Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, and Jack Dongarra. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing, 38(8):391–407, 2012. Application Accelerators in HPC.

[10] Engineering and Physical Sciences Research Council (EPSRC). PAMELA: a panoramic approach to the many-core landscape - from end-user to end-device: a holistic game-changing approach. http://gow.epsrc.ac.uk/NGBOViewGrant.aspx?GrantRef=EP/K008730/1. Accessed on April 4, 2014.

[11] Jianbin Fang, A. L. Varbanescu, and H. Sips. A comprehensive performance comparison of CUDA and OpenCL. In Parallel Processing (ICPP), 2011 International Conference on, pages 216–225, Sept 2011.

[12] Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, and Dana Schaa. Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition. Newnes, 2012.

[13] Gerhard Reitmayr. Open Source Implementation of KinectFusion Using CUDA. https://github.com/GerhardR/kfusion.

[14] Johan Gronqvist and Anton Lokhmotov. Optimising OpenCL kernels for the ARM Mali-T600 GPUs. 2014.

[15] P. Harris. The Mali GPU: An Abstract Machine, Part 3 - The Shader Core. http://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core. Accessed on August 1, 2014.

[16] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST '11, pages 559–568, New York, NY, USA, 2011. ACM.

[17] Ashfaq A. Khokhar, Viktor K. Prasanna, Muhammad E. Shaaban, and Cho-Li Wang. Heterogeneous computing: Challenges and opportunities. IEEE Computer, 26(6):18–27, 1993.

[18] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pages 225–234, Nov 2007.

[19] G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In Mixed and Augmented Reality, 2009. ISMAR 2009. 8th IEEE International Symposium on, pages 83–86, Oct 2009.

[20] Seung Eun Lee, Yong Zhang, Zhen Fang, S. Srinivasan, R. Iyer, and D. Newell. Accelerating mobile augmented reality on a handheld platform. In Computer Design, 2009. ICCD 2009. IEEE International Conference on, pages 419–426, Oct 2009.

[21] J. Leskela, J. Nikula, and M. Salmela. OpenCL embedded profile prototype in mobile device. In Signal Processing Systems, 2009. SiPS 2009. IEEE Workshop on, pages 279–284, Oct 2009.

[22] W. J. MacLean. An evaluation of the suitability of FPGAs for embedded vision systems. In Computer Vision and Pattern Recognition - Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on, pages 131–131, June 2005.

[23] Aaftab Munshi. The OpenCL Specification, version 2.0. http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf, Mar 2014. Accessed on April 4, 2014.

[24] Aaftab Munshi, Benedict Gaster, Timothy G. Mattson, and Dan Ginsburg. OpenCL Programming Guide. Pearson Education, 2011.

[25] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, pages 127–136, Oct 2011.

[26] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, May 2008.

[27] S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm. In 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, pages 145–152, 2001.

[28] Olaf Schenk, Matthias Christen, and Helmar Burkhart. Algorithmic performance studies on graphics processing units. Journal of Parallel and Distributed Computing, 68(10):1360–1369, 2008. General-Purpose Processing using Graphics Processing Units.

[29] Nitin Singhal, Jin Woo Yoo, Ho Yeol Choi, and In Kyu Park. Implementation and optimization of image processing algorithms on embedded GPU. IEICE Transactions on Information and Systems, 95(5):1475–1484, 2012.

[30] Google Advanced Technology and Projects (ATAP). Project Tango. https://www.google.com/atap/projecttango/. Accessed on April 4, 2014.

[31] S. Thrun. Robotic mapping: A survey. In G. Lakemeyer and B. Nebel, editors, Exploring Artificial Intelligence in the New Millennium. Morgan Kaufmann, 2002.

[32] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Computer Vision, 1998. Sixth International Conference on, pages 839–846, Jan 1998.

[33] Daniel Wagner and D. Schmalstieg. Making augmented reality practical on mobile phones, part 2. Computer Graphics and Applications, IEEE, 29(4):6–9, July 2009.

[34] Guohui Wang, B. Rister, and J. R. Cavallaro. Workload analysis and efficient OpenCL-based implementation of SIFT algorithm on a smartphone. In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pages 759–762, Dec 2013.

[35] Guohui Wang, Yingen Xiong, J. Yun, and J. R. Cavallaro. Accelerating computer vision algorithms using OpenCL framework on the mobile GPU - a case study. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 2629–2633, May 2013.