Performance Evaluation of an OpenCL-based Visual SLAM Application on an Embedded Device

Olise-Emeka Charles Okpala

Master of Science
Computer Science
School of Informatics
University of Edinburgh
2014

Abstract

The use of GPUs as accelerators for general purpose applications is standard practice in high performance computing. Their highly parallel architecture and very high memory bandwidth are two key reasons for their wide adoption. Embedded devices such as mobile phones and tablets now come equipped with GPUs as well as other special purpose accelerators. As with their desktop counterparts, mobile GPUs have started to attract attention as platforms for executing computationally intensive applications. Executing such applications has traditionally been impractical on mobile devices because of their low processing power, a consequence of energy and thermal constraints.

Owing to the difference in their target operating environments, mobile and desktop GPUs have very different architectures, differing in, among other things, the number of processing cores and the clock speed. While desktop GPUs typically possess thousands of high frequency cores, mobile devices have significantly fewer compute cores operating at much lower frequencies. These and other limitations of embedded GPUs make it necessary to evaluate candidate computationally intensive applications on the mobile devices themselves. Research to date has focused on image processing and augmented reality applications, motivated by today's use of mobile devices for media consumption. One area attracting research interest is real time 3D scene reconstruction using mobile phones.

This project used the OpenCL programming standard to evaluate the performance of a representative application (KinectFusion) on the ARM Mali GPU. Optimisation opportunities investigated include better utilisation of the device's memory hierarchy and the use of vector instructions to exploit the SIMD architecture of the GPU. Means of using the CPU and GPU cooperatively were also considered. Optimisation efforts were hindered by the way computations are expressed in the algorithm; where possible, restructuring the implementation helped to overcome these challenges. The resulting version of the project achieved speedups of 1.5x over the initial implementation and 3.3x over a CPU-only version.

Acknowledgements

I would like to express my gratitude to my supervisor, Mike O'Boyle, for his guidance and assistance in the course of carrying out this project. Special thanks to my family and friends for all the encouragement. I am indebted to the Nigeria LNG Limited for graciously sponsoring my MSc programme.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Olise-Emeka Charles Okpala)

Table of Contents

1 Introduction
  1.1 Project Contributions
  1.2 Outline
2 Background
  2.1 Review of Heterogeneous Computing
  2.2 OpenCL Framework
    2.2.1 Platform Model
    2.2.2 Execution Model
    2.2.3 Memory Model
    2.2.4 Programming Model
  2.3 The Mali T600 GPU Series
  2.4 Visual SLAM and KinectFusion
  2.5 Related Work
    2.5.1 Visual Computing Acceleration
    2.5.2 OpenCL-based Embedded Vision Systems
3 Application Analysis
  3.1 Algorithm Overview
    3.1.1 Dependence Analysis
    3.1.2 Task Assignment
  3.2 Kernels Development and Profiling
4 Program Optimisations
  4.1 General Optimisations
    4.1.1 Reduction of OpenCL Runtime Overheads
    4.1.2 Memory Optimisations
  4.2 Kernel Optimisations
    4.2.1 Unit Conversion Kernel
    4.2.2 Bilateral Filter Kernel
    4.2.3 Volume Integration Kernel
    4.2.4 Optimisation of the Track and Reduce Kernels
  4.3 Execution Optimisations
5 Overall Evaluation
  5.1 Energy Considerations
6 Conclusions
Bibliography

Chapter 1
Introduction

The last few years have seen a remarkable increase in the level of sophistication of mobile processors. Advances in hardware design from the desktop/server computing space have been successfully translated to embedded systems such as mobile phones and tablets. Today's embedded devices feature multi-core CPUs, Graphics Processing Units (GPUs) and high speed DRAM. The result is that, in addition to traditional telephony, smartphones are now used to both generate and consume media rich content; applications include social networking, spreadsheets, music/video playback and mobile gaming.

As has been done successfully in the desktop/server computing space, developers are beginning to use the heterogeneous resources of mobile devices (notably the CPU and GPU) in a coordinated manner to execute computationally intensive applications. However, two factors make this challenging. Firstly, embedded devices have lacked support for general purpose heterogeneous programming languages like CUDA [2], requiring solutions to be recast as graphics programs. This has limited the range of general purpose applications developed. The recent development of the Open Computing Language (OpenCL) and its adoption by mobile device manufacturers mean that true general purpose heterogeneous computing is now possible.

An even more challenging issue pertains to how embedded devices are designed. Specifically, power constraints limit the number of compute cores available. For example, while thousands of cores are common in desktop GPUs, embedded devices typically have no more than eight cores. Another consequence of the power limitation is that clock frequencies are much lower in embedded systems. Additionally, mobile GPUs are usually integrated with the CPU on a single chip and share the same memory.
They therefore lack the dedicated high bandwidth memory that makes desktop GPUs suitable for throughput computing [33, 34].

These constraints mean that direct ports of desktop implementations of computationally intensive algorithms are not guaranteed to work well on embedded devices, since mobile architectures are usually unable to take advantage of optimisations that work on desktop systems [34]. Several researchers have evaluated the performance of image processing and basic computer vision algorithms on mobile devices. Driven in part by advances in mobile camera technology, one class of vision applications attracting interest is the use of a mobile device to generate geometrically accurate 3D models of a scene in real time as the camera scans it. Examples include Google's Project Tango [30] and EPSRC PAMELA [10]. These projects have the potential to provide more immersive augmented reality experiences, better navigation assistance for the disabled, and so on.

This project assesses the performance of a representative 3D reconstruction algorithm on embedded devices and explores opportunities for application optimisation. The algorithm chosen is KinectFusion [25], and an OpenCL based implementation was developed for the ARM Mali T604 mobile GPU.

1.1 Project Contributions

Previous OpenCL based evaluation work on mobile and embedded devices has focused on algorithms that process a single static image. This project extends these prior research efforts by considering an application that processes a stream of images under real time constraints. Such applications place much greater demands on the limited computing resources available on mobile devices.

The KinectFusion application and the algorithms it uses were developed specifically for execution on massively parallel GPUs. The limited processing power of embedded devices motivates the consideration of different reformulations of the algorithm. This project presents a method of interleaving different iterations of the algorithm to use both the CPU and GPU concurrently.

1.2 Outline

Chapter 2 provides background information relevant to this project, including a discussion of related work.

Chapter 3 analyses the KinectFusion algorithm and provides a motivation for the OpenCL implementation. Issues considered include parallelism identification and task partitioning between the CPU and GPU. The results obtained from profiling the initial implementation are discussed.

Chapter 4 presents the optimisations applied. Code, memory and execution optimisations are evaluated.

Chapter 5 conducts an evaluation of the final version of the algorithm.

Chapter 6 concludes this dissertation. A review of the profitability of the optimisations is presented, as well as recommendations for future work.

Chapter 2
Background

This chapter discusses concepts relevant to this project. A brief overview of heterogeneous computing is presented, followed by coverage of the OpenCL standard and its implementation on the Mali GPU. An evaluation of related research concludes the chapter.

2.1 Review of Heterogeneous Computing

Heterogeneous computing as a problem solving approach has existed in various forms. An early use was in supercomputing environments composed of different classes of machines connected via a network.
In this context, heterogeneous computing is defined as 'the well-orchestrated and coordinated effective use of a suite of diverse high-performance machines (including parallel machines) to provide superspeed processing for computationally demanding tasks with diverse computing needs' [17]. The objective is to efficiently solve problems possessing different types of embedded parallelism by using the machines most suitable for each kind of parallelism.

The first manifestation of heterogeneity in desktop computers was the use of specialised floating point coprocessors to provide or accelerate arithmetic computations. Other dedicated accelerators include IO processors and graphics processing units (GPUs). Using these specialised processors allowed the CPU's performance on general purpose code to improve.

Processor manufacturers have traditionally increased performance by exploiting advances in manufacturing technology to raise CPU clock frequencies, keeping power utilisation low by reducing the chip's operating voltage. Recently, it became impossible to reduce the voltage further and still obtain correct operation (as it would no longer be possible to reliably distinguish between the 1 and 0 voltage levels). Thus, further increases in operating frequency would result in greater power utilisation, with the attendant thermal management challenges [12]. The solution adopted has been to rely on many low frequency CPU cores to deliver improved performance and to exploit the heterogeneous nature of today's machines.

The GPU, while initially designed exclusively for rendering images, has always possessed characteristics that make it attractive for general purpose computing. These attributes include its highly parallel nature (thousands of cores) and very high speed graphics memory. Two factors that prevented the use of GPUs for general purpose parallel processing were the lack of support for floating point arithmetic and the fact that they could only be programmed using graphics paradigms. The inclusion of floating point support and the development of languages like CUDA [2] meant that GPUs could be used for general purpose computing [26]. Several researchers have demonstrated the performance benefits that result from general purpose GPU (GPGPU) computing [28, 8]. In many respects, heterogeneous computing today is used almost exclusively to refer to the CPU + GPU combination. In addition to the GPU, other specialised hardware accelerators like DSPs and FPGAs exist in modern computer systems and have been considered for use in general purpose parallel programming [22].

Proper exploitation of these increased computing capabilities requires the use of well defined programming languages and standards. The next section presents OpenCL, a framework for explicitly programming heterogeneous computing systems.

2.2 OpenCL Framework

OpenCL [23] is a standard for cross-platform heterogeneous systems development. Applications written in OpenCL can run on general purpose processors like CPUs as well as specialised processors like GPUs, FPGAs and DSPs. The OpenCL framework is composed of a programming language, a runtime system and an API. The specification defines a core set of APIs that all implementers must support, in addition to providing a facility for vendor specific extensions. OpenCL programs are written using a restricted version of the C language, augmented to support parallel programming. Parallel execution is expressed using kernels, which are similar to functions in the C language.
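For illustration, a minimal kernel (not taken from the project's code; the kernel and argument names are hypothetical) that scales every element of a buffer could look like this:

    // Hypothetical minimal kernel: each work item scales one element of the input.
    __kernel void scale(__global const float *in,
                        __global float *out,
                        const float factor)
    {
        size_t i = get_global_id(0);   // this work item's index in the NDRange
        out[i] = in[i] * factor;
    }

One instance of this function runs for every point of the index space chosen by the host, which is how OpenCL expresses data parallel execution.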
A consequence of OpenCL's goal of application portability is that its hardware abstractions are low level. A programmer is required to dynamically determine the types and capabilities of the accelerators available in a given system and select the ones most suitable for the tasks at hand. The kernels are then compiled for the specific device families discovered. The usual way of achieving this is to compile the kernels online (at application startup). Offline compilation is also possible, with precompiled binaries being scheduled for execution on the devices; however, this restricts the application to specific vendors' implementations of OpenCL.

The OpenCL specification uses a set of abstract models to describe the architecture of heterogeneous devices and application execution. These models are as follows:

2.2.1 Platform Model

This is an abstract description of a heterogeneous system at the hardware level. The model is made up of a single host and one or more devices. The host is responsible for coordinating program execution. It interfaces with the environment external to the OpenCL program to perform tasks such as IO and interacting with users. The host is a general purpose processor such as a CPU. The devices provide acceleration for OpenCL code. A device is hierarchically divided into independent compute units that are themselves made up of processing elements. OpenCL kernels are executed on the processing elements. Examples of OpenCL devices include GPUs, DSPs and CPUs (either whole processors or individual cores in a multi-core chip). A concrete implementation of this model contains devices from a single vendor and maps the model to the vendor-specific hardware architecture. Fig 2.1 presents a description of this model.

Figure 2.1: OpenCL platform model [23]

2.2.2 Execution Model

OpenCL applications are composed of two distinct parts: kernels that execute on devices and a host program. The host program is responsible for setting up the environment as well as initiating the execution of kernels. The execution model defines the interaction between the host and the devices as well as how kernels are executed on the devices. The responsibilities of the host program are described in terms of the context and command queues.

A context defines the environment within which kernels execute. It is set up and managed by the host program. In addition to the kernels, a context comprises the following:

• A list of devices on which kernels will be executed. All the devices in a given context must be from the same platform, i.e. the same hardware vendor. A context can be created for a specific class of devices (e.g. GPUs) or for all devices in a system.

• Program Objects encapsulating the source code and executables that define the kernels. As previously mentioned, OpenCL kernels are usually built as part of application start up. This is done by the host using the compiler provided by the platform for which the context has been created.

• Memory Objects, which are data structures accessible to devices and transformed by instances of kernel execution. The host program creates these memory objects to serve as either inputs or outputs to the kernels. There are two kinds of memory objects in OpenCL - Buffers and Images. Depending on how memory objects are defined and the specifics of the runtime, explicit data movement may be required between host and device memory.

Host-device interaction occurs via command queues.
The host program submits commands for execution on a device. Note that a command queue can be attached to only one device. OpenCL commands are used for scheduling kernel execution, managing the transfer of memory objects to and from devices, and constraining the order in which kernels are executed. Commands placed in a queue are executed in FIFO order by default. Queues can be configured to behave in out-of-order mode; in this case, the programmer is responsible for explicitly managing dependencies among commands in the queue using events.

2.2.2.1 Semantics of OpenCL Parallel Execution

Each kernel enqueue command results in the creation of independent parallel threads of execution by the OpenCL runtime. Each instance of the kernel is referred to as a work item. The number of work items created is determined by the size of an integer index space, known as an NDRange, specified by the programmer. A maximum of three dimensions is supported for the index space, and one work item is created for each point in it. Work items are clustered into equally sized work groups. Work groups are the unit of concurrent execution in OpenCL and provide the only means of synchronising the activities of work items. All work items in the same work group execute concurrently on the same compute unit of an OpenCL device and share the resources of that unit. It should be noted that the OpenCL specification does not guarantee parallel execution of work items within a work group: an implementation may schedule a work group in smaller batches or serialise the work items, so long as the semantics of concurrent execution are preserved [23, 24]. However, the existence of more than one compute unit in a device means that work groups can be run in parallel.

2.2.3 Memory Model

This model specifies the components and organisation of memory in heterogeneous systems. It also defines the consistency model supported by OpenCL. As previously mentioned, OpenCL contains two kinds of memory objects: Buffers and Images. Buffers are contiguous blocks of memory to which any kind of data structure can be mapped and are accessible via pointers; they are comparable to arrays in the C language. Images are specialised data types holding 1, 2 or 3 dimensional graphics images, designed to take advantage of the texture hardware of GPUs. Images are therefore opaque objects with respect to the programmer and can only be manipulated via OpenCL calls; access to specific parts of an Image is not allowed. Images are useful in situations requiring edge clamping and interpolation, as they provide these services automatically in an optimised manner.

OpenCL memory is divided into two parts: Host Memory, directly available to the host processor and managed outside of OpenCL using regular OS facilities, and Device Memory, accessible to kernels running on OpenCL devices. Device memory is partitioned into four address spaces as follows:

Global Memory. This region is available to both the host and devices. All work items from all available devices have read-write access to objects in this address space. This memory region has the highest access latency. Caching of reads and writes to this region may be supported, depending on the capabilities of the device.

Constant Memory. This is a subset of global memory used to store read only items accessed by all work items simultaneously. Objects in this region are placed by the host processor.
Depending on the capabilities of the device and the OpenCL runtime, the compiler generates specialised instructions to optimise access to items in this address space; an example optimisation might be issuing only one load instruction just before the kernel is executed.

Local Memory. This address space is shared by the work items in a work group. It is unique to each device and, depending on the capabilities of the device, will be implemented using on-chip memory. It usually provides access latencies much lower than global memory.

Private Memory. This address space is unique to individual work items and is the default scope of kernel variables (excluding pointers). It is normally implemented using registers and thus may provide the lowest access latency.

Fig. 2.2 gives a visual illustration of how these address spaces are mapped between a host and a single device.

Figure 2.2: Memory regions available to OpenCL kernels [24]

OpenCL defines a relaxed consistency model for memory objects. Ordering guarantees depend on the address space being accessed. For local memory shared by items in a work group, consistency is guaranteed only at synchronisation points within the kernel (using OpenCL barriers). Global memory consistency is not defined between work groups in the same kernel; the runtime only guarantees synchronisation between commands in the queue (i.e. at the end of one kernel and the start of the next).

2.2.4 Programming Model

This model defines how parallel algorithms are formulated using OpenCL. Two programming styles are supported by the standard:

Data Parallelism, where the problem is decomposed according to the data to be processed and independent partitions are handled in parallel. This style of programming directly aligns with OpenCL's execution model, with each point in the NDRange independently processing data elements. Depending on the structure of the kernel, computation may proceed in Single Instruction Multiple Data (SIMD) or Single Program Multiple Data (SPMD) fashion. Different levels of data parallelism are supported: work items in a work group may process data items in parallel, and a single work item may use explicit SIMD instructions to manipulate multiple items at a time.

Task Parallelism, which involves partitioning a problem into functional modules that can be executed in parallel. The straightforward way of expressing task parallelism in OpenCL is the use of an out-of-order command queue; in this case multiple kernels can be in execution at the same time on different compute devices.

It is possible to combine both forms of parallelism in a single application. For example, tasks in a task-parallel algorithm can contain data parallel instructions. These models ensure portability of OpenCL programs, as all implementations are required to conform to them. While code portability is guaranteed, performance is not, as it depends on architecture and device family specific optimisations [9, 11]. In order to address the wide differences in architecture and capabilities of devices, the OpenCL specification defines two levels of conformance required of implementations: the full and embedded profiles. The embedded profile relaxes requirements on floating point arithmetic, the availability of data structures like 3D Images, the mathematical accuracy of functions, and so on. This profile is meant for embedded devices with more stringent memory constraints [24]. The current version of the standard is OpenCL 2.0.
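To tie the platform, execution and memory models together, the following condensed host-side sketch shows a typical OpenCL 1.1 call sequence: discover a device, create a context and an in-order command queue, build a program online, create buffer objects and launch a kernel (such as the scale kernel sketched earlier) over a one-dimensional NDRange. It is an illustrative sketch only - error handling is omitted and the kernel name, sizes and scaling factor are assumptions - not code from the project.

    /* Condensed host-side sketch of the OpenCL models (OpenCL 1.1 style). */
    #include <CL/cl.h>

    void run_scale(const char *src, const float *host_in, float *host_out, size_t n)
    {
        cl_platform_id platform;
        cl_device_id   device;
        clGetPlatformIDs(1, &platform, NULL);                              /* platform model */
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);   /* in-order queue */

        /* Online compilation of the kernel source for the discovered device. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        /* Memory objects (buffers) in global memory. */
        cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(float), (void *)host_in, NULL);
        cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);

        float factor = 0.001f;                        /* assumed scaling factor */
        clSetKernelArg(k, 0, sizeof(cl_mem), &in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &out);
        clSetKernelArg(k, 2, sizeof(float), &factor);

        /* Execution model: a 1D NDRange of n work items, work group size left to the runtime. */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, out, CL_TRUE, 0, n * sizeof(float), host_out, 0, NULL, NULL);

        clReleaseMemObject(in);  clReleaseMemObject(out);
        clReleaseKernel(k);      clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
    }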
2.3 The Mali T600 GPU Series

The Mali T600 series [4] is the first of a family of GPUs manufactured by ARM for the high end embedded device market and designed for both graphics and general compute applications. It uses a unified shader architecture - all computational cores are identical and can perform all kinds of programmable shader tasks [15]. Depending on the use case, between 1 and 8 cores exist on a chip. Each core has a dedicated L1 cache. A centralised memory management unit coordinates access to main memory through a single L2 cache shared among the cores. The chip has a task management unit that distributes work among the cores. Fig 2.3 shows the block diagram of the T604 used in this project.

Figure 2.3: The ARM Mali T604 GPU System Architecture [4]

Each core is composed of a tri-pipe execution component along with supporting fixed function hardware units. The parts of the tri-pipe are:

• One or more Arithmetic pipelines for computation. This pipeline has a SIMD design and works on 128 bit vector registers. Up to two arithmetic instructions can be executed per clock cycle.

• A single Texture pipeline that handles all memory accesses related to processing textures, with a throughput of one instruction per clock cycle.

• A single Load/Store pipeline that takes care of all other memory accesses. This is a vector based pipeline capable of loading four 32 bit words in one cycle.

The cores have a VLIW design. Fig 2.4 shows the architecture of a core.

Figure 2.4: Architecture of the Mali T600 Shader Core [15]

With respect to the OpenCL standard, the platform model is implemented with each shader core corresponding to a compute unit and the tri-pipes corresponding to processing elements. Work items execute as hardware threads on the cores. Up to 256 threads can be executed concurrently on each core [14]; the actual value depends on the number of registers required by the kernel. OpenCL kernels typically use only the Arithmetic and Load/Store pipelines; the Texture pipeline is used only for memory accesses related to Image processing and for barrier operations. The hardware scheduler assigns work groups to cores in batches, and all the work groups in a batch execute on the same core in round robin fashion. At every clock cycle, a thread is chosen to execute one instruction. Threads in a work group are selected in order of increasing identifiers. In an index space with more than one dimension, the lower dimensions are incremented as the inner indices of the sequence; for example, a 2D index space would be sequenced as [1,0], [2,0], . . . , [1,1], [2,1], . . . Threads in the next adjacent work group are then scheduled in the same fashion. All threads have their own program counter and do not execute in lockstep as in warp/wavefront based architectures [5].

The Mali GPU is built as part of a System on Chip (SoC) and thus has a host-unified memory model: the GPU and the host processor share the same physical memory. The OpenCL local and global memory spaces are implemented using RAM backed by the L1 and L2 caches. The Mali T600 GPUs conform to the OpenCL 1.1 full profile.

2.4 Visual SLAM and KinectFusion

Simultaneous Localisation and Mapping (SLAM) is a term used to describe the process by which a robot acquires a spatial map of a previously unexplored environment while keeping track of its own position in that environment. A global model of the environment is repeatedly updated using cues obtained as the robot moves around.
Visual SLAM refers to solutions that work using measurements obtained from camera sensors. The SLAM problem is challenging due to the statistically dependent nature of measurement noise: inaccuracies in earlier estimates accumulate and their effect becomes amplified over time. This issue, as well as the dynamic nature of the environment, makes solutions to the SLAM problem very computationally intensive [31].

KinectFusion [25, 16] is a visual SLAM algorithm for producing geometrically accurate 3D models of a physical scene using commodity depth cameras (Kinect). It leverages techniques from computer vision and robotics. A single global model of the scene is maintained. The model used is a 3D volumetric representation based on [7] in which each point (voxel) stores a Truncated Signed Distance Function (TSDF) value representing the current estimate of the distance of the point relative to the surface. The global model is updated using new depth data obtained from the camera; new views of the scene result in a more accurate model in the classic Bayesian manner. At each point in time, data from the depth camera are used to predict the current pose of the camera, and the TSDF values for points that lie within the current camera frustum are updated. Camera pose estimation is carried out using the Iterative Closest Point (ICP) algorithm [27]. Example applications of KinectFusion include augmented reality systems, 3D printing and physics simulation. Fig 2.5 shows an example scene reconstruction.

Figure 2.5: Example 3D scene reconstruction using KinectFusion [16]

Processing of the depth data occurs in several stages, each of which contains a high degree of parallelism. A detailed analysis of these stages is presented in section 3.1. KinectFusion, as well as the algorithms it leverages, has been designed with execution on the GPU in mind. Fig 2.6 shows the processing flow of the system.

Figure 2.6: KinectFusion Execution Workflow [25]

2.5 Related Work

The recent increase in the amount of processing power available on mobile and embedded devices means that they have been actively considered as tools for visual computing. The works relevant to this project are presented below.

Parallel Tracking and Mapping (PTAM) [18] is a keyframe based SLAM algorithm designed to work with monocular RGB cameras and used in Augmented Reality applications. The concepts introduced by this algorithm serve as a foundation for the KinectFusion algorithm. PTAM provides real time tracking of the camera's position by repeatedly running two procedures in parallel: Tracking and Mapping. Its effectiveness in generating room scale maps and efficient camera tracking has been demonstrated on desktop computers. Klein and Murray [19] investigated the applicability of the algorithm on mobile phones using the Apple iPhone 3G. Their implementation was designed to work on the phone's CPU. Owing to the single threaded nature of the device's processor, the two parts of the algorithm were performed in an alternating manner - a foreground thread handled camera tracking while map optimisation was performed in the background. The map was optimised only when no new frame was available for camera tracking. Because it depends on a large number of data points per frame, the PTAM algorithm is computationally intensive and relies on the high clock rate of the processor to deliver real time results.
The limited processing power of the iPhone compared with a desktop machine meant that a straightforward implementation would not perform effectively on the mobile phone. This problem was tackled by reformulating the algorithm. The mapping procedure, known as bundle adjustment, works by constructing a 4 level image pyramid to which all possible landmarks are added; this may result in data duplication. The mobile phone implementation uses only a subset of the available data and a different pyramid construction procedure. Two other modifications to the mapping process were made. Firstly, measurements of any given map point observed by multiple keyframes are ranked in order of usefulness and only the most useful ones are retained. This creates a less dense map, in contrast with the desktop implementation, which actively seeks to construct the densest possible map. The other modification involves erasing redundant keyframes once a predefined keyframe count threshold is exceeded. The tracking procedure was also reformulated: the stage that ensures tracking accuracy through large feature measurements was omitted, a change occasioned by the limited bandwidth of the phone's CPU and the sparse map used. The end result of these adjustments is that while the PTAM algorithm worked on the phone, it was significantly less accurate than the desktop version.

2.5.1 Visual Computing Acceleration

The limited processing power of mobile CPUs has led researchers to consider using other special purpose processors present on the devices as accelerators. Several possibilities have been investigated.

An early approach involved the use of Field Programmable Gate Arrays (FPGAs). An FPGA is an integrated circuit whose function can be altered by changing the interconnectivity of its components. This reconfiguration is achieved by downloading a bitstream that specifies the functionality to be implemented. Maclean [22] evaluated the suitability of FPGAs for vision applications. FPGAs make it possible to exploit the parallelism inherent in vision algorithms; however, their lack of support for floating point operations and, even more importantly, the very low level hardware programming required to configure them greatly reduce their applicability.

A slightly different approach was pursued by Seung et al [20]. Their work combined a software implementation with hardware level acceleration, in the context of augmented reality applications on an Intel Atom powered Mobile Internet Device. The application runs on the CPU and offloads the computationally intensive parts to custom built hardware accelerators. The work focused on accelerating image recognition and matching algorithms. The accelerators contained elements such as static RAM, a custom computation pipeline and a control unit. Experimental results show that, depending on the size of the input images, the use of custom accelerators resulted in a 14 times speedup with respect to a heavily optimised CPU only implementation. However, the use of purpose built hardware to accelerate parts of an application prevents this method from being generally applicable.

A better approach is to have a single device capable of accelerating all applications. The GPU on mobile devices serves this purpose; its advantages include floating point support, programmable compute cores and a parallel architecture. Kwang-Ting and Yi-Chu [6] implemented a face recognition algorithm on a smartphone.
The use of the embedded GPU as an accelerator resulted in speedups and improved energy efficiency compared with a CPU only implementation. Additionally, it was concluded that for small workloads, mobile GPUs provide greater energy efficiency than their desktop counterparts; the loss of efficiency beyond a threshold was attributed to the limited cache sizes of mobile GPUs. Singhal et al [29] evaluated several image processing algorithms (speeded up robust feature detection, non-photorealistic rendering and stereo matching) on an embedded GPU. The results presented showed successful acceleration compared with a CPU only implementation. Despite being demonstrably suitable for general purpose processing, the fact that mobile GPUs could only be programmed using graphics standards like the OpenGL ES API limited their use, as not all algorithms can be expressed in terms of graphics primitives. Fortunately, several embedded device manufacturers have begun providing OpenCL support for general purpose programming.

2.5.2 OpenCL-based Embedded Vision Systems

Wang et al [35] reported the first use of OpenCL for implementing computer vision applications on a real mobile device. Previous work by Leskela et al [21] had used OpenGL to emulate the OpenCL embedded profile on mobile devices. Wang et al used the object removal algorithm as a case study. All the computations in the algorithm were implemented as OpenCL kernels and executed on the GPU, with the CPU serving only as application coordinator. The primary optimisation employed was the use of GPU local memory as a software managed cache for frequently accessed data. The authors presented results of their implementation's performance for various problem sizes. While no comparison was made with a base implementation, the authors concluded that executing the class of computer vision problems typified by object removal is feasible on mobile devices.

A follow up work by Wang et al [34] evaluated the performance of the Scale-Invariant Feature Transform (SIFT) algorithm on a smartphone. The implementation was done on the Qualcomm Snapdragon S4 chipset, which supports OpenCL Embedded Profile acceleration for both the CPU and GPU. Profiling results were used to guide efficient partitioning of tasks between the two processors. Optimisations applied include appropriate data structure selection (OpenCL Images, to take advantage of implicit vectorisation) and the elimination of branches in kernel code. Additionally, prior knowledge about the data distribution was used to limit the search space of the application, and local memory was used to reduce the effect of memory latency. The results presented demonstrated superior performance of this heterogeneous approach compared with a CPU only implementation.

A common theme of these reports is that while harnessing the heterogeneous capabilities of embedded devices makes more computing power available, vision algorithms usually need to be restructured to obtain acceptable results. While speedups are obtained relative to a mobile CPU only implementation, performance still falls below that of desktop systems. Observe that while mobile and embedded devices contain more accelerators than just GPUs, at this time only the GPU is supported by OpenCL implementations.

Chapter 3
Application Analysis

This chapter presents an overview of the KinectFusion algorithm, focusing on how its structure maps to the OpenCL programming model. This is followed by a discussion of the results of profiling the application.
3.1 Algorithm Overview

KinectFusion operates by repeatedly refining a model of the scene being reconstructed using new depth data. Fig 3.1 gives a high level view of the algorithm.

Figure 3.1: Execution flow of the KinectFusion application

The steps in the process are described in detail below.

Initialisation

At application startup, the required memory buffers are allocated. Most of the buffers are allocated by the host using OS memory management calls; OpenCL is used only to allocate space for the data structure representing the global model of the scene. This volume is assigned as contiguous linear memory on the GPU.

Frame Acquisition

Each iteration of the algorithm begins with a new depth frame being collected from the Kinect camera. A series of preprocessing steps is carried out on this raw depth map as follows:

1. Given the difference in units between the live depth frame and those used by the application (millimetres and metres respectively), each pixel in the frame is scaled by dividing by a constant factor of 1000. This is a classic case of an embarrassingly parallel process and maps directly to the SIMD capabilities of the GPU.

2. The depth data received from the Kinect camera are noisy and would reduce reconstruction accuracy if used directly. A noise correcting bilateral filter is therefore applied to each frame before further processing. This is essentially a convolution using a Gaussian mask that computes each output pixel by aggregating the contributions of the input pixels that lie within the radius of the filter centred on the current input position. Given that the computation of each point in the output frame is independent, a straightforward implementation is to assign each output point to a work item. The overlap of input pixels used by consecutive threads creates opportunities for data sharing and vectorisation.

3. A three level depth pyramid is constructed using the filtered depth image as its base, with each level computed from the preceding level. Each value in the output depth map is the result of block averaging and sampling (at half resolution) the values in the input depth map. Processing of each output value is independent and can be executed in parallel.

4. Each value in the filtered depth frame is converted from image coordinates to the Kinect sensor's reference frame to generate a vertex map. This is computed, for each point in the depth map, as the product of the depth measurement, the inverted camera calibration matrix and the pixel coordinate. All computations are independent and can be performed in parallel. A normal map is then generated from the resulting vertex map; each vector in the normal map is computed using neighbouring points in the vertex map. Once again, all output computations are independent and proceed in parallel. The computation of vertex and normal maps is repeated for each level in the pyramid using the corresponding depth map.

Camera Tracking

This step uses the just computed vertex and normal maps to estimate the current pose of the camera relative to the acquired depth map. The pose is represented as a 4 × 4 matrix. Pose estimation involves the use of the ICP algorithm to determine the relative transformation that most closely aligns points in the current frame with those of the previous frame. The process involves two steps executed in order for each pyramid level. The steps are:
1. Track: A process known as perspective data association is used to find correspondences between points in the current frame and the previous frame. It uses a combination of the previous camera pose and the raycasted pose (if any). Each point in the previous vertex map is converted to camera coordinates and perspective projected to image coordinates. The resulting point is used to index the current vertex and normal maps. The output of this step is the set of corresponding points that lie within a threshold (specified using Euclidean distance and angle).

2. Reduce: The output points of the track step are summed using a tree based reduction. This computation is carried out in parallel, using barrier synchronisation between reduction stages to ensure the right values are used in the summations.

These two steps are repeated for a prespecified number of iterations per pyramid level. The linear system resulting from each iteration is solved using a Cholesky decomposition. Both the track and reduce steps are inherently parallel and fit GPU computation directly, while solving the linear system of equations maps to the CPU given the recursive nature of the task.

Volumetric Integration

The result of the camera tracking stage determines whether the current pose differs significantly from the previous pose (by a pre-specified threshold). If it does, the new frame is fused into the global volumetric representation of the scene. This is accomplished by sweeping over each point in the volume and updating the value of the SDF at each point that lies in the current depth frame. In order to determine whether a point lies in the current frame, it is converted to a vertex in global 3D coordinates using the current camera matrix. Processing of each point in the volume is independent and can be executed in parallel. Given the number of points that need to be visited (for example, a volume of side length 256 has 256³ = 16,777,216 voxels), this stage is only feasible on the GPU.

Raycasting

Raycasting is a standard process in computer graphics used to create 3D perspectives in 2D images. In KinectFusion, it is used as the first step in rendering the scene being reconstructed: it extracts the surface embodied in the global volume. This is done by walking a ray from each point in the output image and traversing the volume until a surface is encountered. A surface is identified as the point along the ray where the SDF stored in the volume changes sign. Each point identified is converted to a global 3D coordinate (vertex) and normal for rendering. The output of raycasting is also used as an additional input for determining the pose of the camera in the next iteration. Walking the individual rays from each starting point in the output image is independent and can be carried out in parallel.

In conclusion, the KinectFusion algorithm is composed of several stages connected in a rigid pipeline. Each stage of the pipeline is highly data parallel and maps directly to the OpenCL programming model.

3.1.1 Dependence Analysis

The purpose of this analysis is to identify the amount of task parallelism contained in the algorithm, that is, the extent to which operations in the algorithm can be overlapped. Within each iteration, computation proceeds in a pipelined manner with each stage depending on the output of the previous stage. This hard dependence makes it impossible to interleave the steps within an iteration. Thus, an in-order OpenCL queue has to be used to ensure this dependence is satisfied.
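As an illustration, the preprocessing stages can simply be enqueued back to back on such a queue, and the FIFO semantics satisfy the dependences without explicit events. The kernel handles and variable names below are illustrative only, not the project's actual identifiers, and kernel arguments are assumed to have been set already:

    #include <CL/cl.h>

    /* Sketch: on an in-order queue (the OpenCL 1.1 default) each stage starts only
       after the previous one has completed, so the pipeline's hard dependences are
       satisfied without event objects. */
    void enqueue_preprocessing(cl_command_queue queue,
                               cl_kernel mm2meters, cl_kernel bilateral,
                               cl_kernel half_sample, cl_kernel depth2vertex,
                               cl_kernel vertex2normal)
    {
        size_t frame[2] = { 320, 240 };   /* half-resolution frames used in this project */

        clEnqueueNDRangeKernel(queue, mm2meters,     2, NULL, frame, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, bilateral,     2, NULL, frame, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, half_sample,   2, NULL, frame, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, depth2vertex,  2, NULL, frame, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, vertex2normal, 2, NULL, frame, NULL, 0, NULL, NULL);

        clFinish(queue);                  /* host resumes once the whole chain is done */
    }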
More importantly, the data structures used to represent the scene and the camera pose preserve state across loop iterations. This introduces a cross iteration dependence (of distance 1) that greatly limits the extent to which iterations can be overlapped. The only activities that do not depend on the state of the previous iteration are those related to frame acquisition and preprocessing (retrieving new depth data, performing the unit conversion, applying the bilateral filter and half sampling to generate the pyramids). It is thus concluded that the algorithm has a very low degree of task parallelism.

3.1.2 Task Assignment

As previously stated, most parts of the KinectFusion algorithm are highly data parallel and are suitable for execution on the GPU. The only actual computation performed on the CPU is solving the linear system produced as part of camera tracking. Therefore, apart from orchestrating kernel execution and data movement, the CPU is idle for the most part. This is acceptable on desktop systems with highly parallel GPUs. However, on embedded devices this may not represent the best use of resources, because the CPU and GPU have comparable core counts and the CPU has a higher clock frequency than the GPU. Tests were done to compare the performance of several functions on the CPU and GPU. The CPU versions were written as serial C functions because the OpenCL implementation from ARM does not support acceleration on the CPU. Table 3.1 shows the results of this experiment. The GPU gave the better performance for all the functions; implementing the tasks as OpenCL kernels thus represented the most efficient solution. The issue of using the CPU and GPU concurrently is discussed in detail in section 4.3.

Table 3.1: Comparison of execution times on the CPU and GPU. All values shown are in milliseconds.

Function            CPU Execution Time    GPU Execution Time
Unit Conversion     0.76                  0.23
Bilateral Filter    426                   5.69
Generate Pyramid    2.99                  0.19
Depth to Vertex     1.25                  0.26

3.2 Kernels Development and Profiling

The implementation phase started from a basic version obtained by direct manual translation of a CUDA implementation developed by Reitmayr [13]. In order to determine the most profitable direction of investigation, this initial implementation was instrumented to identify the hotspots in the application - the kernels that contribute the most to total execution time. Due to the absence of OpenCL profiling tools for the Mali system, results were generated using the profiling functionality available in the OpenCL API, by attaching an event object to each enqueued OpenCL command. Fig 3.2 shows the percentage of application execution time accounted for by each kernel. The results take into account the number of times each kernel was invoked: while a single invocation of the bilateral filter kernel takes longer than one of the track kernel, the latter is invoked many more times in the course of program execution, so optimising it would be more beneficial. Six kernels account for 97% of execution time. Disregarding the kernels required for rendering, as they are not part of the core workflow of the algorithm, the top three are the integration, tracking and reduction kernels. These kernels were investigated the most in the course of this project. In addition to execution time, data about the maximum number of concurrently executing hardware threads per GPU core were collected for each kernel.
This was done by querying the CL_KERNEL_WORK_GROUP_SIZE attribute for each kernel using the clGetKernelWorkGroupInfo OpenCL call. As previously mentioned, each Mali core supports up to 256 concurrent threads; to ensure optimum performance, at least 128 threads should run concurrently per core [5]. Table 3.2 shows that while most kernels possessed the optimum degree of concurrency, three kernels (integration, rendering and raycasting) could only be launched with 64 concurrent threads. The explanation is that the relative complexity of these kernels results in high register usage.

Table 3.2: Maximum number of concurrent work items per kernel.

Kernel                  Max Workgroup Size
Render Input            64
Integration             64
Raycasting              64
Reset                   256
Bilateral Filter        128
Track                   128
Reduce                  128
Vertex to Normal        128
Render Light            256
Render Depth Map        256
Render Track Result     256
Depth to Vertex         256
Millimetres to Metres   256
Generate Pyramid        256

Figure 3.2: Contribution of the various kernels to the total application execution time

In summary, the KinectFusion algorithm has enough data parallelism to justify an explicitly parallel implementation using OpenCL. The main steps of the algorithm correspond directly to OpenCL kernels. Profiling an initial implementation of the algorithm has revealed the kernels that contribute the most to overall execution time and that would benefit most from optimisation. The next chapter presents the results of optimising the KinectFusion algorithm.

Chapter 4
Program Optimisations

The optimisations presented in this chapter were developed on an Arndale development board (http://www.arndaleboard.org/wiki/index.php/Main_Page) with the following characteristics:

• Samsung Exynos 5250 chip with an integrated dual core Cortex-A15 CPU clocked at 1.7 GHz and a quad core Mali T604 GPU clocked at 533 MHz.
• 2 GB main memory
• Ubuntu 12.04.2 Linux, 3.11.0-arndale+ kernel

The depth data used for testing the KinectFusion algorithm were obtained by playing back previously recorded Kinect depth images. Generating and replaying the data were done using the OpenNI library interface (https://github.com/OpenNI/OpenNI). Each input frame was 640 × 480 pixels in size. For the purposes of this project, the frames were processed at half this resolution; that is, the frames were sampled to generate 320 × 240 images.

4.1 General Optimisations

This section discusses best practices applied that are not specific to the KinectFusion application.

4.1.1 Reduction of OpenCL Runtime Overheads

Each complete execution of the KinectFusion algorithm results in a large number of OpenCL kernel invocations; the exact number is determined by the number of input depth frames received (for example, processing 100 frames resulted in 8,000 kernel invocations). Kernels are created from OpenCL programs using the clCreateKernel function and disposed of using the clReleaseKernel function. A naive implementation of the system would call these two functions before and after each kernel invocation respectively, incurring significant overhead. As part of program optimisation, all the kernels were instead created once at program initialisation using the clCreateKernelsInProgram function. The kernels were then placed in an associative map (keyed by kernel name) that is consulted each time a kernel is invoked. The code to release the kernels was moved to an exit function.
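A condensed sketch of this scheme is shown below. It assumes a single cl_program containing all the kernels and uses a std::map keyed by the kernel function name; the function and variable names are illustrative rather than the project's actual code.

    #include <CL/cl.h>
    #include <map>
    #include <string>
    #include <vector>

    // Cache of kernel objects, built once at start-up and looked up by name afterwards.
    std::map<std::string, cl_kernel> kernel_cache;

    void build_kernel_cache(cl_program program)
    {
        cl_uint count = 0;
        clCreateKernelsInProgram(program, 0, NULL, &count);       // query number of kernels
        std::vector<cl_kernel> kernels(count);
        clCreateKernelsInProgram(program, count, kernels.data(), NULL);

        for (cl_uint i = 0; i < count; ++i) {
            char name[128];
            clGetKernelInfo(kernels[i], CL_KERNEL_FUNCTION_NAME,
                            sizeof(name), name, NULL);
            kernel_cache[name] = kernels[i];                       // keyed by kernel name
        }
    }

    void release_kernel_cache()                                    // called once at exit
    {
        for (std::map<std::string, cl_kernel>::iterator it = kernel_cache.begin();
             it != kernel_cache.end(); ++it)
            clReleaseKernel(it->second);
        kernel_cache.clear();
    }

At each invocation site the kernel is then fetched from kernel_cache instead of being created and released around every enqueue.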
For a small increase in memory footprint (the 22 kernels and helper functions in the program occupied 88 KB of main memory), total execution time was reduced by over 20%.

4.1.2 Memory Optimisations

The OpenCL specification was originally developed for architectures with separate host and device memory systems. Thus, the default behaviour is for the host to allocate space for buffers and transfer them to the devices for processing; the output data are then copied back to the host at the end of kernel execution. For memory bound kernels that do not involve much computation, this copying is expensive. The standard provides an alternative way for devices to access host memory directly, through OpenCL mapping and unmapping operations; however, the buffer has to be allocated by the OpenCL runtime. This zero copy approach is typically not used on systems with discrete GPUs because the reduced bandwidth resulting from accessing memory via the PCI bus usually outweighs the benefits. However, the host-unified memory architecture of the Mali makes it a useful optimisation to explore. Note that even when the host and devices share the same physical memory, the default behaviour of OpenCL is for data to be copied before the devices can access it, which in this setting is clearly redundant and wasteful. This project used the mapping/unmapping approach to eliminate data copying. The buffers were created using the CL_MEM_ALLOC_HOST_PTR flag, as it yields the best performance on the Mali [5].

In situations where copying could not be avoided, steps were taken to reduce its adverse effect as much as possible. One optimisation, applied to kernels that wrote to more than one output buffer, was to overlap the writes by using non-blocking buffer write commands. To ensure memory consistency, the CPU still had to wait for the memory transfers to complete before launching the next kernel. Due to the pipelined nature of the algorithm, the output of one stage serves as the input of the next stage, so there is no need to read back the output of one stage only to write it again as input for the next kernel; this optimisation was also applied to the application.

In addition to the CL_MEM_ALLOC_HOST_PTR flag, two other flags exist for buffer creation:

• CL_MEM_USE_HOST_PTR, which indicates that the OpenCL runtime should store the buffer in the memory referenced by the pointer used in creating it.

• CL_MEM_COPY_HOST_PTR, which indicates that the runtime should allocate memory and copy into it the contents of the pointer used in creating the buffer.

The specification does not mandate how implementations enforce these flags. Thus, rather counter-intuitively, the CL_MEM_COPY_HOST_PTR flag was found to give better performance on the Mali GPU.

4.2 Kernel Optimisations

This section reports on optimising the KinectFusion kernel code for the Mali GPU. At a high level, optimising kernel code involves using vectors for both arithmetic and memory operations, using appropriate data types, eliminating redundant computations and taking advantage of the speed of built in functions [1, 3, 5]. Additionally, choosing the right work group size (recommended to be a power of 2; if no preference exists, letting the runtime decide the optimum size is preferred [5]), avoiding barrier operations where possible and maintaining a high ratio of arithmetic to memory operations improve kernel efficiency.
Due to the SIMD nature of the shader cores on the Mali, the use of explicit vector instructions yields the highest speedup. Based on an analysis of the algorithms implemented, the kernels in the application belong to one of the following three categories:

• Kernels that could be immediately vectorised. There was only one such kernel and it did not contribute much to the total execution time.

• Kernels that could be reformulated to support the use of vector instructions.

• Kernels that contained little or no opportunity for vectorisation and that, due to the nature of the problems they solve, could not be reformulated.

4.2.1 Unit Conversion Kernel

This is a very simple kernel that produces each output element by scaling the corresponding element in the input buffer. A scalar implementation creates a work item for each element in the output depth frame. As previously stated, the project was carried out with depth frames half the size of the input image, so each element in the output frame is obtained from the element in the input frame with twice its coordinates; for example, the output item at (1,3) is computed from the input item at (2,6). The scalar implementation of the unit conversion is shown in Listing 4.1. Each work item first determines its position in the execution space and uses this value to access both the output and input depth frames. It should be pointed out that, in addition to being computationally inefficient, this implementation also wastes memory accesses, because only half of the items in each cache line are used.

The vector implementation loads 2n items from the input and writes n items to the output frame. The values allowed for n in OpenCL are 2, 4 or 8. Listing 4.2 shows the implementation with n = 4. It uses the convenience 'even' access specifier to refer to the correct indices in the horizontal dimension; the stride variable does the same for the vertical dimension. The Mali GPU can operate on vectors of 128 bits in a single cycle; larger vectors are processed in multiple cycles. There is therefore a trade off in this kernel between the number of items processed per vector operation and the number of cycles required to carry out the computation. Fig 4.1 shows execution times for the possible combinations of load and store vector widths as well as for the scalar implementation. The read 8/store 4 and read 16/store 8 configurations provide the best performance. This was the only kernel in the application with immediately vectorisable statements. Despite the speedup resulting from this optimisation, the kernel does not contribute much to the overall execution time of the application.

Listing 4.1: Scalar implementation of the unit conversion kernel

    __kernel void mm2metersKernel1(__global float *depth,
                                   const uint2 depthSize,
                                   const __global ushort *in,
                                   const uint2 inSize)
    {
        uint2 pixel = (uint2)(get_global_id(0), get_global_id(1));
        depth[pixel.x + depthSize.x * pixel.y] =
            in[pixel.x * 2 + inSize.x * pixel.y * 2] / 1000.0f;
    }

Listing 4.2: Vector implementation of the unit conversion kernel. This version loads eight items and stores the four items at positions 0, 2, 4 and 6.
__kernel void mm2metersKernel1(__global float* depth,
                               const uint2 depthSize,
                               const __global ushort* in,
                               const uint2 inSize)
{
    uint2 pixel = (uint2)(get_global_id(0), get_global_id(1));
    const uint stride = depthSize.x / 4;
    ushort8 inVal = vload8(pixel.x + 2 * stride * pixel.y, in);
    vstore4(convert_float4(inVal.even) / 1000.0f,
            pixel.x + stride * pixel.y, depth);
}

Figure 4.1: Execution times of various configurations of the vectorised unit conversion kernel and the scalar implementation. The time units are milliseconds.

4.2.2 Bilateral Filter Kernel

Bilateral filtering is a standard image enhancement process. Each pixel in the output image is obtained by computing the weighted average of the intensity values of pixels within a neighbourhood of the corresponding point in the input image [32]. A Gaussian filter of radius 2 was used in this project and all optimisations applied took this filter size into consideration.

For every point in the input depth frame with a non-zero value, the output is computed using the values located in a 5 × 5 square centred at that point. For each point within the square that has a non-zero value, its contribution to the sum is computed by multiplying the relevant Gaussian weights with the result of evaluating a univariate Gaussian distribution having a predefined variance and a mean equal to the depth value at the centre of the square. As with all convolutions, care must be taken when processing pixels that lie on the edge of the image, to avoid reading beyond the image boundary.

The original implementation of the bilateral filter kernel uses only scalar operations. Each point in the depth image is assigned to a work item. The kernel has a doubly nested loop that averages the values located in the 5 × 5 square centred on the pixel assigned to the work item. The square is traversed across the columns to ensure coalesced access to global memory. Given that this implementation targeted the NVIDIA architecture with its warp-based thread scheduling, a first step towards optimising it was to evaluate whether coalesced memory access yields any performance benefit on the Mali architecture, which uses a different thread execution method.

The Gaussian weights are read by all work items and remain constant throughout the application's execution, so it is reasonable to place them in constant memory. Although the OpenCL specification does not mandate how this memory region is handled, some architectures, such as AMD's [12], optimise accesses to it. No documentation exists about how constant memory is handled by the Mali runtime, so experiments were done to evaluate the effect of using it.

Fig 4.2 shows the result of executing the bilateral filter kernel for all combinations of constant/global memory and coalesced/non-coalesced access. The results show that coalescing memory accesses reduces performance, and that using constant memory offers no benefit. The best configuration (global memory for the Gaussian filter and non-coalesced access to the input image) had an average execution time of 4.673 ms, which represents a 6.6% and 11.9% improvement over the default and worst configurations respectively. This configuration was used as the baseline for measuring the effectiveness of subsequent optimisations.
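For reference, the constant-memory experiment described above amounts to changing the address space qualifier of the weight table in the kernel signature. The toy kernel below is not the project's bilateral filter; it is a minimal sketch with illustrative names showing the __constant form of such an argument.

/* Toy kernel illustrating a weight table passed in the __constant address
 * space rather than __global. Names are illustrative only; the experiment
 * above changes just this qualifier on the Gaussian weight argument. */
__kernel void weighted_scale(__global float *out,
                             const __global float *in,
                             __constant float *gaussian)   /* radius-2 filter: 5 weights */
{
    const size_t gid = get_global_id(0);
    out[gid] = in[gid] * gaussian[gid % 5];
}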
The need to detect and handle accesses to the border pixels of the input depth image prompted an investigation of the OpenCL Image data type in this kernel, because of the automatic edge clamping it provides. It also represented a chance to evaluate the effectiveness of using the texture pipeline of the Mali GPU. While the automatic edge clamping simplified the kernel code, use of the Image data type resulted in an almost 4% reduction in performance, in addition to the overhead involved in converting host buffers to OpenCL Images. This performance degradation motivated the evaluation of vectorisation as a means of optimising this kernel. Due to the branch statements required to ensure that only values greater than zero are processed, vectorising the kernel is not straightforward.

Figure 4.2: Execution times of various configurations of memory access and memory region. All times are in milliseconds.

However, the fact that coalesced access to the input array does not improve performance makes it possible to unroll the inner loop, with the implication that each work item now traverses its given square along the rows. Loop unrolling requires that the values needed by the merged iterations be available at the same time, which is accomplished using vector loads. Since there are five iterations of the inner loop and OpenCL supports vector loads of 2, 3, 4, 8 and 16 elements, several unroll factors are possible:

• An unroll factor of 2, which requires two loop iterations, each loading two elements, plus a scalar load at the end to handle the last element.

• Full unrolling with two vector loads: first 3 elements, then 2 elements.

• A single vector load of 4 elements and a scalar load to handle the last element.

• A single vector load of 8 or 16 elements, utilising only the first 5 elements.

In all cases, each point in the depth image is assigned to a single work item. Note that this optimisation can only be applied by work items not responsible for border pixels; pixels on the boundary have to be processed using the scalar approach. With an image size of 320 × 240 and a filter radius of 2, (320 − 2 × 2) × (240 − 2 × 2) = 316 × 236 points are computed using the vectorised memory loads. Observe that once loaded, accessing the individual items of a vector using the swizzle operator incurs no additional hardware cost [14].

Fig 4.3 shows how effective each version of unrolling the inner loop is compared with the best serial version (global, non-coalesced memory access). As expected, the reduction in the number of memory loads and the elimination of loop branching instructions obtained from unrolling improved the performance of the kernel. The one exception was the version unrolled by a factor of 2, which degraded performance. The highest speedup was obtained by issuing one vector load of 4 elements and a single scalar load per row of the square. The items loaded are all 32-bit floating point numbers, so this kernel loads 128 bits at a time, which matches the optimal vector size on the Mali [5].

An interesting aspect of the bilateral filter kernel is the amount of data sharing between work items. Within the 5 × 5 square, two consecutive items share four elements per row.
This sharing is why the versions that load 8 and 16 items do not result in significant performance reduction, despite the fact that only a small fraction of the items read from memory is used in computing the output. Each Mali L1 cache line can hold sixteen 4-byte floating point numbers, so each line loaded by the first work item in a work group may be reused by up to three adjacent threads. Listing 4.3 shows the code for the version that loads four items at a time.

Listing 4.3: Bilateral filter kernel with the inner loop fully unrolled. Four items are loaded at a time.

__kernel void bilateral_filterKernel(__global float *out,
                                     const __global float *in,
                                     const __global float *gaussian,
                                     const float e_d,
                                     const int r)
{
    const uint2 pos    = (uint2)(get_global_id(0), get_global_id(1));
    const uint2 size   = (uint2)(get_global_size(0), get_global_size(1));
    const float center = in[pos.x + size.x * pos.y];

    if (center == 0)
    {
        out[pos.x + size.x * pos.y] = 0;
        return;
    }

    float sum = 0.0f;
    float t = 0.0f;
    const float denom = 2 * e_d * e_d;

    if (pos.x >= 2 && pos.x < size.x - 2 && pos.y >= 2 && pos.y < size.y - 2) // bounds check
    {
        for (int i = -r; i <= r; i++)
        {
            const int offset = (i + pos.y) * size.x + pos.x - r;
            float4 row = vload4(0, in + offset);

            if (row.s0 > 0)
            {
                const float mod = sq(row.s0 - center);
                const float factor = gaussian[i + r] * gaussian[0] * exp(-mod / denom);
                t += factor * row.s0;
                sum += factor;
            }
            if (row.s1 > 0)
            {
                float factor = gaussian[i + r] * gaussian[1] * exp(-sq(row.s1 - center) / denom);
                t += factor * row.s1;
                sum += factor;
            }
            if (row.s2 > 0)
            {
                float factor = gaussian[i + r] * gaussian[2] * exp(-sq(row.s2 - center) / denom);
                t += factor * row.s2;
                sum += factor;
            }
            if (row.s3 > 0)
            {
                float factor = gaussian[i + r] * gaussian[3] * exp(-sq(row.s3 - center) / denom);
                t += factor * row.s3;
                sum += factor;
            }
            // Handle the last item
            const float pix = in[offset + 4];
            if (pix > 0)
            {
                const float mod = sq(pix - center);
                const float factor = gaussian[i + r] * gaussian[4] * exp(-mod / denom);
                t += factor * pix;
                sum += factor;
            }
        }
    }
    else // may read out of bounds; process as scalar
    {
        for (int i = -r; i <= r; ++i)
        {
            for (int j = -r; j <= r; ++j)
            {
                const uint2 curPos = (uint2)(clamp(pos.x + j, 0u, size.x - 1),
                                             clamp(pos.y + i, 0u, size.y - 1));
                const float curPix = in[curPos.x + curPos.y * size.x];
                if (curPix > 0)
                {
                    const float mod = sq(curPix - center);
                    const float factor = gaussian[i + r] * gaussian[j + r] * exp(-mod / denom);
                    t += factor * curPix;
                    sum += factor;
                }
            }
        }
    }
    out[pos.x + size.x * pos.y] = t / sum;
}

The next version explicitly exploits the sharing of elements among adjacent work items by combining inner loop unrolling with explicit vector arithmetic. Observe that all the values needed by four consecutive work items in each input row lie in 8 contiguous memory locations, as illustrated in Fig 4.4. A single load of 8 elements is therefore sufficient for computing the values required by 4 consecutive output elements.
This observation makes it possible to have one work item compute the values of four adjacent output elements, a technique known as thread coarsening. During each iteration, the 8 items required are loaded using a single vector operation. Each element in the vector is then tested in turn, and those with values greater than zero are used to update the running sums of all the output elements that require them.

Figure 4.3: Speedup of the various versions of the inner-loop-unrolled bilateral filter kernel relative to the best performing serial version.

To support the use of explicit vector arithmetic, a 4-wide vector mask is used to implement the sliding of the filter across the row of 8 items. The values in the mask are turned on only for the output elements that depend on the current input element being processed. At the end of the iterations, output elements are updated only if the corresponding input values are non-zero. This formulation potentially involves redundant computations, depending on the distribution of values in the input depth frame. However, since vector operations take one cycle to process all items in a vector, this overhead is small compared with the reduction in memory operations: five reads from the input depth image now suffice to compute four output elements, as against up to 100 accesses required by the scalar implementation. Note that the width of the border has now increased from 2 to 4 in the horizontal dimension, so this version processes (320 − 2 × 4) × (240 − 2 × 2) = 312 × 236 points. This version took 2.74 ms to complete, representing a 30% improvement over the previous implementation.

As a final optimisation, tests were done to determine the most appropriate work group size for the bilateral filter kernel, since work group size selection affects memory usage efficiency [12, 14]. Fig 4.5 shows the results of several work group sizes for the thread-coarsened version. While there are some obviously inefficient configurations, performance was fairly constant across the different work group sizes. An interesting observation is that experimentally determining the best work group size (4 × 16) resulted in slightly better performance than allowing the runtime to select the optimum size. A size of 4 × 16 corresponds to only one work group being active per core.

Figure 4.4: Illustration of how, with a filter radius of 2, four consecutive elements can reuse a single vector load of 8 items.

Figure 4.5: Effect of work group size on the execution time (in milliseconds) of the bilateral filter kernel.

4.2.3 Volume Integration Kernel

As previously stated, this kernel requires every point in the 3-dimensional volume to be visited and the value stored at each point updated if the point lies in the current depth frame. Each point stores a running average (a signed distance function, SDF) that represents the current estimate of the distance to the sensor centre. To ensure scalability and maintain real-time reconstruction rates, the KinectFusion algorithm partitions the work in two dimensions instead of the 3D decomposition that is natural for the volume. Each thread then progressively visits each point along the third dimension, as sketched below. The goal of this formulation is to ensure coalesced memory access.
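A minimal sketch of this decomposition is given below. It is not the project's integration kernel; the name, argument list and update step are placeholders, and it shows only how each work item walks its (x, y) column of the volume along the z dimension, so that neighbouring work items touch neighbouring voxels on every iteration.

/* Illustrative sketch of the 2D work decomposition: one work item per
 * (x, y) column of the volume, looping over z. Names and arguments are
 * placeholders, not the project's actual integration kernel. */
__kernel void integrate_sketch(__global short2 *volume,     /* SDF value and weight per voxel */
                               const uint3 volumeSize,
                               const __global float *depth,
                               const uint2 depthSize)
{
    uint3 pix = (uint3)(get_global_id(0), get_global_id(1), 0);

    for (pix.z = 0; pix.z < volumeSize.z; ++pix.z)
    {
        /* Index of the current voxel in the linear volume layout. */
        const size_t idx = pix.x
                         + pix.y * volumeSize.x
                         + (size_t)pix.z * volumeSize.x * volumeSize.y;

        /* In the real kernel the voxel is projected into the depth frame
         * here, and the stored SDF/weight pair is updated only if the
         * projection falls inside the frame. */
        short2 data = volume[idx];
        /* ... update data using depth[...] ... */
        volume[idx] = data;
    }
}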
Given the observation that coalescing memory accesses does not improve performance on the Mali, the first step in attempting to optimise this kernel was to reformulate it to use a 3D iteration space: one work item per point in the volume and thus no looping. Performance dropped sharply; reconstruction was inaccurate and occurred at very low rates. This can be attributed to the large overhead required to create and schedule the hardware threads. However, the kernel could now be launched with 256 concurrent work items as against the 64 in the original version, which seemed to indicate that unrolling the loop would improve performance. The memory access pattern of the kernel and the thread scheduling policy of the Mali nevertheless made unrolling impossible: memory for the volume is laid out linearly, and the values used by a work item in successive iterations are separated by a distance of N². The kernel requires SPMD-style programming and contained no opportunities for vectorisation. Additionally, the absence of data sharing among work items prevented the application of thread coarsening.

The next optimisation opportunity investigated was work group size selection. The goal was to determine the work group dimensions that make the best use of the L1 cache, which is unique to each shader core. On each loop iteration, each work item may read one floating point value from the filtered depth map. Each point in the volume is first converted into image coordinates, and the lookup in the depth map is performed only if the resulting point lies within the depth map (that is, the point is aligned with the current pose of the camera). The value read from the depth map is used to decide whether the SDF value at that point in the global volume should be updated. The SDF value is made up of two short integers (accessed as a vector of size 2) representing the running average of the SDF and the weight assigned to it respectively. Thus, in each iteration every work item reads at most one 4-byte float from the depth map and two 2-byte shorts from the volume. Since the L1 cache lines on the Mali are 64 bytes long, each line holds 16 such items. To minimise the number of loads issued, cache lines should be shared as much as possible, and the thread scheduling policy on the Mali suggests that L1 cache sharing would be increased by work groups whose size in the horizontal index dimension is a multiple of 16.

Experiments were conducted to validate this hypothesis. Fig 4.6 shows the result of executing this kernel with the possible work group dimensions, as well as the result of leaving the work group size unspecified and allowing the runtime to pick the optimum configuration. With the exception of a few outliers, most configurations resulted in similar performance. Once again, it was possible to determine experimentally a work group size better than the runtime's choice. The results, however, do not support the initial hypothesis that work group sizes with a multiple of 16 in the horizontal dimension would be optimal; the best configuration was the 2 × 2 work group.

Figure 4.6: Execution times in milliseconds of the integration kernel for different work group sizes.

A 256 × 256 × 256 volume was used in this project. Thus, each work item executes the loop body 256 times and may both load from and store to memory in each iteration.
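For scale, and assuming the two 16-bit fields stored per voxel described above, the volume alone occupies 256 × 256 × 256 × 4 bytes = 64 MB, far larger than any on-chip cache, so each invocation of the integration kernel effectively streams the entire volume through main memory in addition to its depth map reads.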
The number of operations executed by each thread in each iteration depends on the value stored at the current point being processed. There is therefore a possibility of threads diverging, that is, executing different loop iterations, and this divergence may lead to irregular memory accesses that degrade performance. To determine whether thread divergence was a problem, all threads were forced to execute a barrier before exiting each loop iteration, which has the effect of keeping all threads on the same loop iteration. Performance did not change, leading to the conclusion that thread divergence is not a problem for this kernel.

As part of determining whether the SDF at a point in the volume should be updated, the kernel uses the square root function to compute the normalised distance of the current point from the origin. OpenCL defines three versions of this function with different speed and precision: sqrt, native_sqrt and half_sqrt. The native_sqrt variant is supposed to give the best performance on the Mali [5]. Fig 4.7 shows the result of executing the kernel with these three versions. In all three cases, precision was sufficient to generate correct results; however, use of native_sqrt actually reduced performance.

Figure 4.7: Execution times for different versions of the square root function used in the integration kernel.

4.2.4 Optimisation of the Track and Reduce Kernels

These two kernels cooperatively determine the current camera pose. The track kernel implements the ICP algorithm to identify points of correspondence between the current depth frame and the previous frame. It uses the camera pose computed in the previous iteration and the current vertex and normal maps, as well as the raycasted vertex and normal maps from the previous iteration if they exist. The output of tracking is summed in parallel by the reduce kernel.

Each work item in the track kernel groups three consecutive elements of the vertex and normal maps and treats them as a single input item. These points are loaded and stored together as vectors. Due to the number of conditional statements required to handle error conditions, this loading and storing represents the only means of applying vector operations in this kernel. A good option would have been to use the Image data type and access each group of three elements as a single CL_RGB object. However, OpenCL does not support the CL_RGB format for floating point numbers, which is the data type used to store both the vertex and normal maps. The alternative would have been the CL_RGBA format, but that would have increased the memory footprint, and the overhead of converting to and from the Image data type would have been significant.

Kernel code optimisations applied include reducing the amount of private memory used by the work items by avoiding parameter duplication; for example, the vertex and normal maps have the same size, so one size parameter was sufficient. The output of the kernel is a composite type made up of a floating point error value, an integer result flag and an array of 6 floating point numbers. The total size of this structure is 32 bytes, which is an integer multiple of the recommended 16-byte alignment size on the Mali [5]. Care was taken to ensure that both the host and OpenCL compilers used the best alignment for this structure, thus improving memory loads. (A simple change was to declare the array before the scalar types; this alone improved performance by almost 3%.)
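A sketch of a declaration with this layout is shown below. Apart from the array J, which appears in Listings 4.4 and 4.5, the field and type names are illustrative rather than taken from the project's source; the point is the 32-byte size, the explicit 16-byte alignment request and the array-first field order mentioned above.

/* Illustrative 32-byte output record for the track kernel: a 6-element
 * float array, a float error value and an int result flag. Apart from J,
 * the names are hypothetical. */
typedef struct __attribute__((aligned(16)))
{
    float J[6];     /* 24 bytes: per-pixel contribution */
    float error;    /*  4 bytes */
    int   result;   /*  4 bytes: status flag */
} TrackRow;         /* 32 bytes in total, a multiple of the 16-byte alignment */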
The largest improvement over the original version was obtained by modifying how the output is stored. The original version required three stores to save the output, as shown in the code snippet in Listing 4.4: both halves of the J array are first written using vector operations that store three items at once, followed by the saving of the whole output record. These vector stores into the private array are redundant and can be eliminated. Listing 4.5 shows the improved code. The speedup comes from the fact that the instructions setting the elements of the array access only registers (so long as the compiler has not spilled a variable to memory). This simple modification resulted in a 20% speedup of the kernel.

Listing 4.4: Code snippet showing redundant stores in the initial version of the track kernel

vstore3(referenceNormal, 0, ((float *) row.J));                          // write the first three elements of the J array
vstore3(cross(projectedVertex, referenceNormal), 1, ((float *) row.J));  // write the remaining three elements
output[pixel.x + outputSize.x * pixel.y] = row;                          // save the output

Listing 4.5: Improved version of the track kernel with redundant stores eliminated

row.J[0] = referenceNormal.x;
row.J[1] = referenceNormal.y;
row.J[2] = referenceNormal.z;
const float3 res = cross(projectedVertex, referenceNormal);
row.J[3] = res.x;
row.J[4] = res.y;
row.J[5] = res.z;
output[pixel.x + outputSize.x * pixel.y] = row;   // save the output

The reduce kernel contained the fewest opportunities for optimisation, as it has to be executed in a specific manner and with a fixed work group size to ensure correctness. The output of the tracking phase is shared among the work items in the different work groups. Each item computes the partial reduction (using addition) of one or more rows and stores the partial sums in private memory. The partial sums are then copied to local memory, which is unique to each work group, and each work item in the work group executes a barrier to ensure memory consistency. The first 32 items in each work group then aggregate the partial sums and copy the final output to global memory.

As part of the local reduction, each work item performs 22 multiply-and-accumulate operations in each loop iteration. As an optimisation attempt, these operations were replaced with the faster but less precise hardware MAD (multiply-add) and FMA (fused multiply-add) instructions; in both cases, however, the loss of precision resulted in incorrect results.

Despite the limited opportunities for optimisation, a speedup was accomplished by reusing the output buffer created by the track kernel as the input of the reduce kernel, thus eliminating some buffer management overhead. The pair of track and reduce kernels is called a fixed number of times per pyramid level per iteration of the algorithm; the image dimensions used in the tests and the three pyramid levels resulted in these kernels being invoked 19,278 times, so the reduction in overhead is significant.

4.3 Execution Optimisations

Section 3.1.2 highlighted the fact that the CPU remains idle during most stages of each iteration of the algorithm. A straightforward way of utilising the CPU and GPU concurrently is to begin processing the initial stages of the next iteration once the buffers become safe for reuse.
This is analogous to how processors hide the latency of memory operations by scheduling other instructions in their shadow. The earliest point at which the next iteration can safely be started is during raycasting; all operations up to, but not including, tracking can then be carried out. The next operation that lasts long enough to accommodate concurrently scheduled computation is the input rendering kernel. To ensure sustained performance, the GPU should be able to start working as soon as it completes the previous iteration without waiting for the CPU, which requires finely balancing the CPU execution times of the initial stages of the next iteration. This approach to utilising both processors is challenging to implement because it requires the OpenCL callback mechanism, which is not very robust (for example, enqueueing blocking commands in a callback is undefined). Additionally, the host code has to maintain state recording how far along the next iteration has progressed, to avoid duplicating work.

A different approach is to have the CPU and GPU work on the same task concurrently. One piece of work that can be naturally partitioned is the integration kernel: the volume can be split along the z dimension and divided between the processors, with the GPU's higher degree of parallelism dictating that it handle the larger fraction of the volume. To improve performance, the integration kernel should be split in two. The first kernel handles the coordinate conversion for the points on the first slice (z = 0) of the volume. The second kernel uses this result, together with the position and camera deltas, to determine the correct values for the current point in the volume. The output of the first kernel is written to CPU-accessible memory and also kept in the buffer for use by the GPU in the second kernel. A copy of the input depth map also has to be made so that both processors can use it simultaneously. This formulation avoids the need to track state and does not depend on the fragile OpenCL callback mechanism.

Chapter 5
Overall Evaluation

The previous chapter discussed the optimisations investigated and presented their effect on individual kernels. This chapter discusses the overall effect of the optimisations on the application. Due to rendering issues observed with the most optimised version of the bilateral filter kernel, it was not used in the comparison; the version that unrolls the inner loop by loading a vector of length 4 was used instead.

To quantitatively establish the speedup obtained, the final implementation was compared with the initial unoptimised OpenCL version as well as with a CPU-only implementation. The CPU implementation used OpenMP to facilitate concurrent processing. To capture the overall effect of computation as well as memory accesses, wall clock time was used in the evaluation. The results are shown in Fig 5.1. Using total elapsed time as the metric, both OpenCL versions perform better than the CPU-only version. The optimisations applied resulted in speedups of 1.5 and 3.3 compared with the original OpenCL implementation and the CPU version respectively. In comparison with the CPU-only version, the OpenCL versions were better able to exploit the high degree of data parallelism in the algorithm.

Reconstruction algorithms are also evaluated in terms of frames processed per second (frame rate). The KinectFusion algorithm achieves a frame rate of 30 fps on capable desktop GPUs.
However, this value could not be matched on the Mali GPU despite the optimisations applied. As previously stated, the existence of conditional statements and the resulting SPMD nature of most kernels greatly reduced the opportunities for vectorisation, and attempts at restructuring the computations to enable the use of vector operations were not always successful.

Figure 5.1: Comparison of execution times (in seconds) of three versions of the KinectFusion algorithm.

The effect of conditional statements was most noticeable in the kernels that performed camera tracking and integration of the depth frame into the global volume. The existence of a program counter per hardware thread helped offset the loss of efficiency resulting from the inability to use vector instructions, by eliminating the work duplication that would have occurred in warp/wavefront based systems. The full caching of the data used by the integration kernel also reduced the effect of possible thread divergence arising from the distribution of the input data. Nevertheless, the reduced core count on the Mali ultimately dominated the application's performance.

As with prior work on mobile visual computing, this project investigated alternative formulations of the algorithm, focusing primarily on how several iterations can be overlapped. It may also have been worthwhile to consider less precise but faster data representations. Another direction that may have yielded positive results is the use of an alternative layout of data items in memory. For example, the 3D data structure used to represent the global scene is laid out linearly in contiguous locations in memory; a different memory allocation scheme (possibly striping) could have resulted in improved performance.

5.1 Energy Considerations

While actual power consumption measurements were not obtained, it is reasonable to expect that the optimisations applied would result in a more energy-efficient solution, because of the reduction in the number of memory accesses and clock cycles needed to process all input depth frames. While applying either of the proposed methods of cooperatively utilising the CPU and GPU would increase peak power, the consequent speedup should ensure an overall improvement in energy efficiency.

Chapter 6
Conclusions

This project investigated the extent to which the heterogeneous computing facilities present in current embedded systems make them capable of executing computationally intensive visual SLAM applications. The KinectFusion algorithm was used as a case study and its behaviour was evaluated on an ARM CPU + GPU system. Several optimisation opportunities were investigated with varying degrees of success. Vectorisation, either via explicit SIMD instructions or through vector memory accesses, yielded the greatest benefits; the nature of the algorithm, however, greatly limited the extent to which it could be applied. Also considered were means of utilising the CPU and GPU concurrently, since the original algorithm uses the GPU exclusively for most of the computation.

Experimental results show that exploiting the heterogeneous resources and the architecture of the development system resulted in speedup. The final implementation was about 1.5 times faster than the original unoptimised version and about 3.3 times faster than a CPU-only implementation. However, the reconstruction rate achieved was still well below that obtained on a desktop GPU.
A direct extension to this project would be to take greater advantage of the unified memory architecture of the Mali GPU by eliminating all remaining forms of data copying, which would involve reimplementing the buffer management routines. Alternative methods of using the CPU and GPU concurrently should also be explored.

Bibliography

[1] AMD Accelerated Parallel Processing OpenCL Programming Guide. http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf. Accessed on August 1, 2014.

[2] NVIDIA CUDA. https://developer.nvidia.com/cuda-zone. Accessed on April 4, 2014.

[3] NVIDIA OpenCL Best Practices Guide. http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf. Accessed on August 1, 2014.

[4] ARM. The Mali-T604 product page. http://www.arm.com/products/multimedia/mali-graphics-plus-gpu-compute/mali-t604.php. Accessed on April 4, 2014.

[5] ARM. Mali-T600 Series GPU OpenCL Developer Guide, Version 2.0. http://infocenter.arm.com/help/topic/com.arm.doc.dui0538f/DUI0538F_mali_t600_opencl_dg.pdf. Accessed on April 4, 2014.

[6] Kwang-Ting Cheng and Yi-Chu Wang. Using mobile GPU for general-purpose computing: a case study of face recognition on smartphones. In VLSI Design, Automation and Test (VLSI-DAT), 2011 International Symposium on, pages 1-4, April 2011.

[7] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 303-312, New York, NY, USA, 1996. ACM.

[8] Daniel Castaño Díez, Hannes Mueller, and Achilleas S. Frangakis. Implementation and performance evaluation of reconstruction algorithms on graphics processors. Journal of Structural Biology, 157(1):288-295, 2007.

[9] Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, and Jack Dongarra. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing, 38(8):391-407, 2012.

[10] Engineering and Physical Sciences Research Council (EPSRC). PAMELA: a panoramic approach to the many-core landscape - from end-user to end-device: a holistic game-changing approach. http://gow.epsrc.ac.uk/NGBOViewGrant.aspx?GrantRef=EP/K008730/1. Accessed on April 4, 2014.

[11] Jianbin Fang, A. L. Varbanescu, and H. Sips. A comprehensive performance comparison of CUDA and OpenCL. In Parallel Processing (ICPP), 2011 International Conference on, pages 216-225, Sept 2011.

[12] Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, and Dana Schaa. Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition. Newnes, 2012.

[13] Gerhard Reitmayr. Open Source Implementation of KinectFusion Using CUDA. https://github.com/GerhardR/kfusion.

[14] Johan Gronqvist and Anton Lokhmotov. Optimising OpenCL kernels for the ARM Mali-T600 GPUs. 2014.

[15] P. Harris. The Mali GPU: An Abstract Machine, Part 3 - The Shader Core. http://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core. Accessed on August 1, 2014.

[16] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera.
In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST '11, pages 559-568, New York, NY, USA, 2011. ACM.

[17] Ashfaq A. Khokhar, Viktor K. Prasanna, Muhammad E. Shaaban, and Cho-Li Wang. Heterogeneous computing: Challenges and opportunities. IEEE Computer, 26(6):18-27, 1993.

[18] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pages 225-234, Nov 2007.

[19] G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In Mixed and Augmented Reality, 2009. ISMAR 2009. 8th IEEE International Symposium on, pages 83-86, Oct 2009.

[20] Seung Eun Lee, Yong Zhang, Zhen Fang, S. Srinivasan, R. Iyer, and D. Newell. Accelerating mobile augmented reality on a handheld platform. In Computer Design, 2009. ICCD 2009. IEEE International Conference on, pages 419-426, Oct 2009.

[21] J. Leskela, J. Nikula, and M. Salmela. OpenCL embedded profile prototype in mobile device. In Signal Processing Systems, 2009. SiPS 2009. IEEE Workshop on, pages 279-284, Oct 2009.

[22] W. J. MacLean. An evaluation of the suitability of FPGAs for embedded vision systems. In Computer Vision and Pattern Recognition - Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on, pages 131-131, June 2005.

[23] Aaftab Munshi. The OpenCL Specification, Version 2.0. http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf, March 2014. Accessed on April 4, 2014.

[24] Aaftab Munshi, Benedict Gaster, Timothy G. Mattson, and Dan Ginsburg. OpenCL Programming Guide. Pearson Education, 2011.

[25] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, pages 127-136, Oct 2011.

[26] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879-899, May 2008.

[27] S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm. In 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, pages 145-152, 2001.

[28] Olaf Schenk, Matthias Christen, and Helmar Burkhart. Algorithmic performance studies on graphics processing units. Journal of Parallel and Distributed Computing, 68(10):1360-1369, 2008.

[29] Nitin Singhal, Jin Woo Yoo, Ho Yeol Choi, and In Kyu Park. Implementation and optimization of image processing algorithms on embedded GPU. IEICE Transactions on Information and Systems, 95(5):1475-1484, 2012.

[30] Google Advanced Technology and Projects (ATAP). Project Tango. https://www.google.com/atap/projecttango/. Accessed on April 4, 2014.

[31] S. Thrun. Robotic mapping: A survey. In G. Lakemeyer and B. Nebel, editors, Exploring Artificial Intelligence in the New Millennium. Morgan Kaufmann, 2002.

[32] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Computer Vision, 1998. Sixth International Conference on, pages 839-846, Jan 1998.

[33] Daniel Wagner and D. Schmalstieg. Making augmented reality practical on mobile phones, part 2. IEEE Computer Graphics and Applications, 29(4):6-9, July 2009.

[34] Guohui Wang, B. Rister, and J. R. Cavallaro.
Workload analysis and efficient OpenCL-based implementation of SIFT algorithm on a smartphone. In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, pages 759-762, Dec 2013.

[35] Guohui Wang, Yingen Xiong, J. Yun, and J. R. Cavallaro. Accelerating computer vision algorithms using OpenCL framework on the mobile GPU - a case study. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 2629-2633, May 2013.