Object Tracking using Adaptive Template Matching
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.001

Object Tracking using Adaptive Template Matching

Wisarut Chantara, Ji-Hun Mun, Dong-Won Shin, and Yo-Sung Ho
Gwangju Institute of Science and Technology (GIST), 123 Cheomdan-gwagiro, Buk-gu, Gwangju 500-712, Korea
{wisarut, jhm, dongwonshin, hoyo}@gist.ac.kr
* Corresponding Author: Yo-Sung Ho
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper

Abstract: Template matching is used in many image processing applications, and object tracking is one of the most actively researched among them. Normalized Cross Correlation (NCC) is the basic statistical approach to matching images and is widely used for template matching and pattern recognition. A template is taken from a reference image, and an image from a scene is treated as a source image. The objective is to establish the correspondence between the reference and source images; the match score measures the degree of similarity between the image and the template. A problem with NCC is its high computational cost and occasional mismatching. To deal with this problem, this paper presents an algorithm based on the Sum of Squared Differences (SSD) combined with adaptive template matching to enhance the quality of template matching in object tracking. The SSD provides low computational cost, while the adaptive template matching increases the matching accuracy. The experimental results show that the proposed algorithm is quite efficient for image matching, and its effectiveness is demonstrated in several situations in the results section.

Keywords: Template matching, Object tracking, Normalized cross correlation, Sum of squared difference, Pattern recognition

1. Introduction

Nowadays, video surveillance systems are installed worldwide at many different sites, such as airports, hospitals, banks, railway stations, and even homes. Surveillance cameras help a supervisor oversee many different areas from a single room and quickly focus on abnormal events taking place in the monitored space. Intelligent video monitoring represents a fairly large research direction with applications in many fields. At the same time, several questions arise, such as how the robustness of object tracking under occlusion can be improved. When an occlusion occurs, some objects become partially or totally invisible. As shown in Fig. 1, this phenomenon makes it difficult to localize the real target accurately.

Fig. 1. Shading of the real target by other objects.

2. Object Tracking

Object tracking in a real-time environment is a difficult task in many computer vision applications. Object tracking consists of estimating the trajectory of moving objects in the sequence of frames generated from a video [1]. Most tracking algorithms assume that the object motion is smooth with no abrupt changes, in order to simplify tracking [2]. For example, the traditional color-histogram Mean Shift (MS) algorithm considers only the object's color statistics and does not capture the object's spatial information, so when the object color is close to the background color, or the object's illumination changes, the traditional MS algorithm easily produces inaccurate tracking, or the object can be lost altogether.
Therefore, Mao et al. [3] proposed an object tracking approach that integrates two methods: a histogram-based template matching method, and the mean shift procedure used to estimate the object location. Tracking methods are also applied in traffic surveillance systems. Choi et al. [4] proposed a vehicle tracking scheme using template matching based on both scene and vehicle characteristics, including background information and the local position and size of a moving vehicle. Alternative object tracking methods can be found in Refs. [1, 5, 6]. On the other hand, these basic tracking algorithms perform poorly when another object occludes the real target. Automating computer-based object tracking is a difficult task. Therefore, to solve these kinds of problems, this paper proposes the 'template matching' object tracking method. Before explaining the algorithm, some brief concepts of 'template matching' are introduced.

3. Template Matching

Template matching is the process of finding the location of a sub-image, called a template, inside an image. A number of similarity measures can be used for this purpose. This section discusses the template matching application of matching a small image that is part of a larger image. Once a number of corresponding templates are found, their centers can be used as corresponding control points to determine the matching parameters. Template matching involves comparing a given template with windows of the same size in an image and identifying the window that is most similar to the template. The basic template matching algorithm calculates, at each position of the image under examination, a distortion function that measures the degree of similarity between the template and the image. The minimum-distortion or maximum-correlation position is then taken to locate the template in the examined image. Typical distortion measures are the Sum of Absolute Differences (SAD) [7, 8] and the Sum of Squared Differences (SSD) [9]. On the other hand, as far as template matching is concerned, the Normalized Cross Correlation (NCC) [10] is often adopted as the similarity measure. Essannouni et al. [11] proposed a fast frequency-domain algorithm to speed up SAD matching; they approximated the SAD metric by a cosine series, which can be expressed in correlation terms. Furthermore, Hel-Or and Hel-Or [12] proposed a fast template matching method based on accumulating the SSD distortion in the Walsh-Hadamard domain in the order of the associated frequency. The traditional NCC needs to compute both a numerator and a denominator, which is very time-consuming. Lewis [7] employed a sum-table scheme to reduce the computation of the denominator. Although the sum-table scheme reduces the computation of the denominator of NCC, it remains essential to simplify the computation involved in the numerator. Wei and Lai [13] proposed a fast pattern matching algorithm based on the NCC criterion that combines adaptive multilevel partitioning with a winner-update scheme to achieve a very efficient search. Alternative similarity measures can be found in Refs. [14-17].

Fig. 2. (a) Template image, (b) Source image, (c) Window sliding and matching.

Fig. 3. Matching procedure: the (static) template image T, indexed by (u, v), slides over the scene image I at the (moving) position (x, y).

As illustrated in Fig. 2, to identify a matching area, the method compares a template image against a source image by sliding it.
By sliding the template, the process measures the similarity between the template image and a region of the source image. At each location, a metric is calculated that represents how similar the patch is to that particular area of the scene image. Depending on the measure, the process then finds either the maximum or the minimum location in the resulting image.

3.1 Matching Methods

Fig. 3 shows the matching procedure. Here, u and v represent the horizontal and vertical positions within the kernel, respectively, and x and y correspond to the position of the kernel. The procedure computes a result pixel using one of the matching methods given in Eqs. (1)-(4).

1. Sum of Squared Differences (SSD)

R(x, y) = \sum_{u,v} \left( T(u,v) - I(x+u, y+v) \right)^2   (1)

2. Normalized Sum of Squared Differences (NSSD)

R(x, y) = \frac{\sum_{u,v} \left( T(u,v) - I(x+u, y+v) \right)^2}{\sqrt{\sum_{u,v} T(u,v)^2 \cdot \sum_{u,v} I(x+u, y+v)^2}}   (2)

3. Cross Correlation (CC)

R(x, y) = \sum_{u,v} T(u,v) \cdot I(x+u, y+v)   (3)

4. Normalized Cross Correlation (NCC)

R(x, y) = \frac{\sum_{u,v} T(u,v) \cdot I(x+u, y+v)}{\sqrt{\sum_{u,v} T(u,v)^2 \cdot \sum_{u,v} I(x+u, y+v)^2}}   (4)

For SSD or NSSD, the process must find the minimum value in the resulting image; for CC or NCC, it must find the maximum. This paper introduces object tracking using an adaptive template matching technique. In this technique, SSD is used to measure the similarity between the template image and the source image, the best object location is selected in the source image, and finally the template image is updated to an adapted template.

4. Proposed Methods

The objective of the proposed method is to reduce the computational cost and increase the matching accuracy. As a similarity measure, SSD is the most popular and is widely used in many applications. NCC is more robust against illumination changes than SSD; nevertheless, NCC is more time-consuming. The following three key steps are involved in implementing the proposed method:

・Detection of the object of interest,
・Tracking of the object from frame to frame,
・Updating of a suitable template.

4.1 Template Matching

The SSD is a simple algorithm for measuring the similarity between the template image (T) and sub-images of the source image (I). It takes the squared difference between each pixel in T and the corresponding pixel of the sub-image of I used for the comparison; these squared differences are summed to create a simple similarity metric. Assume a 2-D m×n template T(x, y) is matched within the source image I(x, y) of size p×q, where p > m and q > n. For each pixel location (x, y) in the image, the SSD distance is calculated as

R(x, y) = \sum_{u,v} \left( T(u,v) - I(x+u, y+v) \right)^2   (5)

The smaller the SSD distance at a particular location, the more similar the local sub-image is to the searched template; if the SSD distance is zero, the local sub-image is identical to the template. The minimum distance gives the corresponding object location (CL) in the source image. Other small distances and their locations (SLs), those below a threshold value, can also be recorded. These values are used in the next subsection. A sketch of this matching step is given below.
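The following is a minimal C sketch of the exhaustive SSD search in Eq. (5). It assumes row-major grayscale images stored as float arrays; the function names and layout are illustrative, not the authors' implementation.

```c
#include <float.h>

/* SSD (Eq. (5)) between an m-row by n-column template T and the
 * sub-image of the q-column source image I anchored at (x, y).
 * Images are row-major grayscale float arrays. */
static float ssd(const float *I, int q, const float *T,
                 int m, int n, int x, int y)
{
    float sum = 0.0f;
    for (int v = 0; v < m; ++v)
        for (int u = 0; u < n; ++u) {
            float d = T[v * n + u] - I[(y + v) * q + (x + u)];
            sum += d * d;
        }
    return sum;
}

/* Slide T over the p-row by q-column source image and return the
 * minimum-SSD location, i.e., the corresponding location (CL). */
static void match_ssd(const float *I, int p, int q,
                      const float *T, int m, int n,
                      int *cl_x, int *cl_y)
{
    float best = FLT_MAX;
    *cl_x = 0; *cl_y = 0;
    for (int y = 0; y <= p - m; ++y)
        for (int x = 0; x <= q - n; ++x) {
            float r = ssd(I, q, T, m, n, x, y);
            if (r < best) { best = r; *cl_x = x; *cl_y = y; }
        }
}
```

The SLs of Section 4.1 could be gathered in the same loop by also recording every location whose distance falls below the chosen threshold.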
4.2 The Best Object Location

This subsection proposes a method for locating the position of the object of interest (OL) in an image. The aim of this process is to identify the appropriate coordinates of the object found in the source image.

Assumption: the previous location of the object (PL) is stored in a memory buffer.

Condition 1: If the CL is nearest to the PL, then set OL = CL and assign the location flag (FL) = 1.

Condition 2: If one of the SLs is nearest to the PL, then set OL = the nearest of the SLs and assign FL = 0.

Condition 3: If neither Condition 1 nor Condition 2 is met, then set OL = PL and assign FL = 0.

Fig. 4 shows a flowchart of the method for finding the best object location following the rules above.

Fig. 4. Algorithm to find the best object location.

4.3 Adaptive Template Matching

In this subsection, the FL parameter from Section 4.2 is used to determine an appropriate template image.

Location flag = 0 means that
・the OL is one of the SLs or the PL,
・the previous template is assigned from the current template, and
・the current template is updated with the current object template.

Location flag = 1 means that
・the OL is the CL from the SSD method,
・the previous template is assigned from the current template, and
・the current template is updated with the original template.

In the result images,
・the green box is the PL position in the source image, and
・the red box is the OL position in the source image.

The overall procedure is described in more detail in Fig. 5, and a compact sketch follows this subsection.

Fig. 5. Algorithm to update the template.
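Below is a compact C sketch of the decision rules in Section 4.2 and the update rule in Section 4.3. The proximity threshold NEAR_PX and the data layout are assumptions made for illustration; the paper does not specify a threshold value.

```c
#include <math.h>
#include <string.h>

typedef struct { int x, y; } Loc;

static float loc_dist(Loc a, Loc b)
{
    float dx = (float)(a.x - b.x), dy = (float)(a.y - b.y);
    return sqrtf(dx * dx + dy * dy);
}

/* Section 4.2: choose the object location OL from the minimum-SSD
 * location CL, the other small-distance locations SLs, and the
 * previous location PL. Returns the location flag FL. */
static int best_location(Loc CL, const Loc *SLs, int n_sl, Loc PL, Loc *OL)
{
    const float NEAR_PX = 8.0f;             /* assumed threshold */
    if (loc_dist(CL, PL) <= NEAR_PX) {      /* Condition 1 */
        *OL = CL;
        return 1;                           /* FL = 1 */
    }
    int best = -1;
    float bd = NEAR_PX;
    for (int i = 0; i < n_sl; ++i)          /* Condition 2 */
        if (loc_dist(SLs[i], PL) <= bd) { bd = loc_dist(SLs[i], PL); best = i; }
    if (best >= 0) { *OL = SLs[best]; return 0; }
    *OL = PL;                               /* Condition 3 */
    return 0;                               /* FL = 0 */
}

/* Section 4.3: the previous template always takes the current one;
 * the current template is restored from the original template when
 * FL == 1, or replaced by the patch at OL when FL == 0. */
static void update_template(float *curr, float *prev, const float *orig,
                            const float *patch_at_ol, size_t npix, int fl)
{
    memcpy(prev, curr, npix * sizeof(float));
    memcpy(curr, fl ? orig : patch_at_ol, npix * sizeof(float));
}
```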
5. Experimental Results

This section shows the ability of the proposed method to fully track an object of interest. Experiments were performed to examine the matching accuracy. A template image of 27×67 pixels (Fig. 6(b)) was matched in source image sequences of size 352×288 pixels, as shown in Fig. 6(a).

Fig. 6. (a) Source image, (b) Template image.

Figs. 7-9 illustrate the conditions explained in Section 4.2. Panels (a), (b), and (c) of each figure show the result of the SSD method, the position of the tracked object, and the updated template, respectively.

Fig. 7. Condition 1 (a) SSD result, (b) Object location, (c) Update of the current template with the original template.

Fig. 8. Condition 2 (a) SSD result with the SLs, (b) Object location showing the PL in a green box and the current location in a red box, (c) Update of the current template with the current object template.

Fig. 9. Condition 3 (a) SSD result with the SLs, (b) Object location showing the current location with the PL in a green box, (c) Update of the current template with the current object template.

To illustrate the performance of the proposed algorithm, it was compared with template matching algorithms such as NSSD and NCC; the results are shown in Fig. 11. Fig. 10 shows the x and y object position curves over consecutive image frames. The proposed method outperforms the two conventional methods.

Fig. 10. Object positions over a consecutive frame sequence.

Fig. 11. Experiment result (a) NSSD, (b) NCC, (c) Ground truth [18], (d) Proposed method; the results are captured at the 1st, 10th, 35th, 64th, 177th, and 250th frames, respectively.

In addition, three further source images and their templates, of different sizes and illuminations, were processed with the proposed algorithm. Figs. 12-14 show the outcomes.

Fig. 12. Result related to Condition 3 (a) Template image, (b) Object tracking.

Fig. 13. Result related to Condition 2 (a) Template image, (b) Object tracking.

Fig. 14. Result related to Condition 1 (a) Template image, (b) Object tracking.

6. Conclusion

This paper proposed object tracking using an adaptive template matching algorithm. The adaptive template matching provides the proper object location in a source image, and the SSD uses a small number of operations for the similarity measure. Therefore, the proposed method reduces the computational cost and increases the accuracy. Based on the experimental results, the proposed method is more efficient than conventional methods such as NSSD and NCC. Furthermore, a comparison of object identification in the source image confirmed that the proposed method outperforms the conventional methods.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2013-067321).

References

[1] N. Prabhakar, V. Vaithiyananthan, A.P. Sharma, A. Singh, and P. Singhal, "Object Tracking Using Frame Differencing and Template Matching," Research Journal of Applied Sciences, Engineering and Technology, pp. 5497-5501, Dec. 2012.
[2] A. Yilmaz, O. Javed, and M. Shah, "Object Tracking: A Survey," ACM Computing Surveys, vol. 38, no. 4, Article 13, pp. 1-45, Dec. 2006.
[3] D. Mao, Y.Y. Cao, J.H. Xu, and K. Li, "Object tracking integrating template matching and mean shift algorithm," International Conference on Multimedia Technology (ICMT), pp. 3583-3586, Jul. 2011.
[4] J.H. Choi, K.H. Lee, K.C. Cha, J.S. Kwon, D.W. Kim, and H.K. Song, "Vehicle Tracking using Template Matching based on Feature Points," IEEE International Conference on Information Reuse and Integration, pp. 573-577, Sep. 2006.
[5] S. Sahani, G. Adhikari, and B. Das, "A fast template matching algorithm for aerial object tracking," International Conference on Image Information Processing (ICIIP), pp. 1-6, Nov. 2011.
[6] H.T. Nguyen, M. Worring, and R.V.D. Boomgaad, "Occlusion robust adaptive template tracking," Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV), pp. 678-683, 2001.
[7] J.P. Lewis, "Fast template matching," Vision Interface, pp. 120-123, 1995.
[8] F. Alsaade and Y.M. Fouda, "Template Matching based on SAD and Pyramid," International Journal of Computer Science and Information Security (IJCSIS), vol. 10, no. 4, pp. 11-16, Apr. 2012.
[9] J. Shi and C. Tomasi, "Good features to track," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 593-600, Jun. 1994.
[10] W.K. Pratt, "Correlation techniques of image registration," IEEE Transactions on Aerospace and Electronic Systems, vol. AES-10, pp. 353-358, May 1974.
[11] F. Essannouni, R. Oulad Haj Thami, D. Aboutajdine, and A. Salam, "Adjustable SAD matching algorithm using frequency domain," Journal of Real-Time Image Processing, vol. 1, no. 4, pp. 257-265, Jul. 2007.
[12] Y. Hel-Or and H. Hel-Or, "Real-time pattern matching using projection kernels," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 9, pp. 1430-1445, Sep. 2005.
[13] S. Wei and S. Lai, "Fast template matching based on normalized cross correlation with adaptive multilevel winner update," IEEE Transactions on Image Processing, vol. 17, no. 11, pp. 2227-2235, Nov. 2008.
[14] N.P. Papanikolopoulos, "Selection of Features and Evaluation of Visual Measurements During Robotic Visual Servoing Tasks," Journal of Intelligent and Robotic Systems, vol. 13, no. 3, pp. 279-304, Jul. 1995.
[15] P. Anandan, "A computational framework and an algorithm for the measurement of visual motion," International Journal of Computer Vision, vol. 2, no. 3, pp. 283-310, Jan. 1989.
[16] A. Singh and P. Allen, "Image flow computation: an estimation-theoretic framework and a unified perspective," CVGIP: Image Understanding, vol. 56, no. 2, pp. 152-177, Sep. 1992.
[17] G. Hager and P. Belhumeur, "Real-time tracking of image regions with changes in geometry and illumination," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 403-410, Jun. 1996.
[18] Y. Wu, J. Lim, and M.H. Yang, "Online Object Tracking: A Benchmark," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2411-2418, Jun. 2013.

Wisarut Chantara received the B.S. and M.S. degrees in Computer Engineering from Prince of Songkla University (PSU), Hat Yai, Thailand, in 2004 and 2008, respectively. From 2008 to 2013, he worked as a lecturer in the Department of Computer Engineering, PSU, Phuket Campus, Phuket, Thailand. He is currently a Ph.D. student in the School of Information and Communications at Gwangju Institute of Science and Technology (GIST), Korea. His research interests are object tracking, computer vision, light field cameras, digital signal processing, and image processing.

Ji-Hun Mun received his B.S. degree in Electric and Electrical Engineering from Chonbuk National University, Korea, in 2013. He is currently an M.S. student in the School of Information and Communications at Gwangju Institute of Science and Technology (GIST), Korea. His research interests include digital signal and image processing.

Dong-Won Shin received his B.S. degree in Computer Engineering from Kumoh National Institute of Technology, Korea, in 2013. He is currently an M.S. student in the School of Information and Communications at Gwangju Institute of Science and Technology (GIST), Korea. His research interests include 3D image processing and depth map generation.

Yo-Sung Ho received the B.S. and M.S. degrees in Electronic Engineering from Seoul National University, Seoul, Korea, in 1981 and 1983, respectively, and the Ph.D. degree in Electrical and Computer Engineering from the University of California, Santa Barbara, in 1990. He joined the Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea, in 1983. From 1990 to 1993, he was with Philips Laboratories, Briarcliff Manor, NY, where he was involved in the development of the advanced digital high-definition television system. In 1993, he rejoined the technical staff of ETRI and was involved in the development of the Korea direct broadcast satellite digital television and high-definition television systems.
Since 1995, he has been with the Gwangju Institute of Science and Technology, Gwangju, Korea, where he is currently a Professor in the Department of Information and Communications. His research interests include digital image and video coding, image analysis and image restoration, advanced coding techniques, digital video and audio broadcasting, 3-D television, and realistic broadcasting.

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.010

Accelerating Self-Similarity-Based Image Super-Resolution Using OpenCL

Jae-Hee Jun1, Ji-Hoon Choi1, Dae-Yeol Lee2, Seyoon Jeong2, Suk-Hee Cho2, Hui-Yong Kim2, and Jong-Ok Kim1
1 School of Electrical Engineering, Korea University / Seoul, South Korea {indra, risty, jokim}@korea.ac.kr
2 Electronics and Telecommunications Research Institute / Daejeon, South Korea {daelee711, jsy, shee, hykim5}@etri.re.kr
* Corresponding Author: Jong-Ok Kim
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper

Abstract: This paper proposes a parallel implementation of a self-similarity-based image SR (super-resolution) algorithm using OpenCL. The SR algorithm requires tremendous computation to search for similar patches, which becomes a bottleneck for real-time conversion from an FHD image to UHD. It is therefore imperative to accelerate the processing speed of SR algorithms. For parallelization, the SR process is divided into several kernels, and memory optimization is performed. In addition, two GPUs are used for further acceleration. The experimental results show that a GPGPU implementation can achieve a speed-up of over 140 times compared to a single-core CPU. Furthermore, it was confirmed experimentally that utilizing two GPUs can speed up the execution time proportionally, up to 277 times.

Keywords: Super-resolution, OpenCL, Parallelization, Multiple GPUs

1. Introduction

As the complexity of image processing algorithms increases, their running time on the CPU has multiplied accordingly. This has led to growing interest in GPGPU (general-purpose computing on graphics processing units). GPGPU is a technique that uses a GPU device for general computational applications typically handled by a CPU. Although a GPU has a much lower clock frequency than a CPU, it has a very large number of cores and can accelerate complex computations significantly through parallelization. Researchers have recently examined the acceleration of image processing algorithms using GPGPU techniques [8, 9, 11, 12]. GPGPU technology falls into two categories. One is CUDA, which was made public in 2007 by Nvidia, and the other is OpenCL, which was released by the Khronos Group in 2008 [7]. Some studies have compared CUDA with OpenCL, such as [10]. OpenCL is an open, general-purpose parallel framework that is renowned for its good portability. This paper proposes a parallel implementation of a self-similarity-based image SR (super-resolution) algorithm using OpenCL. The SR algorithm, proposed previously, exploits the inherent self-similarity of an image [1, 2], but it requires tremendous computation to search for similar patches.
In addition, the SR technique is applied primarily to resolution conversion from FHD (Full HD) to UHD (Ultra HD), which has only recently begun to be widely commercialized. A UHD image carries 4 times the data of an FHD image, so a common software implementation struggles to run in real time. Accordingly, accelerating the processing speed of SR algorithms is indispensable. For parallelization, the SR process is divided into several kernels, and memory optimization is performed. In addition, two GPUs are utilized for further acceleration. The experimental results show that the GPGPU implementation can speed up the SR process significantly compared to a single-core CPU.

The remainder of this paper is organized as follows. Section 2 introduces the super-resolution algorithm that was implemented with OpenCL. Section 3 explains the OpenCL implementation in detail, focusing particularly on memory optimization and support for multiple GPU devices. Section 4 presents the experimental results, and Section 5 concludes the paper.

2. Super-Resolution Algorithm

The goal of SR is to reconstruct an HR (high-resolution) image from one or more observed LR (low-resolution) images. Recently, a number of SR methods have been proposed that exploit the property of self-similarity in an image. In natural images, small patches tend to recur redundantly across scales as well as within a scale. This observation is referred to as self-similarity and has been widely exploited in image processing tasks such as denoising, edge detection, and retargeting [3]. In addition, it has been applied to learning-based image SR in recent years [6]. SR methods based on self-similarity can be classified into two categories, depending on the domain to which the self-similarity assumption is applied: the spatial domain and the LF (low-frequency)-HF (high-frequency) domain. Self-similarity in the LF-HF domain is almost equivalent to the learning-based approach, such as [4]. Unlike the traditional example-based method, however, self-similarity-based SR in the LF-HF domain does not require any prior database. Therefore, this method can reduce the computational complexity and memory consumption compared to the conventional learning-based approach.

The self-similarity-based SR method consists of three parts: the generation of image pyramids for both LF and HF components, HF reconstruction based on self-similarity, and back projection after combining the LF and HF components. First, an input LR image is decomposed into LF and HF components, which are given by

LF_{LR} = H(I_{LR})   (1)
HF_{LR} = I_{LR} - LF_{LR}   (2)

where H is a blurring operator and LF_{LR} is the blurred version of the input image. After decomposition, an HR image is initially obtained by simply interpolating the input image. This HR image can be considered the LF component of the target HR image, denoted by LF_{HR}. The scale is increased gradually to the target scale by a factor of 1.25; an incremental coarse-grained approach with a small scale factor was reported to produce a more elaborate result [5]. For the reconstruction of HF_{HR}, for every patch p_i in the image LF_{HR}, the most similar patch p_j is searched for in the LF_{LR} domain, and the corresponding HF_{LR} patch is copied to the query patch. By merging the LF_{HR} and HF_{HR} images, the HR image I_{HR} is reproduced as follows:

HF_{HR}(p_i) = HF_{LR}(p_j)   (3)
I_{HR} = LF_{HR} + HF_{HR}   (4)

In this way, hierarchical HF reconstructions in the image pyramid are repeated until the target HR is achieved. Fig. 1 presents the overall process of the SR algorithm, and a sketch of the HF reconstruction step appears below.

Fig. 1. Tasks of the self-similarity-based SR algorithm.
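The following C sketch illustrates the HF reconstruction of Eqs. (3) and (4). It assumes row-major float images and an SSD patch metric; exhaustive search is shown for clarity, whereas a practical implementation (and the kernel in Section 3) restricts the search to a local window. All names and the patch size are illustrative.

```c
#include <float.h>

#define P 5  /* assumed patch size */

/* SSD between a P-by-P patch of a (anchored at ax, ay; width aw) and a
 * patch of b (anchored at bx, by; width bw). */
static float patch_ssd(const float *a, int aw, int ax, int ay,
                       const float *b, int bw, int bx, int by)
{
    float s = 0.0f;
    for (int v = 0; v < P; ++v)
        for (int u = 0; u < P; ++u) {
            float d = a[(ay + v) * aw + ax + u] - b[(by + v) * bw + bx + u];
            s += d * d;
        }
    return s;
}

/* Eqs. (3)-(4): for every patch p_i of LF_HR, find the most similar
 * patch p_j in LF_LR, copy the co-located HF_LR patch as HF_HR(p_i),
 * and merge I_HR = LF_HR + HF_HR on the fly. */
static void reconstruct_hr(const float *lf_hr, float *i_hr, int hw, int hh,
                           const float *lf_lr, const float *hf_lr,
                           int lw, int lh)
{
    for (int y = 0; y + P <= hh; y += P)
        for (int x = 0; x + P <= hw; x += P) {
            int bx = 0, by = 0;
            float best = FLT_MAX;
            for (int j = 0; j + P <= lh; ++j)
                for (int i = 0; i + P <= lw; ++i) {
                    float s = patch_ssd(lf_hr, hw, x, y, lf_lr, lw, i, j);
                    if (s < best) { best = s; bx = i; by = j; }
                }
            for (int v = 0; v < P; ++v)       /* HF_HR(p_i) = HF_LR(p_j) */
                for (int u = 0; u < P; ++u)
                    i_hr[(y + v) * hw + x + u] =
                        lf_hr[(y + v) * hw + x + u] +
                        hf_lr[(by + v) * lw + bx + u];
        }
}
```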
3. Implementation Using OpenCL

This section presents the parallel implementation of the SR algorithm. First, the SR process is divided into several kernels, and the problem of the race condition, which occurs when searching for similar patches, is addressed. Next, memory optimization is performed based on the OpenCL memory model. Finally, we deal with how to handle multiple GPU devices for further acceleration.

A kernel is a unit of code that represents a single executing instance on a GPU; a single kernel instance corresponds to a work-item in OpenCL. When work-items and work-groups are designed, both the GPU specification and the image resolution should be considered simultaneously; here, 32×32 work-items are designated as one work-group. The kernels for the SR algorithm are designed according to tasks and synchronization points, as listed in Table 1. For further optimization of the OpenCL implementation, the individual kernel running times were measured using a performance profiler, also listed in Table 1. The kernel that finds the best-matching patch is by far the most time-consuming part of the SR process, so special optimization of this kernel is essential for better acceleration; this is discussed in Section 3.2.

Table 1. Kernel design and computational time.

Procedure           | Kernel name       | Time (µs)
LF/HF subtract      | Conv & Sub        | 6,237
Interpolation       | imgScaler         | 3,611
Find similar patch  | SSSR & Setzero    | 353,347
Patch mapping       | Add & Div         | 2,897
LF/HF merge         | Add               | 1,179
Back projection     | Conv & Add & Sub  | 4,834

3.1 Race Condition

When the 'Find similar patch' kernel is running, two or more work-items may reference the address of a single pixel at the same time. A race condition then occurs: multiple work-items simultaneously access a pixel declared in global memory. To overcome this conflict, a serialization method is adopted using the atomic commands provided by OpenCL; an atomic command ensures correctness by serializing the conflicting updates. Fig. 2 illustrates the occurrence of the race condition and its solution, and a kernel-level sketch is given below.

Fig. 2. Race condition and its solution.
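The following OpenCL C kernel sketch shows the serialization fix of Section 3.1. One work-item handles one matched patch, and overlapping patches accumulate into the same output pixels, so a plain read-modify-write on the accumulation buffer would race; atomic_add (32-bit integer atomics, core in OpenCL 1.1) serializes the conflicting updates. The kernel and argument names are illustrative, not the paper's actual kernels.

```c
/* One work-item per matched patch; 'acc' is a shared accumulation
 * image in global memory. Overlapping patches touch the same pixels,
 * so each update must be atomic to avoid lost writes. */
__kernel void add_overlapping_patches(__global int *acc,
                                      __global const int *patches,
                                      __global const int2 *pos,
                                      const int psz,
                                      const int width)
{
    int p = get_global_id(0);   /* index of this work-item's patch */
    int2 o = pos[p];            /* top-left corner; patches may overlap */

    for (int v = 0; v < psz; ++v)
        for (int u = 0; u < psz; ++u)
            atomic_add(&acc[(o.y + v) * width + (o.x + u)],
                       patches[p * psz * psz + v * psz + u]);
}
```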
3.2 Memory Optimization

Next, the memory optimization step is described. OpenCL defines a hierarchical memory model, as shown in Fig. 3. Local memory is implemented on-chip and is much faster than global memory. In addition, OpenCL provides image memory that is specialized for accessing image data; image memory requires fewer GPU cycles when data is read, so its access speed is much higher than that of global memory. In this paper, speed is optimized using two of these memory types: the image buffer and local memory. In the 'Find similar patch' kernel, similar patches are searched for in the local area surrounding a target patch, and all pixel data in the search area are loaded into local memory at once, which accelerates memory access compared to global memory.

Fig. 3. OpenCL memory model.

Local memory is divided into memory banks. To achieve high memory bandwidth, work-items should be split evenly across the banks so that the number of work-items per bank is as small as possible. If many work-items are assigned to one bank unevenly, accesses take more time because they are serialized. This phenomenon is the so-called bank conflict, and it should be avoided when local memory is used.

3.3 Support for Multiple GPUs

To utilize multiple GPU devices, a command queue must be prepared for each GPU individually. A context, on the other hand, can either be shared among multiple command queues, or each queue can have its own context. The former is a single-context-based implementation and is the typical way to work with multiple devices: when multiple devices use the same context, most OpenCL objects (e.g., kernel, program, and memory objects) are shared. As mentioned above, one command queue must be generated per GPU, so multiple command queues are needed within the same context. This approach is suitable for algorithms that use less memory. The other approach is to define a separate context per device. This is commonly used when computing on a heterogeneous device system (e.g., CPU and GPU); data is handled independently among the contexts, and system crashes can be avoided.

Fig. 4 illustrates the actual implementation of these two approaches in the case of two GPUs. Suppose that video frames are super-resolved on two GPUs. As shown in Fig. 4(a), with multiple contexts the image data is split between the two devices at the frame level: the odd frames are processed on one GPU and the even frames on the other. When a single context is adopted, each video frame is split equally into two segments, and each segment is processed independently on each GPU, as shown in Fig. 4(b). A host-side sketch of the single-context approach follows.

Fig. 4. Two ways to utilize multiple devices in OpenCL: (a) multiple contexts, (b) single context.
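Below is a minimal host-side C sketch of the single-context approach in Section 3.3: one context shared by two GPUs (so program and memory objects can be shared), with one command queue per device. Error handling is omitted for brevity, and the OpenCL 1.1 API is assumed, matching the paper's environment.

```c
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id dev[2];
    cl_uint ndev = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, dev, &ndev);

    /* one shared context for both GPUs ... */
    cl_context ctx = clCreateContext(NULL, ndev, dev, NULL, NULL, NULL);

    /* ... but one command queue per GPU */
    cl_command_queue q[2];
    for (cl_uint i = 0; i < ndev; ++i)
        q[i] = clCreateCommandQueue(ctx, dev[i], 0, NULL);

    /* enqueue half of each frame to q[0] and q[1] here; in the
     * two-context variant, alternate whole frames between devices */

    for (cl_uint i = 0; i < ndev; ++i)
        clReleaseCommandQueue(q[i]);
    clReleaseContext(ctx);
    return 0;
}
```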
4. Experimental Results

The experiments were performed on Nvidia GeForce GTX-780 devices. The host machine was an Intel Core i5-4590 CPU @ 3.30 GHz, 64-bit compatible. The OpenCL programs were developed using Visual Studio 2012 and OpenCL version 1.1. The test application super-resolves a 1920×1080 FHD image to 3840×2160 UHD using the self-similarity-based SR method of Section 2. The performance was measured as the total execution time of all kernel functions on the GPU, using Nsight, a profiling tool provided by Nvidia.

The parallel GPU implementation speeds up the running time, as confirmed in Table 2. In particular, when local-memory-based optimization is performed, the algorithm runs approximately 2 times faster than without memory optimization. When each GPU uses its own context, the execution is approximately 277 times faster than that of a single-core CPU. In the single-context approach, an image is partitioned equally into two segments, and half of the image is processed on each GPU. When super-resolving pixels near the border between the two segments, pixel data from the other segment is needed to find similar patches; therefore, in the single-context approach, more than half of the image must actually be transferred to each GPU, which leads to more transfer time between the CPU and GPU.

Table 2. Running time comparisons.

# of devices | Implementation method | Time (s)
-            | CPU (1 core)          | 25.209
1            | GPU                   | 0.388
1            | GPU (image buffer)    | 0.273
1            | GPU (local memory)    | 0.169
2            | GPU (2 contexts)      | 0.091
2            | GPU (1 context)       | 0.098

5. Conclusion

This paper proposed a parallel implementation of a complex image SR algorithm using OpenCL for speed acceleration. The global memory synchronization problem caused by the race condition was solved with atomic operations. In the SR algorithm, finding a similar patch is very time-consuming, and its speed was accelerated using local memory. In addition, two GPUs were utilized for further acceleration. The experimental results showed that the GPGPU implementation can speed up the process by more than 140 times compared to a single-core CPU. Furthermore, it was confirmed experimentally that the use of two GPUs can accelerate the execution time proportionally, by up to 277 times.

Acknowledgement

This work was supported by the ICT R&D program of MSIP/IITP [13-912-02-002, Development of Cloud Computing Based Realistic Media Production Technology].

References

[1] S.J. Park, O.Y. Lee, and J.O. Kim, "Self-similarity based image super-resolution on frequency domain," Proc. of APSIPA ASC 2013, pp. 1-4, Nov. 2013.
[2] J.H. Choi, S.J. Park, D.Y. Lee, S.C. Lim, J.S. Choi, and J.O. Kim, "Adaptive self-similarity based image super-resolution using non local means," Proc. of APSIPA ASC 2014, Dec. 2014.
[3] A. Buades, B. Coll, and J.M. Morel, "A non-local algorithm for image denoising," IEEE Int. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 60-65, June 2005.
[4] W.T. Freeman, T.R. Jones, and E.C. Pasztor, "Example-based super-resolution," IEEE Computer Graphics and Applications, vol. 22, pp. 56-65, Mar. 2002.
[5] D. Glasner, S. Bagon, and M. Irani, "Super-resolution from a single image," 12th IEEE Int. Conf. on Computer Vision, pp. 349-356, Sep. 2009.
[6] G. Freedman and R. Fattal, "Image and video upscaling from local self-examples," ACM Trans. on Graphics, vol. 30, Article No. 12, Apr. 2011.
[7] Khronos OpenCL Working Group, The OpenCL Specification Version 1.2, Khronos Group, 2012. http://www.khronos.org/opencl
[8] J. Zhang, J.F. Nezan, and J.-G. Cousin, "Implementation of motion estimation based on heterogeneous parallel computing system with OpenCL," Proc. IEEE Int. Conf. High Performance Computing and Communications, pp. 41-45, 2012.
[9] N.-M. Cheung, X. Fan, O.C. Au, and M.-C. Kung, "Video coding on multicore graphics processors," IEEE Signal Processing Magazine, vol. 27, no. 2, pp. 79-89, 2010.
[10] J. Fang, A.L. Varbanescu, and H. Sips, "A Comprehensive Performance Comparison of CUDA and OpenCL," Proceedings of the International Conference on Parallel Processing (ICPP '11), Sep. 2011.
[11] G. Wang, Y. Xiong, J. Yun, and J.R. Cavallaro, "Accelerating computer vision algorithms using OpenCL framework on the mobile GPU - a case study," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
[12] J.H. Jun, J.H. Choi, D.Y. Lee, S.Y. Jeong, J.S. Choi, and J.O. Kim, "Implementation Acceleration of Self-Similarity Based Image Super Resolution Using OpenCL," International Workshop on Advanced Image Technology (IWAIT) & International Forum on Medical Imaging in Asia (IFMIA), Jan. 2015.

Jae-Hee Jun received his B.S. degree in Electronic Engineering from Korea University, Seoul, Korea, in 2013. He is currently pursuing an M.S. degree in Electronic Engineering at Korea University. His research interests are image signal processing and multimedia communications.

Ji-Hoon Choi received his B.S. degree in Electronic Engineering from Korea University, Seoul, Korea, in 2011. He is currently pursuing a Ph.D. degree in Electronic Engineering at Korea University. His research interests are image signal processing and multimedia communications.

Dae-Yeol Lee received his B.S. and M.Eng. degrees in Electrical and Computer Engineering from Cornell University, United States, in 2012 and 2013, respectively. In August 2013, he joined the Broadcasting and Telecommunications Media Research Department, ETRI, Korea, where he is currently a researcher. His research interests include image/video processing, video coding, and computer vision.

Seyoon Jeong received his B.S. and M.S. degrees in Electronics Engineering from Inha University, Korea, in 1995 and 1997, respectively, and received his Ph.D. degree in Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 2014. He joined ETRI, Daejeon, Korea, in 1996 as a senior member of the research staff and is now a principal researcher. His current research interests include video coding, video transmission, and UHDTV.

Sukhee Cho received her B.S. and M.S. degrees in Computer Science from Pukyong National University, Busan, Korea, in 1993 and 1995, and her Ph.D. in Electrical and Computer Engineering from Yokohama National University, Yokohama, Kanagawa, Japan, in 1999. Since 1999, she has been with the Broadcasting and Telecommunications Media Research Department at ETRI, Daejeon, Korea. Her current research interests include realistic media processing technologies, such as video coding and the systems of 3DTV and UHDTV.

Hui Yong Kim received his B.S., M.S., and Ph.D. degrees from the Korea Advanced Institute of Science and Technology (KAIST), Korea, in 1994, 1998, and 2004, respectively. From 2003 to 2005, he was the leader of the Multimedia Research Team of AddPac Technology Co. Ltd. In November 2005, he joined the Broadcasting and Telecommunications Media Research Laboratory of the Electronics and Telecommunications Research Institute (ETRI), Korea, where he currently serves as the director of the Visual Media Research Section. From 2006 to 2010, he was also an affiliate professor at the University of Science and Technology (UST), Korea. From September 2013 to August 2014, he was a visiting scholar at the Media Communications Lab of the University of Southern California (USC), USA. He has made many contributions to the development of international standards, such as the MPEG Multimedia Application Format (MAF) and JCT-VC High Efficiency Video Coding (HEVC), as an active technology contributor, editor, and ad-hoc group co-chair in many areas. His current research interests include image and video signal processing and compression for realistic media applications, such as UHDTV and 3DTV.

Jong-Ok Kim (S'05-M'06) received his B.S. and M.S. degrees in Electronic Engineering from Korea University, Seoul, Korea, in 1994 and 2000, respectively, and a Ph.D. degree in Information Networking from Osaka University, Osaka, Japan, in 2006. From 1995 to 1998, he served as an officer in the Korean Air Force. From 2000 to 2003, he was with the SK Telecom R&D Center and Mcubeworks Inc.
in Korea, where he was involved in research and development on mobile multimedia systems. From 2006 to 2009, he was a researcher at ATR (Advanced Telecommunications Research Institute International), Kyoto, Japan. He joined Korea University, Seoul, Korea, in 2009, and is currently an associate professor. His current research interests are in the areas of image processing and visual communications. Dr. Kim was a recipient of a Japanese Government Scholarship during 2003-2006.

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.016

Whole Frame Error Concealment with an Adaptive PU-based Motion Vector Extrapolation for HEVC

Seounghwi Kim, Dongkyu Lee, and Seoung-Jun Oh
Department of Electronic Engineering, Kwangwoon University / Seoul, Korea {k2090535, dongkyu, sjoh}@media.kw.ac.kr
* Corresponding Author: Seounghwi Kim
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper

Abstract: Most video services are transmitted over wireless networks, where video packets are likely to be lost during transmission. For this reason, numerous error concealment (EC) algorithms have been proposed to combat channel errors. On the other hand, most existing algorithms cannot conceal a whole missing frame effectively. To resolve this problem, this paper proposes a new Adaptive Prediction-Unit-based Motion Vector Extrapolation (APMVE) algorithm to restore an entire missing frame encoded by High Efficiency Video Coding (HEVC). For each missing HEVC frame, it uses the prediction unit (PU) information of the previous frame to adaptively decide the size of the basic unit for error concealment and to provide a more accurate estimate of the motion vector in that basic unit than can be achieved by conventional methods. The simulation results showed that it is highly effective and significantly outperforms other existing frame recovery methods in terms of both objective and subjective quality.

Keywords: Wireless networks, Error concealment, Motion vector extrapolation, HEVC, Prediction unit

1. Introduction

Many users receive video services over wireless networks. As the size of video data increases steadily, high compression is essential for transfer over communication channels. Because highly compressed video data is more sensitive to channel errors than uncompressed data, errors occurring in wireless transmission can degrade the entire video quality significantly. To reduce this effect, many error concealment algorithms of various kinds have been proposed.

The HEVC standard is the newest joint video project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations, working together in a partnership known as the Joint Collaborative Team on Video Coding (JCT-VC) [1]. The main goal of HEVC is a bit-rate reduction of 50% at video quality similar to H.264/AVC. The considerably higher compression efficiency of HEVC makes packet loss potentially more damaging, which has a large impact on users of portable devices such as tablets and smartphones. In HEVC, each slice is encoded in a single network abstraction layer (NAL) unit, and each NAL unit of the HEVC stream is encapsulated in a single real-time transport protocol (RTP) packet using the RTP payload format for HEVC defined by Schierl et al. [2].
In wireless networks, multiple packet losses are more likely to occur than a single packet loss, and multiple packet losses can result in the loss of an entire frame. Conventional block-based concealment algorithms [3-5] are not adequate for whole-frame concealment, so it is essential to conceal the entire frame error. Error concealment (EC) is an essential process for recovering from channel errors, and several EC algorithms have been proposed: some use spatial EC based on interpolation [3, 6-10], while others exploit temporal error concealment using motion vector extrapolation (MVE) [11-15]. Spatial EC algorithms are inappropriate for concealing a whole frame when the channel errors spread over the entire frame. Although conventional MVE methods are suitable for concealing an entire frame, they have drawbacks in computational complexity or subjective quality.

Unlike existing video coding standards such as H.264/AVC, one of the characteristics of HEVC is its more flexible block partitioning. The block structure of HEVC supports coding tree units (CTUs); a CTU is split in a quad-tree structure into four square coding units (CUs). Within the CU level, prediction units (PUs) and transform units (TUs) are formed for prediction and transform, respectively. The PU decides whether the corresponding CU is coded in intra- or inter-prediction mode. In inter-prediction, motion estimation and compensation are performed per PU, meaning that the same object in a reference frame is likely to exist within the current PU.

This paper proposes an Adaptive PU-based Motion Vector Extrapolation (APMVE) method to increase EC performance. The proposed approach consists of two stages: 1) Error Concealment Basic Unit (ECBU) decision, and 2) ECBU-based motion vector extrapolation for error concealment. In the first stage, the MV of every PU in the correctly received previous frame is extrapolated into the lost frame, and the relationship between the extrapolated PUs and the lost CTU is used to decide the ECBU size of the lost CTU. After the ECBU decision stage, the APMVE method is repeated to estimate the best MVs of the ECBU blocks in the lost CTU. Finally, the MV of any non-overlapped block (NOB) is duplicated from the MV of the co-located PU in the correctly received frame.

The remainder of this paper is organized as follows. Section 2 briefly introduces the related work. Section 3 describes the proposed APMVE algorithm. Finally, the simulation results and conclusions are reported in Sections 4 and 5, respectively.

2. Related Work

Several EC algorithms have been proposed to recover an entire frame lost during transmission. Chen et al. proposed a Pixel-based MVE (PMVE) method to recover a whole missing frame [15]. For each pixel in the lost frame, Chen's algorithm extrapolates an MV from the MVs of the previously reconstructed frame and the next frame. The pixels in the missing frame are categorized into two classes: 1) for a pixel that is overlapped by at least one extrapolated macroblock (MB), its MV is estimated by averaging the MVs of all overlapping MBs;
2) for a pixel that is not covered by any extrapolated MB, its MV is duplicated from the MV of the same pixel in the previous frame. The drawback of this algorithm is its high computational complexity, caused by the pixel-based MV determination. Yan et al. proposed a Hybrid Motion Vector Extrapolation (HMVE) method based on PMVE, which uses the extrapolated MVs of both pixels and blocks [11]. This method first classifies the pixels in the missing frame into three classes, A, B, and C, as shown in Fig. 1. After unreliable MV candidates for each pixel are rejected by a threshold, a final MV is determined. The above-mentioned methods were implemented in H.263 and H.264/AVC, respectively.

Fig. 1. Hybrid MV extrapolation (HMVE).

Alternatively, Lin et al. proposed Motion Vector Extrapolation with Partition (MVEP) information, which considers the partition decision information from the previous frame in HEVC [12]. The extrapolated blocks (EBs) are selected at the PU size, because the PU can to some extent indicate the object information. To obtain an MV in a missing frame, Lin's algorithm looks for both a larger overlapped area and a larger partition size, giving higher priority to the MVs associated with larger objects.

3. The Proposed Algorithm

Although the MVEP algorithm takes advantage of the characteristics of HEVC, it still has a drawback to overcome: MVEP can degrade the subjective quality of a concealed frame because its error concealment basic unit (ECBU) is the CTU, which is too coarse to represent the motion of video content. The APMVE scheme, which uses the PU information of the previous frame for both the EB and the ECBU decision, is proposed to overcome this drawback. The proposed algorithm decides the ECBU of the lost CTU using the relationship between the extrapolated PUs and the lost CTU in the corrupted frame, and assigns an MV to each ECBU within the CTU.

3.1 ECBU Decision

In this algorithm, a PU size is used as the ECBU size. The PU is suitable as an EB because the same object is likely to exist within a PU that carries its motion information. To decide an ECBU, MVE is performed for each PU of the previous frame. Fig. 2 gives an example of the PU-based MVE, where the shaded CTU is being recovered. PU_{N-1}^{i} denotes the i-th PU in frame N-1, where i runs from 1 to K, and K is the total number of PUs in that frame. MV(PU_{N-1}^{i}) denotes the MV of PU_{N-1}^{i}. The block extrapolated from PU_{N-1}^{i} and its MV are denoted by EPU_{N}^{i} and MV(EPU_{N}^{i}), respectively. MV(EPU_{N}^{i}) is obtained using Eq. (2):

MV(EPU_{N}^{i}) = MV(PU_{N-1}^{i})   (2)

where i runs from 1 to K.

Fig. 2. Example of the PU-based MVE.

Some EPUs can overlap the lost CTU. In MVE, the MVs of the previous frame are extended linearly into the current frame, so the lost CTU can be overlapped by EPUs of various sizes. In this paper, one of the overlapping EPUs is used to select the ECBU size of the lost CTU: to achieve the best quality, the smallest overlapping EPU size is selected as the ECBU size. On the other hand, if the ECBU size is very small, block artifacts can appear in the concealed frame. To overcome this problem, the smallest ECBU size is restricted to M×M; in other words, if the chosen ECBU size is smaller than M×M, it is changed to M×M. Fig. 3 gives an example of the ECBU decision in the lost CTU: the lost CTU is divided into blocks of the size of EPU_{N}^{a}. Each block is denoted by ECBU_{N}^{j}, where j runs from 1 to L, and L is the total number of ECBUs. A sketch of this decision rule is given below.

Fig. 3. Example of the ECBU decision in the lost CTU.
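The following C sketch illustrates the ECBU size decision in Section 3.1: among the EPUs that overlap the lost CTU, the smallest block dimension is taken as the ECBU size, clamped to a minimum of M (16 in the paper's experiments). The Epu record and layout are illustrative assumptions.

```c
/* Illustrative EPU record: extrapolated position/size and its MV. */
typedef struct { int x, y, w, h; int mvx, mvy; } Epu;

/* Rectangle test: does EPU e overlap the csz-by-csz CTU at (cx, cy)? */
static int overlaps_ctu(const Epu *e, int cx, int cy, int csz)
{
    return e->x < cx + csz && e->x + e->w > cx &&
           e->y < cy + csz && e->y + e->h > cy;
}

/* ECBU decision: smallest overlapping EPU dimension, at least m. */
static int ecbu_size(const Epu *epu, int k, int cx, int cy, int csz, int m)
{
    int s = csz;                      /* default: the whole CTU */
    for (int i = 0; i < k; ++i)
        if (overlaps_ctu(&epu[i], cx, cy, csz)) {
            int e = epu[i].w < epu[i].h ? epu[i].w : epu[i].h;
            if (e < s) s = e;
        }
    return s < m ? m : s;             /* restrict to at least m-by-m */
}
```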
3.2 ECBU-based Motion Vector Extrapolation for Error Concealment

To decide the MV of ECBU_{N}^{j}, denoted MV(ECBU_{N}^{j}), we take advantage of the degree of overlap (the number of overlapping pixels) w_{N}^{i} between EPU_{N}^{i} and ECBU_{N}^{j}. w_{N}^{i} is given by

w_{N}^{i} = \sum_{p \in ECBU_{N}^{j}} f_{N}^{i}(p),   i = 1, ..., K,  j = 1, ..., L   (3)

where

f_{N}^{i}(p) = 1 if p \in EPU_{N}^{i}, and 0 otherwise   (4)

and p is a pixel in ECBU_{N}^{j}. The best MV(ECBU_{N}^{j}) is then obtained by selecting the MV of the EPU with the maximum weight w_{N}^{i}:

MV(ECBU_{N}^{j}) = MV(EPU_{N}^{i*})   (5)
i* = \arg\max_{i} \{ w_{N}^{i} \}   (6)

Fig. 4 gives an example of the MV(ECBU_{N}^{j}) estimation. Because EPU_{N}^{a} overlaps ECBU_{N}^{1} and ECBU_{N}^{2} the most, MV(ECBU_{N}^{1,2}) is set to MV(EPU_{N}^{a}). In the same manner, MV(ECBU_{N}^{3}) and MV(ECBU_{N}^{4}) are set to MV(EPU_{N}^{b}) and MV(EPU_{N}^{c}), respectively. If an ECBU_{N}^{j} is not overlapped by any EPU, the MV of this block is duplicated from the MV of the co-located PU in the correctly received frame. A sketch of this estimation step follows.

Fig. 4. Example of the MV estimation for ECBU_{N}^{j}.
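Below is a C sketch of Eqs. (3)-(6) for one ECBU. Because EPUs and ECBUs are axis-aligned rectangles, the pixel count w_i of Eq. (3) equals the rectangle-intersection area; the MV of the maximum-weight EPU is copied, and a non-overlapped block keeps the co-located MV passed in as the fallback. It reuses the illustrative Epu record from the previous sketch.

```c
typedef struct { int x, y, w, h; int mvx, mvy; } Epu;  /* as above */

/* MV estimation for the bsz-by-bsz ECBU at (bx, by), Eqs. (3)-(6). */
static void ecbu_mv(const Epu *epu, int k, int bx, int by, int bsz,
                    int colocated_mvx, int colocated_mvy,
                    int *mvx, int *mvy)
{
    int best_w = 0;
    *mvx = colocated_mvx;             /* NOB: duplicate co-located MV */
    *mvy = colocated_mvy;
    for (int i = 0; i < k; ++i) {
        int x0 = epu[i].x > bx ? epu[i].x : bx;
        int y0 = epu[i].y > by ? epu[i].y : by;
        int xe = epu[i].x + epu[i].w, ye = epu[i].y + epu[i].h;
        int x1 = xe < bx + bsz ? xe : bx + bsz;
        int y1 = ye < by + bsz ? ye : by + bsz;
        /* overlap area = pixel-count weight w_i of Eq. (3) */
        int w = (x1 > x0 && y1 > y0) ? (x1 - x0) * (y1 - y0) : 0;
        if (w > best_w) {             /* Eqs. (5)-(6): argmax weight */
            best_w = w;
            *mvx = epu[i].mvx;
            *mvy = epu[i].mvy;
        }
    }
}
```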
4. Simulation Results

The proposed APMVE method was simulated using the HEVC test model (HM) 11.0 and evaluated under the low-delay P configuration of the common test conditions, with some changes: the number of reference frames was 1, and asymmetric motion partitioning (AMP) was off. Two standard video sequences, "BasketballDrill" and "BQMall", were used to evaluate the performance of the proposed algorithm. Their frame rates are 50 frames/s and 60 frames/s, respectively, the frame size of both is 832×480, and the quantization parameter (QP) was 32. In addition, the intra mode in P frames was disabled when evaluating the accuracy of the APMVE results.

To decide the minimum ECBU size of the proposed method, an experiment was performed to select the optimal size among 4×4, 8×8, and 16×16. Fig. 5 shows the average PSNR between the original sequence and the concealed sequence for the different minimum ECBU sizes on the "BQMall" sequence, where a whole-frame loss occurs every 15 frames and MN denotes a minimum ECBU size of N. Although M16 produces performance similar to both M4 and M8 on the "BasketballDrill" sequence, M16 performs better than M4 and M8 on the "BQMall" sequence, as shown in Fig. 5, while providing similar results on the other sequences. Therefore, M16 is used as the minimum ECBU size.

Fig. 5. PSNR for the "BQMall" sequence versus ECBU size.

To demonstrate the effectiveness of the proposed method, the performance of the APMVE algorithm was compared with that of MVEP, frame copy (FC), and motion compensation (MC). FC means copying the previous frame for concealment. MC means that the original MVs are received correctly but the residual information has been lost; as it is much more difficult to estimate the MVs of a whole frame than of a single block, MC can be regarded as an upper bound. Fig. 6 shows the simulation results in PSNR for the different test sequences. The proposed APMVE algorithm significantly outperformed FC and MVEP.

Fig. 6. PSNR quality of the EC algorithms for the "BasketballDrill" (BBD) and "BQMall" sequences.

Figs. 7 and 8 show one frame extracted from each sequence, where (a) is the error-free frame, and (b) to (d) are the images reconstructed using MC, MVEP, and the proposed APMVE algorithm, respectively. Owing to the PU-based ECBU decision, the proposed APMVE algorithm shows good subjective quality, particularly around the edges of image objects, as highlighted by the solid circles. In the "BasketballDrill" sequence, the APMVE method conceals the basketball well, whereas MVEP generates a distorted result caused by an excessively coarse ECBU size for the CTU.

Fig. 7. Restored 130th frame of the "BasketballDrill" sequence with a QP of 32 (a) error-free frame, (b) MC frame, (c) concealed with MVEP, (d) concealed with APMVE.

Fig. 8. Restored 34th frame of the "BQMall" sequence with a QP of 32 (a) error-free frame, (b) MC frame, (c) concealed with MVEP, (d) concealed with APMVE.

Although a PU represents an object's shape, the APMVE method can still estimate a wrong MV, as shown by the dotted circles. This results from using a simple criterion, the degree of overlap, in the MV estimation, which is sensitive to outliers. This limitation can be compensated for by other methods that focus on how to find the best MV [11, 14]. In addition, the MV of a non-overlapped block duplicated from the co-located MV can be improper; such a block can be compensated for by a post-process, such as a boundary matching algorithm or a spatial error concealment algorithm.

5. Conclusion

This paper proposed a new EC algorithm, the APMVE method, to conceal a whole missing frame encoded by HEVC. The PU information is used to adaptively decide the ECBU, under the assumption that the same object is likely to exist within a PU. The MV of every PU in the correctly received previous frame is extrapolated into the lost frame, and the temporal correlation between the extrapolated PUs and the lost CTU is used to decide the ECBU. After the ECBU decision, the best MV of each ECBU in the lost CTU is estimated using the degree of overlap and used for concealment of the lost frame. The simulation results showed that the APMVE method provides better objective and subjective quality than frame copy and the MVEP method, particularly around the edges of image objects.

Acknowledgement

The present research was conducted under a Research Grant of Kwangwoon University in 2014.

References

[1] High Efficiency Video Coding (HEVC), document Rec. ITU-T H.265 and ISO/IEC 23008-2, Jan. 2013.
[2] T. Schierl, S. Wenger, Y.-K. Wang, and M.M. Hannuksela, "RTP payload format for high efficiency video coding," IETF Internet Draft, Sep. 2013.
[3] J.W. Park and S.U. Lee, "Recovery of corrupted image data based on the NURBS interpolation," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 10, pp. 1003-1008, Oct. 1999.
[4] W.-M. Lam, A.R. Reibman, and B. Liu, "Recovery of lost or erroneously received motion vectors," IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. 417-420, Apr. 1993.
[5] C. Cai and J. Zhang, "Improved error concealment method with weighted boundary matching algorithm," ICMP, pp. 384-386, Jul. 2011.
[6] P. Salama, N.B. Shroff, E.J. Coyle, and E.J. Delp, "Error concealment techniques for encoded video streams," Proc. IEEE Int. Conf. Image Processing, pp. 9-12, 1995.
[7] J. W. Park, J. W. Kim, and S. U. Lee, "DCT coefficients recovery based error concealment technique and its application to the MPEG-2 bit stream error," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 12, pp. 845-854, Dec. 1997.
[8] H. Sun and W. Kwok, "Concealment of damaged block transform coded images using projections onto convex sets," IEEE Trans. Image Process., vol. 4, no. 4, pp. 470-477, Apr. 1995.
[9] J. Baek, P. S. Fisher, and M. Jo, "An enhancement of mSCTP handover with an adaptive primary path switching scheme," in Proc. IEEE VTC Fall, pp. 1-5, Sep. 2010.
[10] H. Gharavi and S. Gao, "Spatial interpolation algorithm for error concealment," in Proc. IEEE ICASSP, pp. 1153-1156, Apr. 2008.
[11] B. Yan and H. Gharavi, "A hybrid frame concealment algorithm for H.264/AVC," IEEE Trans. Image Process., vol. 19, pp. 98-107, Jan. 2010.
[12] T. L. Lin, N. C. Yang, R. H. Syu, C. C. Liao, and W. L. Tsai, "Error concealment algorithm for HEVC coded video using block partition decisions," IEEE ICSPCC, pp. 1-5, 2013.
[13] Q. Peng, T. Yang, and C. Zhu, "Block-based temporal error concealment for video packet using motion vector extrapolation," in Proc. IEEE Int. Conf. Communications, Circuits and Systems and West Sino Expositions, vol. 1, pp. 10-14, Jun. 2002.
[14] Y. Zhang, H. Cui, and K. Tang, "Whole frame error concealment with refined motion extrapolation and prescription at encoder for H.264/AVC," Tsinghua Science and Technology, vol. 14, pp. 691-697, Dec. 2009.
[15] Y. Chen, K. Yu, J. Li, and S. Li, "An error concealment algorithm for entire frame loss in video transmission," presented at the IEEE Picture Coding Symp., 2004.

Seounghwi Kim was born in Seoul, Korea. He received his B.S. degree in Electronic Engineering from Kwangwoon University, Seoul, Korea, in 2014. He is a Master's student at Kwangwoon University. His research interests are video processing, video compression, and video coding.

Dongkyu Lee was born in Seoul, Korea. He received his B.S. and M.S. degrees in Electronic Engineering from Kwangwoon University, Seoul, Korea, in 2012 and 2014, respectively. He is a Ph.D. student at Kwangwoon University. His research interests are image and video processing, video compression, and video coding.

Seoung-Jun Oh was born in Seoul, Korea, in 1957. He received both the B.S. and M.S. degrees in Electronic Engineering from Seoul National University, Seoul, in 1980 and 1982, respectively, and the Ph.D. degree in Electrical and Computer Engineering from Syracuse University, New York, in 1988. In 1988, he joined ETRI, Daejeon, Korea, as a Senior Research Member, where he was later promoted to lead a multimedia research program. Since 1992, he has been a Professor in the Department of Electronic Engineering at Kwangwoon University, Seoul, Korea. He has been a Chairman of SC29-Korea since 2001. His research interests include image and video processing, video compression, and video coding.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.022

Analysis of the JND-Suppression Effect in Quantization Perspective for HEVC-based Perceptual Video Coding

Jaeil Kim and Munchurl Kim
Department of Information and Communications, KAIST, South Korea {jaeil1203, mkimee}@kaist.ac.kr
* Corresponding Author: Jaeil Kim, Munchurl Kim
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper

Abstract: Transform-domain JND (Just Noticeable Difference)-based processing for PVC (Perceptual Video Coding) is often performed in the quantization stage to effectively remove perceptual redundancy. This study examined the JND-suppression effects on quantized transform coefficients in HEVC (High Efficiency Video Coding). To reveal the JND-suppression effect in quantization, the properties of the floor function were used to model the quantized coefficients, and a JND-adjustment process in an HEVC-compliant PVC scheme was used to tune the JND values based on an analysis of the JND-suppression effect. In the experimental results, the bitrate reduction decreases slightly, but the PSNR and perceptual quality are improved significantly when the proposed JND adjustment process is applied.

Keywords: Perceptual video coding, Quantization, HEVC

1. Introduction

Recently, the standardization of next-generation video coding, called High Efficiency Video Coding (HEVC) [1], has been finalized with the goal of improving coding efficiency by more than two times compared to that of H.264/AVC [2]. To improve the coding performance, the following advanced techniques have newly been adopted for HEVC: a flexible Coding Unit (CU) structure with skip modes; square and asymmetric prediction units (PUs) with advanced motion vector prediction and merge modes; a residual quad-tree transform structure with 4×4, 8×8, 16×16, and 32×32 transforms; 33 directional intra prediction modes plus planar and DC modes; a Transform Skipping Mode (TSM) for 4×4 inter and intra prediction blocks; and additional in-loop filtering with the sample adaptive offset [1]. Fig. 1 shows the CU, PU and TU partition structures of HEVC.

On the other hand, such efforts toward improving the coding efficiency have led to dramatic increases in encoder complexity, and further improvement of the coding efficiency becomes increasingly difficult in a rate-distortion (MSE: Mean Squared Error) sense, even with more elaborate coding tools at the price of complexity increases. Another axis toward coding efficiency improvement is perceptual video coding (PVC), where a perceptual distortion measure can be incorporated instead of MSE. Therefore, considerable effort has been made in this direction in the sense of PVC [3-7]. In PVC, how to effectively incorporate visual perception models into the coding process is one of the most important issues in reducing the perceptual redundancy. Among the techniques for reducing the perceptual redundancy, JND is an efficient model to assess the perceptual redundancy of the human visual system (HVS) [3-7].

In recent years, several methods have focused on PVC based on the HVS [3-7]. In [3], a JND-adaptive residue pre-processor in the image domain was proposed for the motion-compensated residue signal, with a nonlinear additive model to improve the compression performance. Chen et al. [5] introduced a foveation JND model to improve the existing spatial and temporal JND models; the proposed JND model is used for macroblock quantization adjustment in H.264/AVC.
Naccari et al. [6] reported a decoder-side JND model estimation to avoid the additional signaling bits required to adjust the quantization matrices; based on this, they proposed an H.264/AVC-based perceptual video coding architecture. In [7], a perceptual visual model was applied to the earliest reference software version of HEVC, called the Test Model under Consideration (TMuC), to improve the coding performance. On the other hand, all previous methods are only applied for fixed and small-sized transforms (4×4 and 8×8). Luo et al. [8] proposed a PVC scheme determining the optimal quantization levels limited by the JND error. Nevertheless, to find the optimal quantization levels, this PVC scheme required a recursive structure that involves significant computational complexity. A previous study [9] proposed a JND-based HEVC-compliant PVC scheme with low computational complexity. On the other hand, the effects of JND suppression and quantization errors were not investigated with respect to human perception. Therefore, this study analyzes the JND suppression with a quantization error limited by the JND and modifies the proposed JND value.

The remainder of this paper is organized as follows: Section 2 briefly describes the previous JND-based HEVC-compliant PVC scheme [9] and proposes a new PVC scheme with a JND-suppression adjustment that takes the effects of JND suppression and quantization into account. Section 3 presents the performance evaluations and a discussion of the effect of the proposed scheme. Finally, this paper is concluded in Section 4.

Fig. 1. CU, PU and TU Partition Structures (CTB split into CUs; PU modes 2N×2N (SKIP), 2N×2N (inter), 2N×N (inter), N×2N (inter), N×N (inter), 2N×2N (intra), N×N (intra); TU split flags 0 and 1).

2. The Proposed HEVC-Compliant PVC Scheme

To consider the quantization and JND-suppression effects, a simple model of the quantization process was defined as follows:

$$l_{n,i,j} = \left\lfloor w_{n,i,j}/q + f \right\rfloor \qquad (1)$$

where w_{n,i,j} and l_{n,i,j} are the transformed and quantized transform coefficients at (i, j) in block n, respectively, q and f are the quantization step size and the rounding offset of the quantization process, respectively, and ⌊·⌋ is the floor function (integer operation). HEVC and H.264/AVC use a shift operation for the quantization process that differs from Eq. (1), in order to avoid the division operation, which causes a mismatch between the encoder and decoder quantization processes and has heavy computational complexity. In this paper, the simple quantization model in Eq. (1) was used to analyze the effects of the quantization process and JND suppression, for simplicity of derivation.
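As a quick illustration of Eq. (1), the following Python sketch quantizes a transform coefficient with the floor-based model and reconstructs it. The function names, the example values, and the simple reconstruction rule w_hat = l*q are illustrative assumptions, not taken from the paper.

```python
import math

def quantize(w, q, f):
    """Eq. (1): l = floor(w/q + f) with step size q and rounding offset f."""
    return math.floor(w / q + f)

def dequantize(l, q):
    """A common reconstruction rule, assumed here: w_hat = l * q."""
    return l * q

# Example with illustrative values: step size q = 8, rounding offset f = 0.5.
w = 37.0
l = quantize(w, q=8, f=0.5)        # -> 5
w_hat = dequantize(l, q=8)         # -> 40
print(l, w_hat, abs(w - w_hat))    # with f = 0.5 the error stays below q/2
```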
2.1 HEVC-Compliant Transform-domain JND Suppression

The PVC schemes can be categorized into two groups: standard-incompliant PVC schemes and standard-compliant PVC schemes. Standard-incompliant PVC schemes modify quantization and inverse quantization on the encoder side and inverse quantization on the decoder side; therefore, the related works on standard-incompliant PVC schemes require the modification of both encoders and decoders to reduce the perceptual redundancy in quantization. On the other hand, the related works on standard-compliant PVC schemes modify only the quantization on the encoder side. Standard-incompliant PVC schemes have more freedom to reduce the bitrate, but are not applicable in real applications of the video coding standard. Standard-compliant PVC schemes can be applied in actual video encoders in a manner compliant with video coding standards, such as H.264/AVC and HEVC.

Among the related works, Luo's PVC scheme [8] is a standard-compliant JND-suppression scheme that modifies the quantization process on the encoder side only. In this scheme, the quantized coefficients l'_{n,i,j} are suppressed in the direction of the JND as

$$l'_{n,i,j} = l_{n,i,j} - h_{n,i,j} \qquad (2)$$

where the tuning value h_{n,i,j} is derived from the following equation:

$$h_{n,i,j} = \max h \quad \text{s.t.} \quad 0 \le h \le l_{n,i,j},\; h \in \mathbb{Z},\; e_{n,i,j}(h)/T_{n,i,j}(h) \le JND_{n,i,j} \qquad (3)$$

In Eq. (3), T_{n,i,j}(h) is the probability of detecting distortion based on a visual saliency map, and e_{n,i,j}(h) is the error function between the transformed coefficient and the JND-suppressed reconstruction coefficient in terms of the tuning value h for the quantized coefficient. This means that Luo's PVC scheme reduces the quantized coefficient as far as the error from quantization and JND suppression allows, keeping that error below the JND value. On the other hand, Luo's PVC scheme needs to find the quantized JND-suppression level recursively, requiring high computational complexity. To solve this problem, a previous work [9] proposed a JND-suppression method as follows:

$$l'_{n,i,j} = \begin{cases} \left\lfloor \dfrac{w_{n,i,j} - JND^{scale}_{n,i,j}}{q} + f \right\rfloor, & w_{n,i,j} > JND^{scale}_{n,i,j} \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

where JND^{scale}_{n,i,j} is a scaled JND value that accounts for the normalization factor of the HEVC transform in [9]. In Eq. (4), only an additional subtraction operation is needed for the PVC-based quantization process. Nevertheless, perceptual distortion can be detected when the error from quantization and JND suppression is larger than the JND value. Therefore, these effects are considered in the following subsection.

2.2 Analysis of JND Suppression in Terms of the Quantization Effects

To consider the error effect of quantization and JND suppression and to improve the previous PVC scheme, this subsection presents the proposed HEVC-compliant JND-suppression PVC scheme, which considers the JND-suppression effect together with the quantization error, as shown in Fig. 2. The proposed JND adjustment process for a transform-domain JND model tunes the quantized level according to its JND value, quantization step size and rounding offset.

Fig. 2. Proposed PVC scheme considering the JND-suppression effect (a transform-domain JND model for non-TSM TUs and a pixel-domain JND model for TSM feed the JND adjustment placed before quantization in the HEVC encoding loop).

Firstly, a lower bound is defined as follows:

$$\hat{w}'_{n,i,j} \le \hat{w}_{n,i,j} - JND_{n,i,j} \qquad (5)$$

where ŵ_{n,i,j} and ŵ'_{n,i,j} are the reconstruction coefficient and the JND-suppressed reconstruction coefficient with JND adjustment, respectively. On the right-hand side, the reconstruction coefficient is used rather than the transformed coefficient because the purpose of the proposed PVC scheme is to reduce the difference between the JND-suppressed reconstruction coefficient and the reconstruction coefficient to a level virtually undetectable by the human visual system. From Eq. (1), Eq. (5) can be redefined as follows:

$$q\left\lfloor \frac{w_{n,i,j} - JND_{n,i,j} - \alpha}{q} + f \right\rfloor \le \hat{w}_{n,i,j} - JND_{n,i,j} \qquad (6)$$

where α is the tuning value to adjust the JND value JND_{n,i,j}. To consider the rounding offset f for both the transformed coefficient w_{n,i,j} and the JND value JND_{n,i,j} in Eq. (6), Eq. (7) can be defined as

$$q\left\lfloor \frac{w_{n,i,j} - JND_{n,i,j} - \alpha'}{q} + 2f \right\rfloor \le \hat{w}_{n,i,j} - JND_{n,i,j} \qquad (7)$$

where α' is defined as

$$\alpha' = \alpha - f \cdot q \qquad (8)$$

To find the optimal α, the following property of the floor function in [10] can be used:

$$\lfloor x + y + z \rfloor \le \lfloor x \rfloor + \lfloor y \rfloor + \lfloor z \rfloor + 2 \qquad (9)$$

for real numbers x, y, z. From Eq. (9), Eq. (7) can be rewritten as follows:

$$q\left\lfloor \frac{w_{n,i,j} - JND_{n,i,j} - \alpha'}{q} + 2f \right\rfloor \le q\left\lfloor \frac{w_{n,i,j}}{q} + f \right\rfloor - q\left\lfloor \frac{JND_{n,i,j}}{q} + f \right\rfloor + q\left\lfloor \frac{\alpha'}{q} \right\rfloor + 2q \le \hat{w}_{n,i,j} - JND_{n,i,j} \qquad (10)$$

and

$$\hat{w}_{n,i,j} - \widehat{JND}_{n,i,j} + q\left\lfloor \alpha'/q \right\rfloor + 2q \le \hat{w}_{n,i,j} - JND_{n,i,j} \qquad (11)$$

$$-\widehat{JND}_{n,i,j} + q\left\lfloor \alpha'/q \right\rfloor + 2q \le -JND_{n,i,j} \qquad (12)$$

In Eqs. (11) and (12), JND̂_{n,i,j} is the reconstructed JND value. Finally,

$$k \le \frac{\widehat{JND}_{n,i,j} - JND_{n,i,j} - 2q}{q} \qquad (13)$$

where k is defined as

$$k = \left\lfloor \alpha'/q \right\rfloor \qquad (14)$$

From Eqs. (8) and (14), the range of α can be derived from the definition of the floor function in [10] as follows:

$$k \le \alpha'/q < k + 1 \qquad (15)$$

and

$$qk + qf \le \alpha < qk + q + qf \qquad (16)$$

From Eq. (13), the optimal k can be found first as

$$k = \left\lfloor \frac{\widehat{JND}_{n,i,j} - JND_{n,i,j} - 2q}{q} \right\rfloor \qquad (17)$$

From Eq. (16), α can then be chosen as the smallest such value:

$$\alpha = \left\lceil qk + qf \right\rceil \qquad (18)$$

where ⌈·⌉ is the ceiling function. As shown in Fig. 3, the proposed JND adjustment process uses the JND value JND_{n,i,j}, the quantization step size q, and the rounding offset f to derive α in Eq. (18) as the tuning value for the JND.

Fig. 3. Proposed JND adjustment process (q and f are used to calculate α, which adjusts the scaled JND value JND_{n,i,j} through a shift operation to produce JND'_{n,i,j}).
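The following Python sketch walks through Eqs. (13)-(18) numerically: it reconstructs the JND value with the quantizer of Eq. (1), computes k and α, and applies the α-adjusted JND in a suppressed quantization in the spirit of Eqs. (4) and (6), where the effective suppression term is JND + α. The numeric values, function names, and the non-negative clamp on the adjusted JND are illustrative assumptions.

```python
import math

def jnd_adjust_alpha(jnd, q, f):
    """Tuning value alpha per Eqs. (13)-(18): reconstruct the JND with the
    quantizer of Eq. (1), pick k by Eq. (17), then alpha by Eq. (18)."""
    jnd_hat = q * math.floor(jnd / q + f)            # reconstructed JND value
    k = math.floor((jnd_hat - jnd - 2 * q) / q)      # Eq. (17)
    return math.ceil(q * k + q * f)                  # Eq. (18)

def suppressed_quantize(w, jnd, q, f):
    """JND-suppressed quantization in the spirit of Eq. (4), using the
    alpha-adjusted JND; clamping at zero is an assumption made here so
    that the adjustment never increases the quantized level."""
    jnd_adj = max(0.0, jnd + jnd_adjust_alpha(jnd, q, f))
    if w > jnd_adj:
        return math.floor((w - jnd_adj) / q + f)
    return 0

# Illustrative values only (not from the paper).
q, f, jnd, w = 8.0, 0.5, 10.0, 60.0
print(jnd_adjust_alpha(jnd, q, f))        # -> -20 (negative: suppress less)
print(suppressed_quantize(w, jnd, q, f))  # -> 8, same as plain quantization here
```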
3. Experimental Results

To evaluate the performance improvement achieved by the proposed JND-suppression adjustment scheme, the proposed model in Eqs. (17) and (18) was implemented in HM 11.0 [11], and the JND-based HEVC-compliant PVC scheme with the JND-suppression adjustment process was compared with the original HM 11.0 and with the JND-based HEVC-compliant PVC scheme without a JND-suppression adjustment in the TU blocks. For the experiments, four FHD (Full High Definition, 1920×1080) test sequences, ParkScene (PkS), Cactus (Ca), BQTerrace (BQT) and BasketballDrive (BDv), and four FWVGA (Full Wide Video Graphics Array, 832×480) test sequences, RaceHorses (RH), BQMall (BQM), PartyScene (PtS) and BasketballDrill (BDl), were used, all in the 4:2:0 color format. As the test conditions, the default Low-Delay (LD) main configuration [11] and the first 100 frames were used for all experiments. The original HM 11.0, PVC (PVC without the JND-suppression adjustment process) and PVCAP (PVC with the JND-suppression adjustment process) were compared with respect to PSNR, bitrate reduction and encoding time. The encoding times were measured on a PC with an Intel 2.5 GHz hexa-core processor and 32 GB of RAM. The bitrate reduction (ΔBitrate), the encoding time increase (ΔTime) and the PSNR difference (ΔPSNR) were calculated as

$$\Delta Bitrate = \frac{Bitrate_{pro} - Bitrate_{ref}}{Bitrate_{ref}} \times 100 \qquad (19)$$

$$\Delta Time = \frac{Time_{pro} - Time_{ref}}{Time_{ref}} \times 100 \qquad (20)$$

$$\Delta PSNR = PSNR_{pro} - PSNR_{ref} \qquad (21)$$

Table 1 presents the performance comparison in PSNR, bitrate reduction and encoding time between PVC and PVCAP under the LD configuration.

Table 1. Experimental Results.

Seq.  QP |  ΔBitrate (%)    |  ΔPSNR_Y (dB)  |  ΔTime (%)
         |  PVC     PVCAP   |  PVC    PVCAP  |  PVC    PVCAP
PkS   22 | -33.73   -32.75  | -2.75   -2.62  |  23.0   26.9
      27 | -23.12   -16.33  | -1.73   -1.36  |  32.6   36.9
      32 |  -9.00    -7.08  | -0.74   -0.58  |  38.1   40.8
Ca    22 | -46.77   -48.37  | -2.00   -1.89  |  13.0   14.9
      27 | -19.74   -13.63  | -1.32   -0.93  |  30.3   32.1
      32 |  -5.76    -5.76  | -0.71   -0.47  |  36.5   37.4
BQT   22 | -65.33   -64.13  | -2.82   -2.73  |   7.7    8.5
      27 | -40.86   -18.78  | -1.12   -0.64  |  29.2   29.4
      32 | -12.09    -7.59  | -0.45   -0.32  |  40.1   37.4
BDv   22 | -36.06   -38.09  | -1.97   -1.88  |  23.0   22.6
      27 | -15.57   -12.16  | -1.46   -1.07  |  32.2   34.1
      32 |  -6.57    -3.65  | -0.98   -0.56  |  38.3   38.9
RH    22 | -35.84   -34.71  | -4.27   -4.01  |   8.6   19.8
      27 | -25.86    -2.67  | -2.65   -1.06  |  16.9   30.8
      32 |  -5.92    -1.68  | -0.91   -0.60  |  22.7   36.2
BQM   22 | -28.34   -27.50  | -3.54   -3.32  |  10.7   22.1
      27 | -20.42    -8.99  | -2.46   -1.67  |  18.8   31.8
      32 |  -7.82    -2.97  | -1.30   -0.84  |  22.8   36.4
PtS   22 | -32.45   -30.29  | -4.61   -4.31  |   5.3   17.4
      27 | -25.33   -11.08  | -2.71   -1.72  |  16.0   28.2
      32 | -12.29    -3.49  | -1.13   -0.61  |  23.6   35.5
BDl   22 | -17.40   -23.63  | -2.44   -2.31  |  12.6   26.3
      27 | -18.00   -10.96  | -1.40   -0.97  |  19.1   34.9
      32 | -18.50    -3.54  | -0.55   -0.29  |  23.9   39.5
Avg.     | -23.66   -18.53  | -1.98   -1.59  |  22.7   29.9

As shown in Table 1, the average bitrate reduction was 23.66% for PVC and 18.53% for PVCAP, compared to the original HM 11.0. That is, PVC outperforms PVCAP with 5.13% more bitrate reduction, but PVCAP outperforms PVC with a 0.39 dB smaller PSNR loss. As also shown in Table 1, PVCAP increases the encoding time by 29.9% compared to HM 11.0, whereas PVC increases it by 22.7%, so the two have similar computational complexity.

In terms of perceptual quality, however, PVCAP improves the visual quality significantly, as shown in Figs. 4 and 5. Fig. 4(a) shows the 51st frame of the original RaceHorses sequence with an image region (some grass) in the bounding box. Figs. 4(b)-(d) show the corresponding image regions of the frames reconstructed by the original HM 11.0, PVC and PVCAP, respectively. Comparing the visual quality, the proposed PVCAP produces fewer visual artifacts in the reconstructed grass region than the PVC. Similarly, Fig. 5(a) shows the 30th frame of the original BQMall sequence with an image region (the back of a person's head) in the bounding box. Figs. 5(b)-(d) show the corresponding image regions of the frames reconstructed by the original HM 11.0, PVC and PVCAP, respectively. Comparing the visual quality, the proposed PVCAP produces fewer ringing artifacts in this region than the PVC.

Fig. 4. Comparisons of subjective quality at QP 27 (a) original 51st frame of the RaceHorses sequence, (b) the grass region reconstructed by the original HM 11.0, (c) by the PVC, (d) by the PVCAP.
Fig. 5. Comparisons of subjective quality at QP 27 (a) original 30th frame of the BQMall sequence, (b) the back-of-the-head region reconstructed by the original HM 11.0, (c) by the PVC, (d) by the PVCAP.

4. Conclusion

This study examined the effects of JND suppression in terms of the quantization effect in the transform domain, and proposed a JND value adjustment method for a JND-based HEVC-compliant PVC scheme. The bitrate reduction decreased in some QP ranges, but the PSNR and visual quality were improved significantly.

Acknowledgement

This study was supported by the IT R&D program of MSIP/IITP (10039199, A Study on Core Technologies of Perceptual Quality based Scalable 3D Video Codecs), Korea.
References

[1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[2] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the Coding Efficiency of Video Coding Standards – Including High Efficiency Video Coding (HEVC)," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1669-1684, Dec. 2012.
[3] X. Yang, et al., "Motion-compensated residue preprocessing in video coding based on just-noticeable-distortion profile," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 6, pp. 742-752, Jun. 2005.
[4] Z. Wei and K. N. Ngan, "Spatio-Temporal Just Noticeable Distortion Profile for Grey Scale Image/Video in DCT Domain," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 3, pp. 337-346, Mar. 2009.
[5] Z. Chen and C. Guillemot, "Perceptually-Friendly H.264/AVC Video Coding Based on Foveated Just-Noticeable-Distortion Model," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 6, pp. 806-819, Jun. 2010.
[6] M. Naccari and F. Pereira, "Advanced H.264/AVC-Based Perceptual Video Coding: Architecture, Tools, and Assessment," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 6, pp. 766-782, Jun. 2011.
[7] M. Naccari and F. Pereira, "Integrating a spatial just noticeable distortion model in the under development HEVC codec," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 817-820, Prague, Czech Republic, May 2011.
[8] Z. Luo, L. Song, S. Zheng, and N. Ling, "H.264/Advanced Video Control Perceptual Optimization Coding Based on JND-Directed Coefficient Suppression," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 6, pp. 935-948, Jun. 2013.
[9] J. Kim and M. Kim, "An HEVC-Compliant Perceptual Video Coding Scheme based on JND Models for Variable Block-sized Transform Kernels," IEEE Trans. Circuits Syst. Video Technol., to be published, 2015.
[10] S. B. Maurer and A. Ralston, Discrete Algorithmic Mathematics, 3rd ed., CRC Press, 2005.
[11] HM reference software 11.0.

Jaeil Kim received his B.S. degree in electrical engineering from the Korea University of Technology and Education, Cheonan, Korea, in 2006. He received a Ph.D. degree from the Department of Information and Communications Engineering of the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea. His current research interests include perceptual video coding, hardware-friendly video codec design, transform coding for video, image/video processing, and pattern recognition.

Munchurl Kim received his B.E. degree in electronics from Kyungpook National University, Daegu, Korea, in 1989, and his M.E. and Ph.D.
degrees in electrical and computer engineering from the University of Florida, Gainesville, in 1992 and 1996, respectively. After his graduation, he joined the Electronics and Telecommunications Research Institute, Daejeon, Korea, as a Senior Research Staff Member, where he led the Realistic Broadcasting Media Research Team. In 2001, he became an Assistant Professor in the School of Engineering, Information and Communications University, Daejeon, Korea. Since 2009, he has been an Associate Professor and is now a Full Professor in the Department of Electrical Engineering, KAIST, Daejeon, Korea. His current research interests include 2D/3D perceptual video coding, visual perception modeling on 2D/3D video, high dynamic range video processing, machine learning, and pattern recognition.

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.028

Performance Analysis of HEVC Parallelization Methods for High-Resolution Videos

Hochan Ryu, Yong-Jo Ahn, Jung-Soo Mok, and Donggyu Sim
Department of Computer Engineering, Kwangwoon University / Seoul, South Korea {jolie, yongjoahn, jsmok, dgsim}@kw.ac.kr
* Corresponding Author: Donggyu Sim
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper

Abstract: Several parallelization methods that can be applied to High Efficiency Video Coding (HEVC) decoders are evaluated. The market requirements for high-resolution videos, such as Full HD and UHD, have been increasing. To satisfy these requirements, several parallelization methods for HEVC decoders have been studied. Understanding these parallelization methods and comparing them objectively are crucial to the real-time decoding of high-resolution videos. This paper introduces the parallelization methods that can be used in HEVC decoders and evaluates them comparatively. The experimental results show that the average speed-up factors of tile-level parallelism, wavefront parallel processing (WPP), frame-level parallelism, and 2D-wavefront parallelism reach up to 4.59, 4.00, 2.20, and 3.16, respectively.

Keywords: HEVC decoder, Parallelism, Tile, WPP, Wavefront

1. Introduction

Video services have evolved rapidly with the demand for high-quality videos. Higher resolutions and higher bitrates are required with the evolution of video services. In the current market, there are many video streaming services, and the market requires high-resolution video, such as Full HD and 4K-UHD. To meet these requirements, the Joint Collaborative Team on Video Coding (JCT-VC) of the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) developed High Efficiency Video Coding (HEVC), the next-generation video coding standard [1]. HEVC is based on quad-tree coding tree units (CTUs) and coding units (CUs), which can be larger than the macroblocks of the previous video coding standards. In addition, prediction units (PUs) and transform units (TUs) split from CUs are used, and 8-tap filters are used to interpolate the fractional samples, compared to the 6-tap filtering of H.264/AVC [2]. Thirty-five modes are used in intra prediction [3], and sample adaptive offset (SAO) is used in the in-loop filtering process [4]. These new features and additional tools compared to the previous video coding standards increase the decoder complexity [5].
To solve these problems, many studies have focused on fast HEVC encoders and decoders [6, 7]. However, it is difficult to build real-time codecs without parallelization. Therefore, parallelization of the video decoder is essential for the real-time decoding of high-resolution videos. Currently, multi-core systems can be used for the parallelization of a video decoder. There are several conventional parallelization methods for video decoders [8, 9]. For efficient parallelization, an objective comparison of the parallel methods on a multi-core system is very important. In this paper, several parallelization methods for HEVC were analyzed and evaluated comparatively.

In the process of establishing the HEVC standard, several techniques were proposed for parallelism; finally, tiles and wavefront parallel processing (WPP) were adopted in the HEVC standard. A tile is a rectangular region of coding tree blocks (CTBs) within a particular tile column and a particular tile row in a picture [10]. There are no data dependencies among the partitioned regions; therefore, the partitioned regions can be decoded in parallel [11]. WPP is a parallelization technique that uses a separate CABAC context for each CTU line. Based on this method, the entropy decoding of each CTU line can be conducted simultaneously [12]. Another parallelization method for an HEVC decoder is frame-level parallelization, with which several pictures can be decoded in parallel. On the other hand, for the parallel decoding of each picture, the structure of the reference pictures must be considered. The parallelization methods introduced above are encoder-dependent methods, whereas the 2D-wavefront method is not encoder-dependent; for this reason, the 2D-wavefront method is the most widely used parallelization method in video decoders. Using 2D-wavefront parallelization, the pixel decoding of the entropy-decoded CTUs can be conducted simultaneously at the CTU-line level. This paper compares several parallelization methods and evaluates the parallel performance of all of them objectively.

This paper is organized as follows. Section 2 reviews the HEVC standard. In Section 3, the parallelization methods for HEVC decoders are introduced, and the advantages and drawbacks of the methods are compared. In Section 4, all the parallelization methods are evaluated in terms of parallel and RD performance. Finally, Section 5 concludes the paper.

Fig. 1. Standardization flow of HEVC.

2. HEVC Overview

HEVC is the newest video coding standard, jointly developed by the ISO/IEC MPEG and ITU-T VCEG with a focus on significantly better coding efficiency. The first version of the HEVC standard was completed in January 2013. The second version was finalized in June 2014, including range extensions, scalable coding extensions, and multi-view extensions [14]. In addition, screen content coding, which is designed for computer-generated images and videos, is being developed [15].

HEVC has new features for the efficient coding of high-resolution and high-quality videos. First, HEVC employs hierarchical variable-size blocks of CU, PU, and TU, whereas the previous video codecs use a fixed-size macroblock [16]. Because the blocks of HEVC can be larger than the previous macroblock, HEVC can compress high-resolution videos more efficiently. Fig. 2 presents one of the block partition examples of HEVC.

Fig. 2. Block partition of HEVC.
Fig. 3. Block diagram of the HEVC decoder.
A picture can be partitioned into multiple CTUs, and each CTU splits into multiple CUs whose sizes are variable. In addition, PUs and TUs are split from each CU. Secondly, a more accurate intra prediction can be performed with the 35 intra prediction modes. In the case of inter prediction, the precision of motion compensation is increased using the 8-tap interpolation filter, and merge and advanced motion vector prediction (AMVP) are used for efficient syntax in representing the inter prediction. Third, HEVC uses an additional in-loop filter called SAO to improve the subjective quality and coding efficiency at the same time. On the other hand, these additional features make the encoder and decoder more complex [17]. The main sources of HEVC decoder complexity are the entropy decoding and pixel decoding parts, as shown in Fig. 3. For high-resolution and high-bitrate videos, the entropy decoding part is even more complex [18]. Several parallelization methods can be used to reduce the complexity of the entropy decoding and pixel decoding parts.

3. Parallelization Methods of HEVC Decoder

The HEVC standard is designed to be suitable for parallel processing. Tiles and WPP were adopted for better parallel processing. In addition, HEVC supports parallel-friendly in-loop filter structures [13]. In this section, four parallelization methods that can be used in HEVC decoders for the real-time decoding of high-resolution videos are introduced and compared.

3.1 Tile-level Parallelism

Tile-level parallelism can be used for parallel decoding within a picture. A picture can be partitioned into a number of rectangular blocks composed of CTBs. Fig. 4 gives an example of tile partitioning. No data dependencies exist among the partitioned regions, and the scan order of the CTUs is changed to the tile scan order. Each tile can be decoded independently by assigning it to one of multiple cores or threads; therefore, this method is suitable for multi-core-based parallelization. The parallel performance increases with an increasing number of tile splits; however, increasing the tile splits degrades the compression performance. Table 1 lists the compression performance degradation according to the number of tiles. The BD-rate losses of 4 tiles and 8 tiles are 1.12% and 1.75% on average.

Fig. 4. Example of tile partitioning in a picture (a) No tile splits, (b) Use of 4 tile splits.

Table 1. BD rates according to the number of tiles.

Sequence          BD-rate, 4 tiles   BD-rate, 8 tiles
BasketballDrill   2.03               3.09
BQMall            0.85               1.75
PartyScene        0.60               0.80
Racehorse         0.99               1.34
Avg.              1.12               1.75

3.2 Wavefront Parallel Processing (WPP)

WPP is a parallelization method adopted in the HEVC standard. Using this method, the entropy decoding of each CTU line can be conducted in parallel. At the first CTU of each CTU line, the CABAC context model is copied from the updated CABAC context located at the upper-right CTU. If WPP is used, the efficiency of CABAC can be degraded slightly. In addition, the dependence among the CTUs should be considered. Fig. 5 shows the CABAC context model initialization positions and the dependencies of the WPP method. Because the CABAC contexts must be initialized at the first CTU of each CTU line, the entropy decoding of the CTU lines cannot all begin at the same time; a minimal sketch of the resulting line schedule is given after the discussion of frame-level parallelism below. In addition, the dependencies among the CTUs should be checked in the pixel decoding process.

Fig. 5. CABAC context model initialization positions and dependencies among CTUs.

3.3 Frame-level Parallelism

Frame-level parallelism is suitable for special-purpose applications. For frame-level parallelism, a special GOP structure should be considered in the encoder. Fig. 6 gives an example of the GOP structure for frame-level parallelism, in which all the B-pictures in a GOP refer only to the previous I-picture. Using this structure, the following B-pictures can be decoded in parallel. On the other hand, the limited reference pictures can lead to a rapid decrease in compression performance. If the number of B-pictures that refer to the previous I-picture is increased, good parallel performance can be achieved in the decoding of the B-pictures. However, the complexity of the I-picture is much higher than that of a B-picture; therefore, this parallel method cannot reduce the overall decoding time below that of the I-picture, which is decoded sequentially.

Fig. 6. Example of the GOP structure for frame-level parallelism.
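As a minimal illustration of the WPP dependency described in Section 3.2, where the first CTU of each line takes its CABAC context from the upper-right CTU so that line r can stay only two CTUs behind line r-1, the following Python sketch computes the earliest start step of every CTU under that rule. The function name and the one-step-per-CTU timing model are illustrative assumptions.

```python
def wpp_schedule(n_rows, n_cols):
    """Earliest step at which each CTU can be entropy-decoded under WPP.

    Assumes one time step per CTU and that a CTU needs its left neighbor
    and the upper-right neighbor of the line above to be finished (the
    CABAC context of a line is taken after the second CTU of the line above).
    """
    start = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1] + 1)       # left CTU finished
            if r > 0 and c + 1 < n_cols:
                deps.append(start[r - 1][c + 1] + 1)   # upper-right CTU finished
            elif r > 0:
                deps.append(start[r - 1][c] + 1)       # last CTU of a line
            start[r][c] = max(deps, default=0)
    return start

# Each CTU line can start 2 steps after the line above it.
s = wpp_schedule(4, 8)
print([row[0] for row in s])   # -> [0, 2, 4, 6]
```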
Frame-level parallelism is suitable for the special purpose applications. For the frame-level parallelism, a special GOP structure should be considered in an encoder. Fig. 6 gives an example of the GOP structure for framelevel parallelism. In Fig. 6, all the B-pictures in a GOP only refer to the previous I-picture. Using this structure, the following B-pictures can be decoded in parallel. On the other hand, a limited reference picture can lead to a rapid decrease in compression performance. If the number of Bpictures that refer to previous I-pictures are increased, good parallel performance can be achieved in the decoding parts of the B-pictures. On the other hand, complexity of the I-Picture is much higher than that of the B-picture. Therefore, this parallel method cannot beat the decoding time of the I-picture that is decoded sequentially. 3.4 2D-wavefront Parallelism 3.2 Wavefront Parallel Processing (WPP) WPP is an adopted parallelization method of the HEVC standard. Using this method, the entropy decoding of each CTU line can be conducted in parallel. At the first CTU of 2D-wavefront parallelism is the most widely used parallel method in video decoders. This method is an encoder independent method compared to the above three methods. Using 2D-wavefront parallelism, the pixel 31 IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015 Fig. 7. Dependencies among the neighboring CTUs in pixel decoding. Fig. 9. Distribution of the CTUs that can be decoded at the same time. Table 2. Comparisons of the four parallel methods. Compression performance degradation Parallelization method Encoder dependency Tile-level parallelism ○ ○ WPP ○ ○ Frame-level parallelism ○ ○ 2D-wavefront parallelism None None Fig. 8. Problem of 2D-wavefront parallelism. Table 3. Experimental conditions. decoding process can be conducted simultaneously on a CTU line level. Like the WPP method, the dependencies among the CTUs should be considered. Fig. 7 presents the neighboring CTUs that should be decoded before decoding the current CTU. Fig. 8 shows the problem of 2D-wavefront parallelism. Because of the dependencies of the neighboring CTUs, idle threads can occur in the prologue and epilogue parts in a picture. The maximum number of CTUs (P_CTU) that can conduct pixel decoding in parallel is calculated by Eq. (1). 〖nCTU〗_h is the number of CTUs in the picture height and 〖nCTU〗_w is the number of CTUs in the picture width. PCTU nCTU w ⎧ + , 1,ٛ h ⎪⎪min(nCTU 2 =⎨ nCTU w , (nCTU,h ٛ⎪ min ⎪⎩ 2 if nCTU w is odd Processor Intel Xeon CPU E5-2687W v2 3.40GHz OS Windows Server 2008 R2 Standard Compiler Intel C++ Compiler XE 13.0 Parallel tool OpenMP 2.0 Resolution FHD (1920×1080), 4K-UHD (3840×2160) Coding mode Low Delay QP 22, 27, 32, 37 dependent. On the other hand, the 2D-wavefront parallelization method has no degradation of compression performance and it is not dependent on the encoders. 4. Performance Evaluation otherwise (1) In a Full HD resolution, there are 30×17 CTUs that the size is 64×64 in a picture. In this case, the number of CTUs that can be decoded simultaneously are represented in Fig. 9. In the prologue and epilogue parts of a picture, the CTUs that can be decoded at the same time are insufficient. For this reason, the threads can be idle in these parts. 3.5 Comparisons of parallelism As introduced above, all the parallelization methods have advantages and disadvantages. Table 2 compares the four parallel methods. 
4. Performance Evaluation

For a performance evaluation of the four parallelism methods, all the parallelization methods were implemented on an ANSI C-based private decoder developed from the HM 12.0 decoder [19]. For the parallel implementation, OpenMP was used with Microsoft Visual Studio 2012 [20]. Table 3 lists the experimental conditions.

Table 3. Experimental conditions.

Processor      Intel Xeon CPU E5-2687W v2 3.40GHz
OS             Windows Server 2008 R2 Standard
Compiler       Intel C++ Compiler XE 13.0
Parallel tool  OpenMP 2.0
Resolution     FHD (1920×1080), 4K-UHD (3840×2160)
Coding mode    Low Delay
QP             22, 27, 32, 37

The parallel performance was measured by Eq. (2), where Decoding time_ref is the decoding time of the HM 12.0 decoder without any parallelization and Decoding time_prop is the decoding time of the parallelized private decoder with a given parallelization method:

$$Speed\text{-}up = \frac{Decoding\ time_{ref}}{Decoding\ time_{prop}} \qquad (2)$$

Table 4 lists the speed-up factors of tile-level parallelism. The speed-up factors increase with an increasing number of tiles. The results show that the average speed-up factors with 6 tiles are 4.59 and 4.56 for FHD and 4K-UHD, respectively.

Table 4. Speed-up factors of tile-level parallelism.

                                Speed-up according to the number of tiles
Resolution  Sequence            2      4      6
FHD         BasketballDrive     2.81   3.61   4.45
            BQTerrace           2.37   4.08   4.80
            Cactus              3.04   3.68   4.26
            Kimono1             3.00   4.14   4.65
            ParkScene           3.12   3.84   4.78
            Avg.                2.87   3.87   4.59
4K-UHD      Beauty              2.89   4.15   4.56
            Jockey              3.06   4.02   4.87
            ShakeNDry           2.97   3.64   4.26
            Avg.                2.97   3.94   4.56

Table 5 lists the speed-up factors of WPP. The results show that the speed-up factors for the 4K-UHD sequences are higher than those for the FHD sequences. The speed-up factors are influenced by the difference in entropy decoding complexity between the FHD and 4K-UHD sequences: the entropy decoding part of the 4K-UHD sequences is more complex than that of the FHD sequences. Therefore, WPP, which parallelizes entropy decoding, showed reasonably good parallel performance for the 4K-UHD sequences.

Table 5. Speed-up factors of WPP.

                                Speed-up according to the number of threads
Resolution  Sequence            2      4      6
FHD         BasketballDrive     2.47   3.07   3.40
            BQTerrace           2.09   2.93   3.07
            Cactus              2.49   2.95   3.73
            Kimono1             2.10   2.80   2.82
            ParkScene           1.98   2.88   2.85
            Avg.                2.23   2.93   3.17
4K-UHD      Beauty              2.71   3.58   4.60
            Jockey              2.67   3.66   4.11
            ShakeNDry           2.11   2.63   3.28
            Avg.                2.50   3.29   4.00

Table 6 presents the results of frame-level parallelism. The speed-up factors are not large; these results are due to the decoding overhead of the I-picture.

Table 6. Speed-up factors of frame-level parallelism.

                                Speed-up according to GOP size
Resolution  Sequence            4      6      8
FHD         BasketballDrive     2.12   2.21   2.33
            BQTerrace           2.01   2.13   2.37
            Cactus              1.96   2.06   2.28
            Kimono1             1.71   1.83   1.93
            ParkScene           1.82   1.94   2.07
            Avg.                1.92   2.03   2.20
4K-UHD      Beauty              1.68   1.85   1.89
            Jockey              1.71   1.83   1.85
            ShakeNDry           1.67   1.84   1.91
            Avg.                1.69   1.84   1.88

Table 7 lists the speed-up factors of 2D-wavefront parallelism. The mean speed-up factors for the FHD videos were higher than those for the 4K-UHD videos.

Table 7. Speed-up factors of 2D-wavefront parallelism.

                                Speed-up according to the number of threads
Resolution  Sequence            2      4      6
FHD         BasketballDrive     2.75   3.43   3.47
            BQTerrace           3.00   3.61   3.67
            Cactus              2.13   2.50   2.47
            Kimono1             2.81   3.50   3.57
            ParkScene           2.17   2.62   2.63
            Avg.                2.57   3.13   3.16
4K-UHD      Beauty              2.04   1.94   2.03
            Jockey              2.36   3.30   2.93
            ShakeNDry           2.03   2.39   2.35
            Avg.                2.14   2.54   2.44
This means that the parallel performance of 2D-wavefront parallelism is limited by the entropy decoding part, which is not a parallel region in the 2D-wavefront approach.

5. Conclusion

This paper introduced several parallel methods for HEVC decoders and evaluated them for the real-time decoding of high-resolution videos. Tile-level parallelism, WPP, frame-level parallelism, and 2D-wavefront parallelism were implemented on the ANSI C-based private decoder, and the parallel performance of the methods was compared. For the FHD test sequences, the mean speed-up factors of tile-level parallelism, WPP, frame-level parallelism, and 2D-wavefront parallelism were found to be up to 4.59, 3.17, 2.20, and 3.16, respectively. In addition, the speed-up factors for the 4K-UHD sequences were observed to be up to 4.56, 4.00, 1.88, and 2.54 for tile-level parallelism, WPP, frame-level parallelism, and 2D-wavefront parallelism, respectively.

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A2A1A11052210).

References

[1] High Efficiency Video Coding, Rec. ITU-T H.265 and ISO/IEC 23008-2, Jan. 2013.
[2] J. Lainema, F. Bossen, W. Han, J. Min, and K. Ugur, "Intra Coding of the HEVC Standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1792-1801, Dec. 2012.
[3] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1648-1667, Dec. 2012.
[4] C. Fu, E. Alshina, A. Alshin, Y. Huang, C. Chen, C. Tsai, C. Hsu, S. Lei, J. Park, and W. Han, "Sample adaptive offset in the HEVC standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1755-1764, Dec. 2012.
[5] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the coding efficiency of video coding standards – Including High Efficiency Video Coding (HEVC)," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1668-1683, Dec. 2012.
[6] Y. Ahn, T. Hwang, D. Sim, and W. Han, "Implementation of fast HEVC encoder based on SIMD and data-level parallelism," EURASIP Journal on Image and Video Processing, Mar. 2014.
[7] E. Ryu, J. Nam, S. Lee, H. Jo, and D. Sim, "Sample adaptive offset parallelism in HEVC," International Conference on Green and Human Information Technology (ICGHIT 2013), Hanoi, Vietnam, Feb. 27 - Mar. 1, 2013.
[8] H. Jo and D. Sim, "Hybrid Parallelization for HEVC Decoder," Image and Signal Processing (CISP), vol. 1, pp. 170-175, Dec. 2012.
[9] E. Baaklini, H. Sbeity, and S. Niar, "H.264 Macroblock Line Level Parallel Video Decoding on Embedded Multicore Processors," Digital System Design (DSD), pp. 902-906, Sep. 2012.
[10] B. Bross, W.-J. Han, G. J. Sullivan, J.-R. Ohm, and T. Wiegand, "High Efficiency Video Coding (HEVC) Text Specification Draft 10 (for FDIS & Last Call)," document JCTVC-L1003_v34, Geneva, CH, Jan. 2013.
[11] K. Misra, A. Segall, M. Horowitz, S. Xu, A. Fuldseth, and M. Zhou, "An Overview of Tiles in HEVC," IEEE Journal of Selected Topics in Signal Processing, vol.
7, pp. 969-977, Nov. 2013.
[12] C. Chi, M. Alvarez-Mesa, B. Juurlink, G. Clare, F. Henry, S. Pateux, and T. Schierl, "Parallel Scalability and Efficiency of HEVC Parallelization Approaches," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1827-1838, Dec. 2012.
[13] A. Norkin, G. Bjøntegaard, A. Fuldseth, M. Narroschke, M. Ikeda, K. Andersson, M. Zhou, and G. Van der Auwera, "HEVC deblocking filter," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1746-1754, Dec. 2012.
[14] J. Boyce, et al., "Draft high efficiency video coding (HEVC) version 2, combined format range extensions (RExt), scalability (SHVC), and multi-view (MV-HEVC) extensions," JCTVC-R1013, 18th JCT-VC meeting, Sapporo, JP, June 2014.
[15] ITU-T Q6/16 and ISO/IEC JTC1/SC29/WG11 document N14175, "Joint Call for Proposals for Coding of Screen Content," San Jose, USA, Jan. 2014.
[16] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560-576, Jul. 2003.
[17] F. Bossen, B. Bross, K. Sühring, and D. Flynn, "HEVC complexity and implementation analysis," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1685-1696, Dec. 2012.
[18] H. Jo and D. Sim, "Bitstream decoding processor for fast entropy decoding of variable length coding-based multiformat videos," Optical Engineering, vol. 53, no. 6, June 2014.
[19] HM-12.0, https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-12.0
[20] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46-55, Jan. 1998.

Hochan Ryu received his B.S. degree in Computer Engineering from Kwangwoon University, Seoul, Republic of Korea, in 2013. Currently, he is working toward an M.S. degree in computer engineering at the same university. His current research interests include video coding and computer vision.

Yong-Jo Ahn was born in Jeju Province, Republic of Korea, in 1984. He received his B.S. and M.S. degrees in Computer Engineering from Kwangwoon University, Seoul, Korea, in 2010 and 2012, respectively. He is now working toward a Ph.D. degree at the same university. His current research interests include high-efficiency video compression techniques, parallel processing for video coding, and multimedia systems.

Jung-Soo Mok received his B.S. degree in Computer Engineering from Kwangwoon University, Seoul, Republic of Korea, in 2013. Currently, he is working toward an M.S. degree in computer engineering at the same university. His current research interests include video coding and computer vision.

Donggyu Sim received his B.S. and M.S. degrees in Electronic Engineering from Sogang University, Korea, in 1993 and 1995, respectively. He also received his Ph.D. degree at the same university in 1999. He was with Hyundai Electronics Co., Ltd. from 1999 to 2000, where he was involved in MPEG-7 standardization. He was a senior research engineer at Varo Vision Co., Ltd., working on MPEG-4 wireless applications from 2000 to 2002.
He worked for the Image Computing Systems Lab. (ICSL) at the University of Washington as a senior research engineer from 2002 to 2005, where he researched ultrasound image analysis and parametric video coding. Since 2005, he has been with the Department of Computer Engineering at Kwangwoon University, Seoul, Korea. In 2011, he joined Simon Fraser University as a visiting scholar. He was elevated to IEEE Senior Member in 2004. He is one of the main inventors of many essential patents licensed to MPEG-LA for the HEVC standard. His current research interests include video coding, video processing, computer vision, and video communication.

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.035

Deep Convolution Neural Networks in Computer Vision: a Review

Hyeon-Joong Yoo
Department of IT Engineering, Sangmyung University / Chonan, Choongnam-Do 330-720 Korea [email protected]
* Corresponding Author:
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
* Review Paper: This paper reviews the recent progress, possibly including previous works, in a particular research topic, and has been accepted by the editorial board through the regular reviewing process.

Abstract: Over the past couple of years, tremendous progress has been made in applying deep learning (DL) techniques to computer vision. In particular, deep convolutional neural networks (DCNNs) have achieved state-of-the-art performance on standard recognition datasets and tasks such as the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). Among them, the GoogLeNet network, a radically redesigned DCNN based on the Hebbian principle and scale invariance, set the new state of the art for classification and detection in the ILSVRC 2014. Since there exist various deep learning techniques, this review paper focuses on techniques directly related to DCNNs, especially those needed to understand the architecture and techniques employed in the GoogLeNet network.

Keywords: Deep learning, Convolutional neural network, ImageNet, Computer vision, GoogLeNet

1. Introduction

Over the last couple of years, deep learning techniques have made tremendous progress in computer vision, especially in the field of object recognition. Deep learning is a family of methods that uses deep architectures to learn high-level feature representations. (Several much lengthier definitions can be found in [1].) A network with more than one hidden layer is considered a deep architecture. The essence of deep learning is to compute hierarchical features or representations of the observational data, where the higher-level features or factors are defined from lower-level ones. The family of deep learning methods has been growing increasingly richer, encompassing neural networks, hierarchical probabilistic models, and a variety of unsupervised and supervised feature learning algorithms. Active researchers in this area include those at the University of Toronto, New York University, University of Montreal, Stanford University, Google, Baidu, Microsoft Research, and Facebook, just to name a few.
These researchers have demonstrated empirical successes of deep learning in diverse applications in computer vision, phonetic recognition, voice search, conversational speech recognition, speech and image feature coding, robotics, and so on. In this paper, we focus on the architecture and training methods of deep networks, specifically deep convolutional neural networks.

1.1 History of Neural Networks

Historically, the concept of deep learning originated from artificial neural network research. The history of artificial neural networks is filled with individuals from many different fields, including psychologists and physicists. The early model of an artificial neuron was introduced by McCulloch and Pitts in 1943 [2]. Their work is often acknowledged as the origin of the neural network field. They showed that networks of artificial neurons could, in principle, compute any arithmetic or logical function. In 1949, Hebb [3] proposed a mechanism for learning in biological neurons and introduced the Hebb learning rule. The Hebb rule is often paraphrased as "Neurons that fire together wire together." If the set of input patterns used in training is mutually orthogonal, the association can be learned by a two-layer pattern associator using Hebbian learning. However, if the set of input patterns is not mutually orthogonal, interference may occur and the network may not be able to learn the associations, resulting in the low absolute capacity of the Hebb rule.

The first practical application of artificial neural networks came with the invention of the perceptron network and its associated learning rule by Rosenblatt in the late 1950s [4]. With the perceptron network, he demonstrated its ability to perform pattern recognition. At about the same time, Widrow and Hoff [5] introduced a new learning algorithm, called the Widrow-Hoff learning rule, and used it to train adaptive linear neural networks, which were similar to Rosenblatt's perceptron. Unfortunately, both networks suffered from the same inherent limitations, which were widely publicized in a book by Minsky and Papert [6]. They identified two key issues with the computational machines that processed neural networks. The first was that single-layer neural networks were incapable of solving the exclusive-or (XOR) problem. The second was that computers were not sophisticated enough to effectively handle the long run time required by large neural networks. Many people, influenced by Minsky and Papert, believed that further research on neural networks was a dead end. The book caused many researchers to leave the field and nearly killed neural network research for more than a decade.

However, some important work continued during the 1970s. Kohonen [7] and Anderson [8] independently and separately developed new neural networks that could act as memories. Grossberg [9] was also very active during this period in the investigation of self-organizing networks. During the 1980s, research in neural networks increased dramatically. The impediments of the 1960s were overcome, and, in addition, important new concepts were introduced that were responsible for the rebirth of neural networks. Firstly, in 1982, the physicist Hopfield used statistical mechanics to explain the operation of a certain class of recurrent network, which could be used as an associative memory [10].
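As a minimal illustration of such a Hopfield-style associative memory, combining Hebbian storage with sign-threshold recall, here is a toy Python sketch. The pattern, the network size, and the synchronous update schedule are illustrative assumptions; recall converges reliably only for small numbers of stored patterns.

```python
import numpy as np

def hopfield_store(patterns):
    """Hebbian storage rule for a Hopfield associative memory:
    W is the sum of outer products of the stored +/-1 patterns,
    with a zero diagonal (no self-connections)."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0.0)
    return W / n

def hopfield_recall(W, probe, steps=10):
    """Synchronous sign updates; settles into a stored pattern for small loads."""
    s = probe.copy()
    for _ in range(steps):
        s = np.where(W @ s >= 0, 1, -1)
    return s

# Store one 8-unit pattern and recall it from a corrupted probe.
p = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = hopfield_store(p[None, :])
noisy = p.copy(); noisy[0] = -noisy[0]; noisy[3] = -noisy[3]
print(np.array_equal(hopfield_recall(W, noisy), p))   # -> True
```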
Another key development was the backpropagation algorithm, a generalized form of the delta rule, for training multilayer perceptron networks. Although the derivation procedure had previously been published by Werbos in [11], the most influential publication of the backpropagation algorithm was by Rumelhart and McClelland [12]. They showed that this method works for the class of semilinear activation functions (non-decreasing and differentiable). It was the answer to the criticisms Minsky and Papert had made in 1969. These new developments reinvigorated the field of neural networks.

One long-term goal of machine learning research is to produce methods that are applicable to highly complex tasks, such as perception (vision, audition), reasoning, intelligent control, and other artificially intelligent behaviors. In order to progress toward this goal, algorithms must be discovered that can learn highly complex functions with minimal need for prior knowledge and with minimal human intervention. However, shallow architectures can be very inefficient in terms of the required number of computational elements and examples. For deep architectures, on the other hand, backpropagation has specific properties that can be problematic, such as slow learning (the further the weights are from the output layer, the slower backpropagation learns) and overfitting. Researchers tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they did not have much luck: the networks would learn, but very slowly, and in practice often too slowly to be useful.

Building on Rumelhart et al. [12], LeCun et al. [13] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs). CNNs saw commercial use with [14], but then fell out of fashion with the rise of support vector machines and other, much simpler methods such as linear classifiers. Promising new methods like Hinton et al.'s [15] have been developed that enable learning in deep neural nets, and in 2012, Krizhevsky et al. [16] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC, http://www.image-net.org/). These techniques have enabled much deeper (and larger) networks to be trained, and people now routinely train networks with many hidden layers. It turns out that these perform far better on many problems than shallow neural networks [16-18].

The rest of this paper is organized as follows. The next section describes, in chronological order, important historical events and techniques needed to understand the architecture and training methods of the current state-of-the-art deep convolutional neural network, GoogLeNet. The GoogLeNet network is then investigated more deeply in Section 3. Finally, we conclude this paper in Section 4.

2. Deep Convolution Neural Networks

Shallow architectures have been shown to be effective in solving simple or well-constrained problems, but their limited modeling and representational power can cause difficulties when dealing with more complicated real-world applications involving natural signals such as natural images and visual scenes. Human information processing mechanisms suggest the need for deep architectures for extracting complex structure and building internal representations from rich sensory inputs.
However, training deep neural networks is hard, as backpropagated gradients quickly vanish exponentially in the number of layers. A set of techniques has been developed that enables learning in deep neural networks. People now routinely train networks with many hidden layers, and it turns out that these perform far better on many problems than shallow neural networks. Deep neural networks have finally attracted widespread attention. In this section, we describe some important techniques and events necessary to understand the deep convolutional neural network called GoogLeNet. Convolutional networks are an attempt to solve the dilemma between small networks that cannot learn the training set and large networks that seem overparameterized.

2.1 Neocognitron

A hierarchical multilayered artificial neural network called the neocognitron was proposed by Fukushima in 1980 [19]. Local features in the input are integrated gradually and classified in the higher layers. Fig. 1 shows its architecture. Later, in 1989, the idea of local feature integration was adapted in LeCun et al.'s convolutional neural networks [13].

Fig. 1. Hierarchical structure of the neocognitron network (Adapted from [19]).

The neocognitron was specifically proposed to address the problem of handwritten character recognition. It is a hierarchical network with many pairs of layers corresponding to simple (S layer) and complex (C layer) cells, with a very sparse and localized pattern of connectivity between layers. The number of planes of simple cells and of complex cells within a pair of S and C layers is the same; these planes are paired, and the complex-plane cells process the outputs of the simple-plane cells. The simple cells are trained so that the response of a simple cell corresponds to a specific portion of the input image. If the same part of the image occurs with some distortion, in terms of scaling or rotation, a different set of simple cells responds to it. The complex cells fire to indicate that some simple cell they correspond to has fired. While simple cells respond to what is in a contiguous region of the image, complex cells respond on the basis of a larger region. As the process continues to the output layer, the C-layer component of the output layer responds, corresponding to the entire image presented at the input layer in the beginning. The neocognitron, however, lacked a supervised training algorithm.

2.2 CNN

Building on Rumelhart et al. [12], LeCun et al. [13] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks, a class of models that extend the neocognitron. LeCun et al. presented a convolutional neural network consisting of 3 hidden layers, including 2 convolutional layers, and 64,660 connections. They applied it to handwritten zip code recognition. Fig. 2 shows its architecture.

Fig. 2. Architecture of CNN in 1989 (Adapted from [13]).

Unlike previous works, the network was directly trained on a low-level representation of data that had minimal preprocessing (as opposed to elaborate feature extraction), thus demonstrating the ability of backpropagation networks to deal with large amounts of low-level information. The first hidden layer is composed of several planes called feature maps. Distinctive features of an object can appear at various locations on the input image.
It seems judicious to have all units in a plane share the same set of weights, thereby detecting the same feature at different locations. Thus, the detection of a particular feature at any location on the input can be easily done using the "weight sharing" technique. Since the exact position of the feature is not important, the feature maps need not have as many units as the input. However, units do not share biases. Due to the weight sharing, the network has few free parameters. For example, though layer H1 has 19,968 connections, it has only 1,068 free parameters (768 biases plus 25 times 12 feature kernels). The connection scheme between H1 and H2 is slightly more complicated: each unit in H2 combines local information coming from 8 of the 12 different feature maps in H1. Its receptive field is composed of eight 5-by-5 neighborhoods centered on units that are at identical positions within each of the eight maps. Once again, all units in a given map are constrained to have identical weight vectors. Layer H3 has 30 units and is fully connected to H2. The output layer is also fully connected to H3. In summary, the network has 1,256 units, 64,660 connections, and 9,760 independent parameters. The target values for the output units were chosen within the quasilinear range of the sigmoid. This prevents the weights from growing indefinitely and prevents the output units from operating in the flat spot of the sigmoid. During each learning experiment, the patterns were repeatedly presented in a constant order. The weights were updated according to the stochastic gradient or "on-line" procedure. From empirical study (supported by theoretical arguments), the stochastic gradient was found to converge much faster than the true gradient, especially on large, redundant databases. It also finds solutions that are more robust.

2.3 LeNet-5

Convolutional networks combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights (or weight replication), and spatial or temporal sub-sampling. Fig. 3 shows the architecture of LeNet-5, a convolutional neural network proposed by LeCun et al. in 1998 [14], which was used commercially for reading bank checks. A convolutional layer is composed of several feature maps (with different weight vectors), so that multiple features can be extracted at each location. The receptive fields of contiguous units in a feature map are centered on correspondingly contiguous units in the previous layer; therefore, the receptive fields of neighboring units overlap. All the units in a feature map share the same set of weights and the same bias, so they detect the same feature at all possible locations on the input. A sequential implementation of a feature map would scan the input image with a single unit that has a local receptive field, and store the states of this unit at corresponding locations in the feature map. This operation is equivalent to a convolution, followed by an additive bias and squashing function, hence the name convolutional network. An interesting property of convolutional layers is that if the input image is shifted, the feature map output will be shifted by the same amount, but will be left unchanged otherwise. This property is at the basis of the robustness of convolutional networks to shifts and distortions of the input.
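The weight sharing and shift behavior just described can be demonstrated directly. Below is a minimal NumPy sketch (ours, not the paper's network): one shared 3x3 kernel is scanned over an image, and shifting the input shifts the feature map by the same amount.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Scan one shared kernel over the image ("weight sharing"):
    every output unit uses the same 3x3 weights."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(1)
img = rng.normal(size=(8, 8))
k = rng.normal(size=(3, 3))

fmap = conv2d_valid(img, k)
shifted = np.roll(img, shift=1, axis=1)      # shift the input right by 1 pixel
fmap_shifted = conv2d_valid(shifted, k)

# Away from the border, the feature map is shifted by the same amount.
print(np.allclose(fmap[:, :-1], fmap_shifted[:, 1:]))  # -> True
```

The kernel plays the role of one feature map's shared weight vector; a real layer would simply hold several such kernels plus per-unit biases.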
A large degree of invariance to geometric transformations of the input can be achieved with this progressive reduction of spatial resolution, compensated by a progressive increase in the richness of the representation (the number of feature maps). The convolution/subsampling combination, inspired by Hubel and Wiesel's notions of "simple" and "complex" cells [20], was implemented in Fukushima's neocognitron [19], though no globally supervised learning procedure such as backpropagation was available then.

Fig. 3. Architecture of LeNet-5, a convolutional neural network, here for character recognition. Each plane is a feature map, i.e., a set of units whose weights are constrained to be identical (Adapted from [14]).

Starting with LeNet-5 [14], convolutional neural networks (CNNs) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. All the weights are learned with backpropagation. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to date on MNIST, CIFAR and, most notably, on the ImageNet classification challenge [16, 18, 21]. For larger datasets such as ImageNet, the recent trend has been to increase the number of layers [22] and the layer size [23, 24], while using dropout [25] to address the problem of overfitting.

2.4 Multi-Scale ConvNet

Sermanet et al. [26] modified the traditional ConvNet (convolutional network) architecture by feeding 1st-stage features in addition to 2nd-stage features to the classifier, as shown in Fig. 4. Each stage is composed of a (convolutional) filter bank layer, a non-linear transform layer, and a spatial feature pooling layer. ConvNets are generally composed of one to three such stages, capped by a classifier composed of one or two additional layers.

Fig. 4. A 2-stage ConvNet architecture. The input is processed in a feedforward manner through two stages of convolutions and subsampling, and finally classified with a linear classifier. The output of the 1st stage is also fed directly to the classifier as higher-resolution features (Adapted from [26]).

Contrary to Fan et al.'s approach [27], they used the output of the first stage after pooling/subsampling rather than before. Additionally, applying a second subsampling stage to the branched output yielded higher accuracies than just one. Feeding the outputs of all the stages to the classifier allows the classifier to use not only high-level features, which tend to be global and invariant but carry little precise detail, but also pooled low-level features, which tend to be more local and less invariant and which encode local motifs with more precise detail. The motivation for combining representations from multiple stages in the classifier is to provide different scales of receptive fields to the classifier. When applied to the task of traffic sign classification as part of the GTSRB competition, the experiments produced a new record of 99.17%, above the human performance of 98.81%.
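Shape-wise, the multi-scale idea amounts to concatenating the pooled 1st-stage feature maps with the 2nd-stage feature maps before the classifier. The following is a small illustrative sketch; all layer sizes (8 maps of 14x14, 16 maps of 5x5, 43 output classes) are invented for illustration, not Sermanet et al.'s actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented sizes: 1st stage yields 8 maps of 14x14, 2nd stage yields
# 16 maps of 5x5 after further convolution and pooling.
stage1 = rng.normal(size=(8, 14, 14))
stage2 = rng.normal(size=(16, 5, 5))

def pool2x2(maps):
    """Average-pool each map by 2x2 to reduce resolution before branching."""
    c, h, w = maps.shape
    return maps[:, :h//2*2, :w//2*2].reshape(c, h//2, 2, w//2, 2).mean(axis=(2, 4))

# Branch: pool the 1st-stage output again, then feed BOTH stages forward.
branched = pool2x2(stage1)                      # 8 x 7 x 7
features = np.concatenate([branched.ravel(), stage2.ravel()])
print(features.shape)                           # (8*7*7 + 16*5*5,) = (792,)

# A linear classifier then operates on the combined multi-scale vector.
W = rng.normal(size=(43, features.size))        # e.g., 43 traffic-sign classes
scores = W @ features
```

The classifier thus sees both coarse, invariant 2nd-stage features and higher-resolution 1st-stage features, which is the point of the branch in Fig. 4.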
2.5 ReLU and Dropout

Sparsity was first introduced in computational neuroscience in the context of sparse coding in the visual system [28]. The neuroscience literature [29, 30] indicates that cortical neurons are rarely in their maximum saturation regime, and suggests that their activation function can be approximated by a rectifier. In 2011, Glorot et al. [31] showed that using a rectifying non-linearity gives rise to real zeros of activations and thus truly sparse representations. They showed that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks, in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data. Their experiments on image and text data indicated that training proceeds better when the artificial neurons are either off or operating mostly in a linear regime. Rectifying activations allowed deep networks to achieve their best performance without unsupervised pre-training on purely supervised tasks with large labeled datasets.

In 2012, Krizhevsky et al. [16] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), with their model achieving an error rate of 16.4%, compared to the 2nd-place result of 26.1%. They used purely supervised learning, suggesting that classical backpropagation does well even without unsupervised pre-training. Their success resulted from training a large, deep CNN, shown in Fig. 5, on 1.2 million labeled images, together with a few twists on LeCun's CNN, i.e., max(x, 0) rectifying non-linearities and "dropout" regularization. In terms of training time with gradient descent, the saturating non-linearities f(x) = tanh(x) or f(x) = (1 + e^-x)^-1 are much slower than the non-saturating non-linearity f(x) = max(x, 0). Following Nair and Hinton [32], they refer to neurons with this non-linearity as Rectified Linear Units (ReLUs). Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. Faster learning has a great influence on the performance of large models trained on large datasets.

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data due to its high capacity. To prevent this "overfitting", they employed a regularization approach called "dropout" that stochastically sets half the activations within a hidden layer to zero for each training sample during training. They used dropout in the first two fully-connected layers (see Fig. 6). A hidden unit thereby cannot rely on other hidden units being present, which prevents complex co-adaptations on the training data. Dropout has been shown to deliver significant gains in performance across a wide range of problems. Another way to view the dropout procedure is as a very efficient way of performing model averaging with neural networks. A good way to reduce the error on the test set is to average the predictions produced by a very large number of different networks. The standard way to do this is to train many separate networks and then to apply each of these networks to the test data, but this is computationally expensive during both training and testing. Random dropout makes it possible to train a huge number of different networks in a reasonable time. There is almost certainly a different network for each presentation of each training case, but all of these networks share the same weights for the hidden units that are present. Dropout does not seem to have the same benefits for convolutional layers, which are common in many networks designed for vision tasks. Dropout roughly doubles the number of iterations required to converge. Without dropout, their network exhibited substantial overfitting. They achieved record-breaking results on a highly challenging dataset using purely supervised learning. They found that the depth (five convolutional and three fully-connected layers) really was important for achieving their results, observing that removing any convolutional layer (each of which contains no more than 1% of the model's parameters) resulted in inferior performance (a loss of about 2% in the top-1 performance of the network).

Fig. 5. Architecture of Krizhevsky et al.'s DCNN (Adapted from [16]).
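Both twists are easy to state in code. Below is a minimal illustrative sketch (ours, not Krizhevsky et al.'s implementation) of the ReLU non-linearity and of dropout applied to a hidden layer, including the standard test-time rescaling that keeps expected activations matched between training and testing.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(x):
    # Non-saturating rectifier: max(x, 0).
    return np.maximum(x, 0.0)

def dropout(h, p_keep=0.5, train=True):
    """Randomly zero activations during training; at test time keep all
    units and scale by p_keep so the expected activation is unchanged."""
    if train:
        mask = rng.random(h.shape) < p_keep
        return h * mask
    return h * p_keep

h = relu(rng.normal(size=(4, 8)))   # activations of one hidden layer
h_train = dropout(h, train=True)    # roughly half the units silenced per sample
h_test = dropout(h, train=False)    # deterministic, rescaled activations
print(h_train)
print(np.allclose(h_test, 0.5 * h)) # -> True
```

Each training presentation effectively samples a different thinned network, while all the thinned networks share the surviving weights, which is the model-averaging view described above.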
2.6 ConvNet Visualization

Though large ConvNets (convolutional networks) have recently demonstrated impressive classification performance, there is no clear understanding of why they perform so well, or how they might be improved, which is deeply unsatisfactory from a scientific standpoint. Without a clear understanding of how and why they work, the development of better models is reduced to trial and error. In 2013, Zeiler et al. addressed both issues [24]. They introduced a visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. The visualization technique uses a multi-layered deconvolutional network (deConvNet), as proposed by Zeiler et al. [33], to project feature activations back to the input pixel space. DeConvNets are not used in any learning capacity, just as a probe of an already trained ConvNet. They also performed an ablation study to discover the performance contribution of the different model layers. This enabled them to find model architectures that outperformed Krizhevsky et al.'s [16] on the ImageNet classification benchmark.

2.7 Sparsely Connected Architecture

Bigger size typically means two drawbacks: a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited; and a dramatically increased use of computational resources, which are always finite, in a deep vision network where convolutional layers are chained. The fundamental way of solving both issues would be to ultimately move from fully connected to sparsely connected architectures, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the work of Arora et al. [34]: if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network structure can be learned layer-wise by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. Szegedy et al. [18] took inspiration and guidance from this theoretical work, and their GoogLeNet DCNN significantly outperformed the other entries in the ILSVRC 2014 classification and detection challenges.

Fig. 6. Sparse propagation of activations and gradients (Adapted from [31]).

3. A Deeper and Wider CNN

In this section, the important factors in the architecture and training methods of GoogLeNet's classification and detection submissions are described. Most of the content in this section is from Szegedy et al. [18]. Szegedy et al.'s GoogLeNet won the classification and object recognition challenges in ILSVRC 2014, setting the new state of the art.
GoogLeNet, a radically redesigned DCNN, used a new variant of convolutional neural network called "Inception" for classification, and the R-CNN [17] for detection. The GoogLeNet submission to ILSVRC 2014 actually uses 12x fewer parameters than the winning architecture of Krizhevsky et al. [16] from two years before, while being significantly more accurate. Compared to the 2013 result, the detection accuracy almost doubled, from 22.6% to 43.9%. The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. GoogLeNet uses a significantly deeper and wider convolutional neural network architecture than traditional DCNNs, at the cost of a modest growth in evaluation time. Fig. 7 shows a schematic view of the GoogLeNet network, which includes 9 repeated instances of the so-called Inception module. Fig. 8 shows the Inception module with dimension reduction.

Fig. 7. A schematic view of the GoogLeNet network (Adapted from [18]).
Fig. 8. Inception module with dimension reduction (Adapted from [18]).

Inspired by a neuroscience model of the primate visual cortex, the Inception model uses a series of trainable filters of different sizes in order to handle multiple scales. The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. An alternative parallel pooling path is added to the module in each stage, since pooling operations have been essential for the success of current state-of-the-art convolutional networks. Since even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters, 1x1 convolutions are placed before the expensive 3x3 and 5x5 convolutions as dimension-reduction modules. This allows for increasing not just the depth, but also the width of the network without a significant performance penalty. The resulting architecture leads to an over 10x reduction in the number of parameters compared to most state-of-the-art vision networks, which reduces overfitting during training and allows the system to perform inference with a low memory footprint.

All the convolutions, including those inside the Inception modules, use rectified linear activation. However, given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. To solve this issue, inspired by [34], in which a layer-by-layer construction is suggested, they added auxiliary classifiers connected to the layers in the middle of the network during training, expecting to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. However, the biggest gains in object detection have come from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. [17]. For the classification challenge entry, several ideas from the work of [35] were incorporated and evaluated, specifically as they relate to image sampling during training and evaluation. Their approach yielded solid evidence that moving to sparser architectures is a feasible and useful idea in general.
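The payoff of the 1x1 dimension-reduction convolutions can be seen by counting multiplications. The sketch below uses invented sizes (192 input channels on a 28x28 grid, reduction to 16 channels, 32 output maps from the 5x5 branch); the numbers are ours for illustration, not taken from [18].

```python
# Multiply count for one 5x5 Inception branch, with and without a 1x1
# "bottleneck" that first reduces the channel depth (sizes are invented).
H, W = 28, 28          # spatial size of the feature maps
c_in = 192             # input channels entering the module
c_mid = 16             # channels after the 1x1 reduction
c_out = 32             # output channels of the 5x5 branch

# Direct 5x5 convolution over all input channels:
direct = H * W * c_out * (5 * 5 * c_in)

# 1x1 reduction to c_mid channels, then 5x5 convolution on the reduced maps:
reduced = H * W * c_mid * (1 * 1 * c_in) + H * W * c_out * (5 * 5 * c_mid)

print(f"direct : {direct:,} multiplies")    # 120,422,400
print(f"reduced: {reduced:,} multiplies")   # 12,443,648 (roughly 10x fewer)
```

The same bookkeeping applies to the parameter count, which is where the over-10x reduction mentioned above comes from.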
4. Conclusions

Deep learning techniques have made tremendous progress in computer vision, significantly outperforming other techniques, and even humans in certain limited recognition tests. Most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. Although the tremendous progress of this new technology is very impressive and encouraging, it is still difficult to predict the future success of deep neural networks. Remembering that we still know very little about the mechanisms of the brain and have many orders of magnitude to go in order to match the infero-temporal pathway of the primate visual system, the most important advances in deep neural networks certainly lie in the future.

Acknowledgment

This material is based upon work supported by Sangmyung University, Korea. Comments by Prof. Dongsun Park at Chonbuk National University, Korea, greatly helped to improve an earlier version of this manuscript.

References

[1] L. Deng and D. Yu, Deep Learning: Methods and Applications, Now Publishers Inc., 2014.
[2] W. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[3] Donald O. Hebb, The Organization of Behavior: A Neuropsychological Theory, Wiley, June 1949.
[4] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, pp. 386-408, 1958.
[5] B. Widrow and M. E. Hoff, Jr., "Adaptive switching circuits," IRE WESCON Convention Record, Part 4, pp. 96-104, August 1960.
[6] M. Minsky and Seymour Papert, Perceptrons, Cambridge, MIT Press, 1969.
[7] T. Kohonen, "Correlation matrix memories," IEEE Transactions on Computers, vol. 21, pp. 353-359, April 1972.
[8] James A. Anderson, "A simple neural network generating an interactive memory," Mathematical Biosciences, vol. 14, pp. 197-220, 1972.
[9] S. Grossberg, "Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors," Biological Cybernetics, vol. 23, pp. 121-134, 1976.
[10] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences of the United States of America (PNAS), vol. 79, pp. 2554-2558, 1982.
[11] P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, PhD thesis, Harvard University, 1974.
[12] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, eds., vol. I, pp. 318-362, MIT Press, Cambridge, 1986.
[13] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, pp. 541-551, 1989.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, pp. 2278-2324, 1998.
[15] Geoffrey E. Hinton and Simon Osindero, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems 25, pp. 1106-1114, 2012.
[17] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CVPR 2014, IEEE Conference on, 2014.
[18] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, "Going deeper with convolutions," CoRR, 2014.
[19] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biol. Cybernetics, vol. 36, pp. 193-202, 1980.
[20] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex," Journal of Physiology (London), vol. 160, pp. 106-154, 1962.
[21] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," ECCV 2014 (Honorable Mention for Best Paper Award), arXiv:1311.2901, Nov. 28, 2013.
[22] Min Lin, Qiang Chen, and Shuicheng Yan, "Network in network," CoRR, abs/1312.4400, 2013.
[23] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," CoRR, abs/1312.6229, 2013.
[24] Matthew D. Zeiler, Hierarchical Convolutional Deep Learning in Computer Vision, Ph.D. thesis, Nov. 8, 2013.
[25] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, abs/1207.0580, 2012.
[26] Pierre Sermanet and Yann LeCun, "Traffic sign recognition with multi-scale convolutional networks," IJCNN, 2011.
[27] J. Fan, W. Xu, Y. Wu, and Y. Gong, "Human tracking using convolutional neural networks," Neural Networks, IEEE Transactions on, vol. 21, pp. 1610-1623, 2010.
[28] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?," Vision Research, vol. 37, pp. 3311-3325, 1997.
[29] P. C. Bush and T. J. Sejnowski, The Cortical Neuron, Oxford University Press, 1995.
[30] R. Douglas and K. Martin, "Recurrent excitation in neocortical circuits," Science, vol. 269, pp. 981-985, 1995.
[31] Xavier Glorot, Antoine Bordes, and Yoshua Bengio, "Deep sparse rectifier neural networks," Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA, vol. 15 of JMLR: W&CP, pp. 315-323, 2011.
[32] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," Proc. 27th International Conference on Machine Learning, 2010.
[33] M. Zeiler, G. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," ICCV, 2011.
[34] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma, "Provable bounds for learning some deep representations," CoRR, abs/1310.6343, 2013.
[35] Andrew G. Howard, "Some improvements on deep convolutional neural network based image classification," CoRR, abs/1312.5402, 2013.

Hyeon-Joong Yoo is a Professor in the IT Engineering Department at Sangmyung University, Chonan, Korea. He received his B.S.
degree in Electronics Engineering from Sogang University, Korea, in 1982, and his M.S. and Ph.D. in Electrical and Computer Engineering from the University of Missouri in 1991 and 1996, respectively. His current research interests include computer vision, pattern recognition, artificial neural networks, and image processing.

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015, http://dx.doi.org/10.5573/IEIESPC.2015.4.1.044

A Single Inductor Dual Output Synchronous High Speed DC-DC Boost Converter using Type-III Compensation for Low Power Applications

Abbas Syed Hayder, Hyun-Gu Park, Hongin Kim, Dong-Soo Lee, Hamed Abbasizadeh, and Kang-Yoon Lee

College of Information and Communications Engineering, Sungkyunkwan University / Suwon, South Korea {hayder, blacklds, squarejj, cbmass85, hamed, klee}@skku.edu

* Corresponding Author: Kang-Yoon Lee

Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015

* Regular Paper

* Extended from a Conference: Preliminary results of this paper were presented at the IEEE ISCIT 2014.

Abstract: This paper presents a high-speed synchronous single inductor dual output boost converter using Type-III compensation for power management in smart devices. Maintaining multiple outputs from a single inductor is becoming very important because of inductor sizes. By using a high switching frequency, the inductor and capacitor sizes are reduced. Owing to synchronous rectification, this kind of converter is suitable for SoC. The phase is controlled in a time-sharing manner for each output. The controller used here is Type-III, which ensures quick settling time and high stability. The outputs are stable within 58µs. The simulation results show that the proposed scheme achieves a better overall performance. The input voltage is 1.8V, the switching frequency is 5MHz, and the inductor used is 600nH. The output voltages and powers are 2.6V & 3.3V and 147mW & 230mW, respectively.

Keywords: DC converter, Switch mode power supplies, SIMO converters

1. Introduction

As technology advances, the desire for small-area devices is increasing. Power management is the heart of all electronic devices; it keeps the other blocks alive by supplying proper voltage levels. DC-DC converters provide a regulated supply for the rest of the circuit, including the DSP chip in portable devices, with a different output level from the unregulated input supply. Voltage regulation for SoC (System-on-Chip) applications is required to fulfil a variety of functions [2]. Regulators are classified into two categories: linear regulators and switching regulators. Conventional power management techniques consume high power under static conditions, but switching regulators have high efficiency compared to conventional techniques. Linear regulators waste most of their energy in the form of heat, requiring extra heat sinks. On the other hand, switching regulators, or choppers, dissipate a negligible amount of power during conversion. A chopper circuit requires at least two switches, a switching source, a control circuit, and an energy storage element. The input voltage is chopped at a specific frequency. As the switching frequency is increased, the charge and discharge region of the inductor shrinks, demanding very careful calculations, timing diagrams, and circuit design to generate proper signals for the gate drives.
Power loss at the power stage increases roughly with fs [4]; it comprises conduction losses and switching losses. The switching speed of MOSFET devices is determined by the time required to charge/discharge the gate capacitance. Therefore, multiple parallel MOSFETs can be used to achieve a high switching speed. Bootstrap, φ-inverter, and 3rd-generation gate drive techniques [5, 6] are ideal candidates for switching the gate voltage at higher speed. In this paper, a small amount of efficiency is sacrificed to achieve high speed and lower volume. In addition, efficiency is an inverse function of the difference between the input and output voltages. Conventional dc-dc converters suffer power losses due to their bulky nature and require as many inductors as outputs.

Fig. 1. Classification of DC-DC converters: linear regulators (which waste energy by dissipating excess power as heat) and switching regulators, the latter divided into SISO (single inductor single output) and SIMO (single inductor multiple output) buck/boost converters, synchronous or asynchronous.
Fig. 2. (a) The circuit of an ideal boost DC-DC converter, (b) the equivalent circuit of the inductor charge phase, (c) the inductor discharge phase.

The popular control scheme for switching converters is pulse-width modulation (PWM). The control uses a constant switching frequency and varies the duty cycle as the load current varies. This scheme achieves good regulation, a low noise spectrum, and high efficiency. A DC-DC converter, which is an essential block in a PMIC (Power Management IC), provides a regulated output voltage from a linearly discharging battery in portable devices [7]. An LED driver is essentially a current source (or sink) that maintains the constant current required for achieving the desired color and luminous flux from an array of LEDs [8]. In the proposed converter design, the inductor and capacitor sizes are reduced to 40% of the lowest average value published thus far. In addition, a smooth handover block in the feedback loop was designed to assist the loop at startup and avoid indefinite unpredictable behavior.

2. Classification of Converters

Before the evolution of SIMO (Single Inductor Multiple Output) converters, there were initially only two categories of dc-dc converters: linear regulators and switching regulators. Fig. 1 shows the present-day classification of dc-dc converters. Topologies in each category of switching regulators are divided into four classes: buck converter, boost converter, inverting topology, and transformer flyback topology. In a boost converter, the inductor is charged in the first phase, and in the second phase the inductor's energy, together with the input supply, is pushed to the output node. The inductor charge phase of a boost converter is shown in Fig. 2. The equivalent circuit for Φ1 is shown in Fig. 2(b), which is obtained by closing SW1 and opening SW2 for a certain on-time ton. During Φ1, the inductor L is charged by the voltage source Uin, causing the inductor current iL(t) to increase from its minimum value IL_min to its maximum value IL_max. Simultaneously, the output capacitor C is discharged through the load RL. For the inductor discharge phase Φ2, the equivalent circuit is shown in Fig. 2(c), which is obtained by opening SW1 and closing SW2 for a certain off-time toff. During Φ2, L is discharged into C and RL, causing iL(t) to decrease. As a result, iL(t) is divided over C and RL, thereby charging C and providing power to RL. Because L is discharged in series with Uin, it can be intuitively seen that the output voltage Uout will always be higher than Uin [9].
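In steady state, volt-second balance on L ties the two phases together: Uin*ton = (Uout - Uin)*toff, which gives Uout = Uin/(1 - D) for an ideal boost converter with duty cycle D = ton/(ton + toff). The following is a small illustrative sketch assuming ideal components; the 1.8V input and 2.6V/3.3V targets echo this design, but the calculation itself is generic.

```python
def boost_vout(v_in, duty):
    # Ideal boost converter: volt-second balance over one switching period
    # gives v_in * D = (v_out - v_in) * (1 - D)  =>  v_out = v_in / (1 - D).
    return v_in / (1.0 - duty)

def boost_duty(v_in, v_out):
    # Inverse relation: duty cycle needed for a target output voltage.
    return 1.0 - v_in / v_out

v_in = 1.8
for v_out in (2.6, 3.3):
    d = boost_duty(v_in, v_out)
    print(f"Vout = {v_out} V needs D = {d:.3f}; check: {boost_vout(v_in, d):.2f} V")
# Vout = 2.6 V needs D = 0.308; Vout = 3.3 V needs D = 0.455
```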
3. Related Work

The initial work with a single output has already been presented in our previous works [10] and [11]. A 3.7V high-speed dc-dc synchronous boost converter with smooth loop handover was discussed there. The switching frequency was 5MHz and the inductor size was 600nH. An output of 3.7V is regulated from a supply of 1.8V. The three main blocks of the dc-dc converter were discussed, and PWM control was used in the feedback loop. In [11], on the other hand, we used Type-III compensation for a single-output converter to achieve a quick settling time. In neither paper was the dual output with Type-III compensation addressed.

Fig. 3. Block diagram of the proposed converter in [9].

The architecture shown in Fig. 3 has three main blocks: the power stage, the control stage, and the gate drive stage. The control stage has two further sub-blocks: PWM generation and SLHO (Smooth Loop Hand Over). The current circulation in the power stage is much higher than in the control stage. The output voltage is sensed, scaled, and compensated using an error amplifier. It is compared with a carrier signal to generate a PWM signal proportional to the output voltage. The PWM signal generated from the feedback loop and a constant-duty-cycle signal are fed to the Smooth Loop Mux (SLM). At startup the KHF is engaged, and after the predetermined time set in the DB (Delay Block), the PWM loop is engaged for the rest of the operation. The Delay Block is triggered 20µs after power-on of the converter. Depending on the load, Cod is connected to and disconnected from the output using the control signals Rc and Rcb to reduce the ripples. Ds1 and Ds2 are two non-overlapped signals generated to avoid any loss or conflict in the inductor current. This architecture works with CCM (continuous conduction mode) operation. Sm and Sa are used with anti-parallel diodes, as the MOSFET channel has to carry the positive current, e.g., from an n-channel MOSFET's drain to source. If the load is inductive, during the interval when the switch is turned on, current flows in the opposite direction, but this current will flow through the diode. Without a diode, the induced current generates peaks of high voltage, causing the current to stop. At heavy loads conduction losses dominate, while switching losses dominate at light loads. We used four stages of cascaded inverters as gate drives [10].

4. Control Techniques and Efficiency Model

There are many control techniques and compensation methods in use for dc/dc converters, among them hysteretic, PI, Type-II, and Type-III control, in combination with voltage-mode and current-mode sensors. The peak and average inductor currents are acquired and monitored as inputs for the controllers. Feed-forward control can also be used to improve the dynamic performance of the dc/dc converter. In our case we initially used the PI control technique; conventional PWM techniques have already been discussed in our previous work. Due to the slow response time, however, we re-designed our converter with a Type-III controller, which is comparatively fast. Type-III compensation is used to improve the transient response and to boost the crossover frequency and phase margin. It guarantees the targeted bandwidth and phase margin, as well as an unconditionally stable control loop [12].

Fig. 4. Type-III compensator.

Fig. 4 shows the conventional Type-III compensation using a voltage op-amp.
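For reference, the following is a generic sketch in LaTeX of the Type-III compensator transfer function. This is the common textbook form for a network like that of Fig. 4, not the paper's own equations, and the component labels R1-R3, C1-C3 are the conventional ones, assumed here rather than taken from [12].

```latex
% Generic Type-III compensator: pole at the origin, two zeros, two poles.
G_c(s) = \frac{\omega_{p0}}{s}\,
         \frac{\left(1+\frac{s}{\omega_{z1}}\right)\left(1+\frac{s}{\omega_{z2}}\right)}
              {\left(1+\frac{s}{\omega_{p1}}\right)\left(1+\frac{s}{\omega_{p2}}\right)},
\qquad
\omega_{z1}=\frac{1}{R_2 C_1},\;
\omega_{z2}=\frac{1}{(R_1+R_3)\,C_3},\;
\omega_{p1}=\frac{1}{R_3 C_3},\;
\omega_{p2}=\frac{C_1+C_2}{R_2 C_1 C_2}\approx\frac{1}{R_2 C_2}.
```

A common design placement puts the two zeros near the LC double pole of the power stage and the two poles near the ESR zero and half the switching frequency, which is how the compensator boosts phase at the crossover frequency.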
Simulation results from the discussed single inductor single output (SISO) work are shown in Fig. 5.

Fig. 5. Simulation results of SISO (waveform panels showing Vo, IL, Vcarr, Err, and D1 versus time).

The compensator transfer function, together with its closed-loop form including the feedback block (in which the resistances RY and RZ appear), provides two poles (fp1 and fp2, besides the pole at zero, fp0) and two zeros (fz1 and fz2). All the passive components involved play a dual role in determining the poles and zeros, so the calculation can become fairly cumbersome and iterative. A valid simplifying assumption, however, is that C1 is much greater than C3 [13], which fixes the final locations of the poles and zeros.

Notation: Po, output power; Pc, Ps, conduction and switching losses; trv, rise time; tfl, fall time; CGN, NMOS gate capacitance; CGP, PMOS gate capacitance; CX, equivalent parasitic capacitance at node X; RL, RC, equivalent series resistances of the inductor and capacitor, respectively.

5. Proposed Dual Output Architecture

In our proposed dual output converter design, shown in Fig. 6, the inductor energy is shared between the two outputs. A dual output with phase control is presented in [1], with a lower switching frequency, current sensors, and a high inductor value. In contrast, we designed with a high switching frequency, a Type-III controller, a lower inductor value, and low load capacitors, and without current sensors, because current sensors occupy extra area. Type-III controllers show a rapid response, and improved controllers are necessary to cope with a higher switching frequency.

Fig. 6. Block diagram of the proposed SIDO converter.

During phase 1, output Vo1 is maintained through the coordination of Sm and Sa; meanwhile the controller enables sp1, thus selecting Ref1. Likewise, output Vo2 is maintained through the coordination of Sm and Sb. The charging time of Sm is decided by the controller depending upon the present output voltage at Vo2; in this phase Ref2 is selected. The rate of charging the inductor is the same for both outputs, Vg/L, while the discharge slope is (Vg-Vo1)/L for phase 1 and (Vg-Vo2)/L for phase 2. The filtering capacitors Coa and Cob smooth the output glitches. The overall controller output is shown in Fig. 7(c). A comparison with other works is summarized in Table 1.

For the duration that sp1 is on, the duty ratio of MOSFET Sm is calculated to generate the output required for Vo1; similarly, when sp2 is on, the duty ratio of Sm is calculated to stabilize the output Vo2. At any one time, one sensed output is selected using sp1 and sp2. When sp1 is selected, the reference voltage Ref1 becomes the input to the Type-III controller; in contrast, when sp2 is selected, Ref2 becomes the input to the controller. Each output is scaled and becomes an input to the Type-III controller, where an error signal is generated after subtraction from the reference signal. This error signal becomes the input to the comparator, and non-overlapped signals are generated by the dead-time control. A gate driver is used to generate high current and to turn the power MOSFETs on and off. We used the existing clock for the switching frequency; using a clock divider formed by feedback of a D flip-flop, 2.5MHz is achieved to drive sp1 and sp2.

Fig. 7. Phase control signals and Type-III output.

The simulation results are shown in Figs. 8(a)-(h). The generated NMOS signal Sm is shown in Fig. 8(a). The signals Sa and Sb are fed to the high-side PMOS switches located near the outputs Vo1 and Vo2, respectively. The boosted output voltages are presented in Figs. 8(d) and (e), where the output voltage levels stabilize within 58µs. The comparator input is given in Fig. 8(f). The carrier signal and inductor current are illustrated in Figs. 8(g) and (h).

Fig. 8. Simulation results of the proposed converter.

Zoomed output voltages and load currents are shown in Fig. 9 for the same load resistors connected at both outputs. The average load currents taken by Vo1 and Vo2 were 56mA and 70mA, respectively.

Fig. 9. Output voltages and load currents.

Table 1. Comparison summary.

Parameters           | Unit  | [1]        | [3]  | This Work
Switching frequency  | Hz    | 500k       | 1M   | 5M
Input voltage        | V     | 1.8        | 2.6-5 | 1.8
Inductor             | H     | 1µ         | 22µ  | 600n
Output voltages      | V     | 3.0, 3.45  | 1.2  | 2.6, 3.3
Filtering capacitors | F     | 44µ, 47µ   | 35µ  | 1µ, 1µ
Compensation         | -     | -          | -    | Type-III
Settling time        | s     | -          | -    | 58µ
Current sensors      | -     | Yes        | No   | No
Output power         | W     | 150m, 170m | 240m | 147m, 230m

6. Conclusion

This paper proposed a single inductor dual output dc-dc converter design using Type-III compensation. In the beginning, the existing single output and dual output architectures were discussed. The design was then developed into a dual output dc-dc converter with an improved control scheme, which is essential for high-speed converters, ultimately reducing inductor and capacitor sizes. The settling time is 58µs. The output voltage was scaled and fed to the controller in phase control mode. In each phase, the two phase control signals keep track of and adjust the duty ratios to achieve the desired outputs.

Acknowledgement

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (No. 2011-0009454).

References

[1] D. Ma, W.-H. Ki, C.-Y. Tsui, and P. K. Mok, "A single-inductor dual-output integrated DC/DC boost converter for variable voltage scheduling," in Proceedings of the 2001 Asia and South Pacific Design Automation Conference, 2001, pp. 19-20.
Yuh-Shyan, "A sub-1V voltage-mode DC-DC buck converter using PWM control technique," in Electron Devices and Solid-State Circuits (EDSSC), 2010 IEEE International Conference of, 2010, pp. 1-4. Article (CrossRef Link) [3] E. Bonizzoni, F. Borghetti, P. Malcovati, F. Maloberti, and B. Niessen, "A 200mA 93% Peak Efficiency Single-Inductor Dual-Output DC-DC Buck Converter," in Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International, 2007, pp. 526-619. Article (CrossRef Link) [4] E. Sánchez-Sinencio and A. G. Andreou, Lowvoltage/low-power integrated circuits and systems: low-voltage mixed-signal circuits vol. 4: Wiley-IEEE Press, 1999. Article (CrossRef Link) [5] L. Balogh, "Design and application guide for high speed MOSFET gate drive circuits," 2001. Article (CrossRef Link) [6] S. Mappus, "Predictive gate drive boosts synchronous DC/DC power converter efficiency," App. Rep. Texas Instruments SLUA281, 2003. Article (CrossRef Link) [7] Y. H. Ko, Y. S. Jang, S. K. Han, and S. G. Lee, "Non-load-balance-dependent high efficiency singleinductor multiple-output (SIMO) DC-DC converters," in Custom Integrated Circuits Conference (CICC), 2012 IEEE, 2012, pp. 1-4. Article (CrossRef Link) [8] A. T. L. Lee, J. K. O. Sin, and P. C. H. Chan, "Scalability of Quasi-Hysteretic FSM-Based Digitally Controlled Single-Inductor Dual-String Buck LED Driver to Multiple Strings," Power Electronics, IEEE Transactions on, vol. 29, pp. 501-513, 2014. Article (CrossRef Link) [9] S. A. Hayder, K. Hongjin, K. Ji-Hun, and L. KangYoon, "3.7V High Frequency DC-DC Synchronous Boost Converter with Smooth Loop Handover," in 50 Hayder et al.: A Single Inductor Dual Output Synchronous High Speed DC-DC Boost Converter using Type-III Compensation … ISCIT 2014, Incheion, 2014. Article (CrossRef Link) [10] S. A. Hayder, P. Hyung-Gu, L. Dong Soo, P. YoungJun, and L. Kang-Yoon, "High Speed DC-DC Synchronous Boost Converter Using Type-III Compensation for Low Power Applications," in IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, USA, 2015. Article (CrossRef Link) [11] L. Cao and A. P. Power, "Design Type II Compensation In A Systematic Way." Article (CrossRef Link) [12] B. Hauke, "Basic calculation of a boost converter's power stage," Texas Instruments, Application Report November, pp. 1-9, 2009. Article (CrossRef Link) Abbas Syed Hayder received his Master’s degree from Hanyang University in Electrical Electronics Control and Instrumentation Engineering in 2010. He is now pursuing his PhD degree from Sungkyunkwan University, in Analog Integrated Circuits. His research interests include high speed SIMO (Single Inductor Multiple Output) dc-dc Converters, PLL, VGA (Variable Gain Amplifier), Digital circuit design, BGR (Band Gap References), and Control techniques for dc-dc converters. Hyung-Gu Park was born in Seoul, Korea. He received his B.S. degree from the Department of Electronic Engineering at Konkuk University, Seoul, Korea, in 2010, where he is currently working toward a Ph.D. degree in School of Information and Communication Engineering, Sungkyunkwan University. His research interests include highspeed interface IC and CMOS RF transceiver HongJin Kimwas born in Seoul, Korea, in 1985. He received his B.S. degree from the Department of Electronic Engineering at Chungju University, Chungju, Korea, in 2010, where he is currently working toward a M.S& Ph.D. degree in electronic engineering at SungKyunKwan University, Suwon, Korea. 
His research interests include CMOS RF ICs and analog/mixed-mode IC design for low-power applications, and power management ICs.

Dong-Soo Lee received his B.S. degree from the Department of Electronic Engineering at Konkuk University, Seoul, Korea, in 2012, and his M.S. degree in electronic engineering from Sungkyunkwan University, Suwon, Korea, in 2014. He is currently working toward a Ph.D. degree in electronic engineering at Sungkyunkwan University. His research interests include CMOS RF ICs, phase locked loops, silicon oscillators, and sensor interface design.

Hamed Abbasizadeh was born in Suq, Iran, in 1984. He received his B.S. degree in Electronic Engineering from Bushehr Azad University, Bushehr, Iran, in 2008, and his M.S. degree in Electronic Engineering from Qazvin Azad University, Qazvin, Iran, in 2012. He is currently working toward a Ph.D. degree in the School of Information and Communication Engineering, Sungkyunkwan University. His research interests include the design of high-performance data converters, bandgap reference voltages, low-dropout regulators, and ultra-low-power CMOS RF transceivers.

Kang-Yoon Lee received his B.S., M.S., and Ph.D. degrees in the School of Electrical Engineering from Seoul National University, Seoul, Korea, in 1996, 1998, and 2003, respectively. From 2003 to 2005, he was with GCT Semiconductor Inc., San Jose, CA, where he was a Manager of the Analog Division and worked on the design of CMOS frequency synthesizers for CDMA/PCS/PDC and single-chip CMOS RF chip sets for W-CDMA, WLAN, and PHS. From 2005 to 2011, he was with the Department of Electronics Engineering, Konkuk University as an Associate Professor. Since 2012, he has been with the School of Information and Communication Engineering, Sungkyunkwan University, where he is currently an Associate Professor. His research interests include the implementation of power integrated circuits, CMOS RF transceivers, analog integrated circuits, and analog/digital mixed-mode VLSI system design.

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015, http://dx.doi.org/10.5573/IEIESPC.2015.4.1.051

Digital Error Correction for a 10-Bit Straightforward SAR ADC

Behnam Samadpoor Rikan, Hamed Abbasizadeh, Sung-Han Do, Dong-Soo Lee, and Kang-Yoon Lee

College of Information and Communication Engineering, Sungkyunkwan University / Suwon, South Korea {behnam, hamed, dsh0308, blacklds, klee}@skku.edu

* Corresponding Author: Kang-Yoon Lee

Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015

* Regular Paper

Abstract: This paper proposes a 10-b SAR ADC. To increase the conversion speed and reduce the power consumption and area, redundant cycles were implemented digitally in a capacitor DAC. The capacitor DAC algorithm was straightforward switching, which included digital error correction steps. A prototype ADC was implemented in 0.18-µm CMOS technology. This structure consumed 140µW and achieved 59.4-dB SNDR at 1.25MS/s under a 1.8-V supply. The figure of merit (FOM) was 140fJ/conversion-step.

Keywords: Redundancy, Digital error correction, Straightforward, SAR ADC

1. Introduction

High-performance analog-to-digital conversion on the nanometer scale relies on high-speed switching rather than amplification.
The nature of a charge redistribution successive-approximation-register (SAR) analog-to-digital converter (ADC) is based on high-speed switching. As the demand for low power consumption and high precision increases, the prominent power efficiency of this structure has made it popular for high-resolution analog-to-digital conversion applications. Moreover, a survey of recent publications confirms that interest in SAR ADCs has grown rapidly. While the low-power characteristics of the SAR ADC have increased the applications of this structure, it suffers from low speed, because a large number of decision cycles are required for each conversion. The need for high-speed, low-power SAR ADC structures has prompted many efforts to increase the speed of this block and overcome this drawback. In addition, many techniques have been used to further decrease the power consumption of these blocks [1].

Applying redundancy to the conversion steps is an alternative approach for SAR ADC design. A redundant structure has two benefits. First, a redundant decision results in built-in error resilience, in which a bounded decision error in the early steps can be absorbed into the redundant conversion ranges of the later steps. The settling accuracy in most MSB conversion steps of the DAC can largely be relaxed, and the conversion speed can be expedited. Second, the redundancy allows overlapped conversion curves in parts of the ADC transfer function, which is instrumental to a digital-domain treatment of DAC mismatch errors. In this paper, a digital error correction (DEC) technique is applied to the DAC settling error problem in binary-decision SAR ADCs. The DEC technique leaves the capacitor array intact and employs digital addition only [2].

The organization of this paper is as follows. Section 2 reviews the conventional structure and straightforward algorithms for SAR ADCs. Section 3 proposes the digital error correction technique used for the SAR ADC structure. A 10-bit prototype SAR ADC using the digital error correction technique is implemented in Section 4. Section 5 reports the simulation results, and Section 6 concludes the paper.

2. Conventional vs Straightforward SAR Structures

Fig. 1 shows the block diagram of the SAR ADC. The input is sampled and held in the S&H sub-block. The DAC is a capacitor array in which a voltage is created according to the control bits from the SAR logic. The output voltage of the DAC is compared with the sampled input in a comparator. The SAR logic decides the control bits according to the comparator output in each cycle. The output of the ADC becomes available after the decision of the N-th bit is finished.

Fig. 1. Structure of the SAR ADC.

Fig. 2 summarizes the structure and switching sequence of a 3-bit dual sampling SAR ADC as an example. Initially, the input is sampled in C1, and CDAC is sampled to VREFT-VIN. At the hold step, the MSB is decided, and according to the MSB bit, the SAR logic decides how to redistribute the charge in the next step. If the MSB is 1, (1/2)VREFT is added to the CDAC voltage by connecting the bottom plates of one half of the unit capacitors to VREFT. In this step, (3/2)VREFT-VIN is compared with VIN and the second bit is decided. In the case that the MSB is 0, the CDAC voltage again boosts to 1.5VREFT-VIN, and the bottom plate of the sample capacitor (C1) is connected to VREFT, which increases the input voltage at the positive input of the comparator to VREFT+VIN. For the third bit decision, the voltage in C1 is unchanged; if the output of the comparator is 1, one more capacitor in the cap-bank is connected to VREFT, and if the output of the comparator is 0, one of the capacitors whose bottom plate is connected to VREFT goes back to VREFB. This structure is called dual sampling because the analog signals are sampled and held asymmetrically at each input side. In this structure, the MSB bit is decided with no switching, so the power consumption is lower than in conventional SAR ADCs. In addition, the number of unit capacitors in CDAC has been halved compared to conventional structures. Here, VREFT and VREFB are substitutes for VDD and VSS, respectively, because for some error correction reasons the exact VDD and VSS values cannot be used. Therefore, VREFT, which is slightly lower than VDD, and VREFB, which is slightly higher than VSS, were used. The reason this structure is reviewed is to give the reader some insight into the SAR ADC and the switching sequence in a capacitor bank; in most SAR logic, the switching algorithm is to some extent similar to this structure. For more detailed information, the reader can refer to [3].

Fig. 2. 3-bit dual sampling switching sequence.
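The successive-approximation search itself can be stated compactly. The following is a minimal behavioral sketch (ours, with an ideal DAC and comparator, omitting all of the switching detail above) of an N-bit binary-search SAR conversion.

```python
def sar_convert(v_in, v_ref, n_bits=10):
    """Ideal N-bit SAR conversion: a binary search that, cycle by cycle,
    keeps a trial bit only if the DAC estimate stays below the input."""
    code = 0
    for k in reversed(range(n_bits)):
        trial = code | (1 << k)                 # tentatively set bit k
        v_dac = v_ref * trial / (1 << n_bits)   # ideal DAC output
        if v_in >= v_dac:                       # comparator decision
            code = trial                        # keep the bit
    return code

# Example: 0.7 V input, 1.8 V reference, 10-bit conversion.
code = sar_convert(0.7, 1.8)
print(code, bin(code))   # 398 0b110001110
```

Each loop iteration corresponds to one decision cycle; the capacitor switching schemes discussed in this section are different hardware realizations of exactly this search.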
In the case that MSB is 0, the CDAC voltage again boosts to 1.5VREFT-VIN, and the bottom plate of the sample capacitor (C1) is connected to the VREFT, which causes an increase in the input voltage at the positive input of the comparator to the VREFT+VIN. For the third bit decision, the voltage in C1 will be unchanged and if the output of the comparator is 1, one more capacitor in the cap-bank will be connected to the VREFT, and if the output of the comparator is 0, one of the capacitors, in which bottom plate is connected to the VREFT, will go back to the VREFB. This structure is called dual sampling because the analog signals are sampled and held asymmetrically at each input side. In this VIN VREFT VIN structure, the MSB bit is decided with no switching so the power consumption of this structure will be lower than the conventional SAR ADCs. In addition, the number of unit capacitors in CDAC has been decreased to half compared to conventional structures. Here VREFT and VREFB are a substitution of VDD and VSS, respectively, because for some error correction reasons, the exact VDD and VSS values cannot be used. Therefore, VREFT, which is slightly lower than VDD, and VREFB, which is slightly higher than VSS, were used. The reason that this structure is reviewed is to give the reader some insight into SAR ADC and switching sequence in a capacitor bank. Moreover, in most SAR logics the switching algorithm is to some extent similar to this structure. For more detailed information reader can refer to [3]. For each cycle, we need to charge or discharge some of the capacitors from VREFT to VREFB and vice versa, which requires a long time and consumes large amounts of energy. Moreover, according to these simulations, to achieve a high performance structure we need to increase the unit capacitor value. In addition, the amount of C1 needs to be several pF. When small values are used for this capacitor, the performance of the SAR ADC degrades. The structure used for the proposed ADC is called straightforward DAC switching. This reduces the wasted power and increases the speed of switching. Note that a similar concept has been reported [4-6]. The positive input of the comparator is connected to the VCM which is created in the reference generator and there is no need to be sampled. The other inputs are connected to the CDAC. Similar to the dual sampling structure, Fig. 3 summarizes the detailed switching sequence for the 3bit straightforward SAR ADC. At the sampling mode, the output of CDAC is charged to VCM-VIN and in hold VREFB VREFB c1 c1 c c c c 2 1/4CV ref + ‐ YES VREFB c1 2 CVref VREFB c c VREFT c c + ‐ VREFB c c c c + ‐ c c VREFT c c ‐ VIN > 7/8 Vref ? VIN > 3/4 Vref ? VREFB c1 2 1/4CV ref YES + NO VREFB c1 VREFB + VREFB c VREFT c ‐ c VIN > 5/8 Vref ? c VIN > 1/2 Vref ? VREFT 2 CV ref 1/4CV ref NO YES VREFT c1 VREFB c c VREFT c c + ‐ VREFB c c VREFT c c + ‐ VIN > 3/8 Vref ? VIN > 1/4 Vref ? VREFT c1 2 1/4CV ref Sample mode (cycle 0) Hold mode (cycle 1) c1 2 Redistribution mode (cycle 2) NO VREFB c c VREFT c c + ‐ VIN > 1/8 Vref ? Redistribution mode (cycle 3) Fig. 2. 3 bit dual sampling switching sequence. 53 IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015 2c ‐ c 2c + VCM VX=2VCM‐VIN+VCM/2 ‐ 2c c c VREFT VCM 2 C(VCM) VCM VCM 2c c + VCM VX=2VCM‐VIN+(1/4)VCM 2c c VCM > VX ? ‐ c VREFT 2 (1/4)C(VCM) ‐ c VCM > VX ? YES + c VCM > VX ? ‐ VREFT c VIN VCM VX=2VCM‐VIN + VCM VX=2VCM‐VIN+(3/4)VCM YES VX=VCM‐VIN VCM 2 (1/4)C(VCM) + VCM NO VREFB VCM VCM > VX ? 
3. Digital Error Correction Technique Using Redundant Cycles

In high-resolution ADCs, the mismatch between the coarse capacitor values and the fine capacitor values normally causes decision errors in the MSB bits. In particular, when the speed increases, the DAC does not settle properly in the decision phases for the MSBs. This causes errors in the MSB codes that cannot be recovered in the following cycles. To solve this problem, some redundant cycles are normally provided. Redundancy can be applied by adding extra capacitors to the hardware and providing some redundant decision cycles. Here, digital error correction is used to solve this problem; it does not require any extra hardware, and the CDAC structure remains intact. The concept is explained with a 4-bit example, which has 4 bits and 5 decision cycles.

Fig. 4. DAC settling error correction concept using a redundant cycle (C3).

As illustrated in Fig. 4, the decision procedure in the SAR ADC is divided into two steps. The first step decides the 2 coarse bits, and the second step decides the 3 fine bits. One bit of redundancy is provided. The coarse decision thresholds are at 6/16 and 10/16; that is, the decision levels are shifted by 0.5 LSB of the coarse resolution. Therefore, the MSB decision starts with the DAC level at 5/8 VREF instead of 1/2 VREF.
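To make the overlap concrete before walking through the example, the following numeric sketch applies the addition-only combination to this 4-bit case. The digital subtraction of the 0.5 coarse-LSB shift and the clamping of the fine code are assumptions made for illustration; in the actual converter the shift is realized in the analog domain as the V-SHIFT of Fig. 4.

```python
# Numeric sketch of addition-only error correction for the 4-bit example
# (2 coarse bits + 3 fine bits with a one-bit overlap). Offset handling
# is an illustrative assumption, not the paper's exact implementation.
def dec_output(vin, coarse):
    """Combine a (possibly wrong) 2-bit coarse code with a 3-bit fine code."""
    ideal = int(vin * 16)                          # ideal 4-bit code, vin in [0, 1)
    fine = min(max(ideal - 4 * coarse + 2, 0), 7)  # fine range spans 2 coarse LSBs
    return 4 * coarse + fine - 2                   # one-bit-overlap addition

vin = 0.30                        # ideal output code: 4
print(dec_output(vin, coarse=0))  # correct coarse code (00) -> 4
print(dec_output(vin, coarse=1))  # wrong coarse code (01)   -> 4, error absorbed
```

Both calls return the same final code: a bounded coarse error falls inside the overlapped fine range and disappears in the addition.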
In the example, the SAR ADC has a slow VDAC settling, which results in a decision error: for the given VIN, 00 is created for the two MSB bits, which is not correct. The redundancy is implemented by one additional decision step. To determine the three fine bits, the VDAC decision level is first shifted back down by 0.5 LSB to its regular decision level, undoing the shift applied at the beginning to determine the coarse bits. A redundant decision is then made, and with two more decisions after this step, the 3 LSB bits are determined. By adding the two coarse bits and the three fine bits with a one-bit overlap, the coarse bit error can be eliminated. As long as the error amount is less than 0.5 LSB of the coarse resolution, the error can be eliminated. This relaxes the DAC settling, so the sampling and decision speed can be increased despite there being more conversion steps. A similar concept was implemented in [2, 7].

Fig. 5 presents the switching sequence for the 2 coarse and 1 redundancy decisions; the switching sequence for the 2 fine bits is the same as in the straightforward structure explained before. This structure is identical to the straightforward structure, and the redundancy is implemented only by changing the decision levels.

Fig. 5. Switching sequence with the error correction technique for 2 coarse and 1 redundancy bits in a 4-bit structure.

The capacitor DAC is virtually segmented into two sub-DACs. One of these sub-DACs decides the 2 coarse bits, and the other decides the 3 fine bits. After the input signal is sampled, all capacitors except the largest capacitor of the second sub-DAC are connected to VCM; 2C is connected to VREFT and implements the threshold offset (V-SHIFT) for digital error correction purposes. The MSB is decided in this step. According to the MSB bit value, the 4C capacitor is connected to VREFT or VREFB, and MSB-1 is decided.

As shown in Fig. 5, the CDAC switching operation in the redundant decision phase is different for each case. In this step, for the further bit decisions, the VDAC should be located at the center of the input range determined by the digital error correction algorithm. Moreover, 2C should go back to VCM for the following LSB decision steps. In the case that the MSB-1 bit is 0, these two conditions are satisfied by switching 2C back to VCM. As shown in Fig. 6(a), by switching 2C to VCM, the determined input range is doubled and the VDAC is located at the center of this range. The redundancy bit and the 2 other fine bits are then decided. The final bits are obtained by adding the 2 MSB and 3 LSB bits with a one-bit overlap. As shown in Fig. 6(a), the error in the MSB-1 bit is corrected safely in the redundancy decision step.

Fig. 6. Determined input ranges for different MSB and MSB-1 bits in a 4-bit digital error corrected SAR ADC.

In the case that the two MSB bits are 01 (Fig. 6(b)), if 2C is switched to VCM, the VDAC will not be at the center of the determined input range (it will be at the bottom of this range). On the other hand, for the further bit decisions, 2C should return to VCM. To solve this problem and satisfy both conditions, since the required VDAC change is double that produced by returning 2C from VREFT to VCM, 4C is divided into two 2C capacitors and one of them is switched from VREFB to VREFT, while 2C in the lower sub-DAC is switched back to VCM. This moves the VDAC up to the desired level. The remaining decisions are identical to the previous case.

Finally, the last case is where all the MSB bits are 1. In this case, the MSB capacitor is already connected to VREFT, so the same procedure as in the previous case cannot be followed. However, the MSB-1 bit can simply be treated as 0, and an operation similar to the one performed when this bit was 0 can be carried out. The details of this case can be found in Fig. 6(c). For similar concepts, the reader can refer to [8]. In [8], a problem occurs during the redundancy cycle if all the MSB bits of the higher sub-DAC become 1: in the redundant cycle, the MSB capacitor of the lower sub-DAC that is connected to VREFT should switch back to VCM and, at the same time, a capacitor of the same size in the higher sub-DAC that is connected to VREFB should be found and switched to VREFT. This is impossible, because all the bits decided by the higher sub-DAC are 1, so all the capacitors of this sub-DAC are already connected to VREFT. The switching sequence proposed in this paper solves this problem, as verified above; as shown in Fig. 5, the switching sequence for each case is illustrated clearly and all the cases have been considered.
4. 10-bit Prototype SAR ADC Using Digital Error Correction

The 10-bit SAR structure using the explained error correction technique is shown in Fig. 7. The capacitor weights are unchanged compared to the straightforward structure. The CDAC is virtually segmented into three sub-DACs, and each sub-DAC creates 4 bits: CDAC1, CDAC2 and CDAC3 create D11-D8, D7-D4 and D3-D0, respectively, in which D7 and D3 are decided in the redundant cycles.

Fig. 7. 10-bit error corrected SAR ADC structure and timing diagram.

The largest capacitor in CDAC2 is 32C (28C+4C), so each capacitor of CDAC1 is broken into two capacitors, one of which is 32C. This serves the redundancy in the case that D8 is 1 and the D11D10D9D8 bits are not 1111 (as explained in the previous section): in this case, one of the 32C capacitors connected to VREFB can be found and switched to VREFT in the redundant cycle. For a similar reason, the capacitors of CDAC2 are broken into two parts, one of which is 4C, since the largest capacitor in CDAC3 is 4C. The reader might be concerned about the number of switches; note that although the number of switches is increased, the size of these switches is decreased proportionally. Moreover, the unit capacitor size used in this structure is 29 fF, which is the minimum capacitor size allowed by the process. As the capacitor size is reduced, the speed can be increased, and the power consumption and area can be decreased. Fig. 8 presents the comparator structure.

Fig. 8. Comparator structure.
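The twelve raw decisions D11-D0 must then be merged into the ten output bits B9-B0. The sketch below is one plausible reading of that recombination, with the one-bit overlaps at bit positions 6 and 3 resolved by plain addition; the constant compensating the two V-SHIFTs depends on how the shifts are implemented, so it is left as a parameter, and the whole function should be read as an assumption rather than the paper's exact logic.

```python
# Hypothetical recombination of the raw sub-DAC codes into B9..B0.
# d_hi = D11..D8, d_mid = D7..D4, d_lo = D3..D0 (each 0..15); D7 and D3
# are the redundant decisions. The V-SHIFT compensation constant is an
# assumption and defaults to 0 here.
def combine(d_hi, d_mid, d_lo, v_shift_offset=0):
    raw = (d_hi << 6) + (d_mid << 3) + d_lo - v_shift_offset
    return max(0, min(raw, (1 << 10) - 1))  # clamp to the 10-bit output range

# The overlap bits simply add into the total at weights 2^6 and 2^3.
print(combine(0b0101, 0b1000, 0b0011))  # -> 387
```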
5. Simulation Results

This section presents the simulation results for the proposed ADC structure. Finesim was used for the simulations, and the process was 0.18-μm CMOS technology. Fig. 9 shows the ramp simulation result for the proposed ADC. The sampling rate was 1.25 MS/s, and the number of samples for this simulation was 1024.

Fig. 9. Ramp simulation result.

Fig. 10 shows the DC analysis results for this ADC, extracted from the ramp simulation. DNL max and DNL min were 0.998 and -1.0 LSB, respectively. INL max and INL min were 0.562 and -0.719 LSB, respectively, and sigma DNL and sigma INL for this structure were 0.155 and 0.294 LSB, respectively.

Fig. 10. DC analysis results.

Fig. 11 presents a sine wave simulation. The input frequency for this simulation was 1.22 kHz and the sampling clock was 1.25 MHz.

Fig. 11. Sine simulation result.

Fig. 12 presents the FFT of the output. The input was a 62 kHz sine wave sampled at 1.25 MHz. For this input frequency, the effective number of bits (ENOB) was 9.4, the SNDR and SNR were 58.3 and 60.2 dB, respectively, and the SFDR was 63.4 dB. This ADC drew approximately 80 μA from a 1.8 V source; therefore, the power consumption of this structure was about 140 μW.

Fig. 12. Spectrum of the output samples for the input 62 kHz at 1.25 MHz (SFDR = 63.4 dB, SNDR = 58.3 dB, ENOB = 9.4).

Fig. 13 illustrates the SNDR and SFDR for different input frequencies from DC to the Nyquist rate.

Fig. 13. SNDR and SFDR for different input frequencies from DC to the Nyquist rate.

FOM = Power / (2^ENOB × min{2 × ERBW, fs}) (1)

According to (1), the figure of merit (FOM) was calculated to be 140 fJ/conversion-step. Finally, Table 1 summarizes this work and compares it with recent SAR ADC studies.

Table 1. Summary and comparison results.

Parameters | [9] | [10] | This work
Process (μm) | 0.35 | 0.18 | 0.18
Resolution (bit) | 10 | 10 | 10
Sampling rate (MS/s) | 2 | 0.137 | 1.25
Supply voltage (V) | 3.3 | 1.5 | 1.8
ENOB (peak) | 8 | 8.65 | 9.57
SNDR peak (dB) | - | 53.8 | 59.4
SFDR peak (dB) | - | - | 65.2
Sigma DNL (LSB) | 0.5 | 0.56 | 0.155
Sigma INL (LSB) | 0.5 | 0.38 | 0.294
Power (mW) | 3 | 13.4 | 0.14
FOM (fJ/conv-step) | - | 243 | 140
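As a quick sanity check of Eq. (1), the helper below evaluates the FOM from the reported peak numbers. The ERBW handling and the rounding are assumptions for illustration; with the peak ENOB of 9.57, 140 μW and 1.25 MS/s the formula gives a value on the order of the 140 fJ/conversion-step quoted above.

```python
# FOM of Eq. (1): Power / (2^ENOB * min{2 * ERBW, fs}).
def fom(power_w, enob, fs, erbw=None):
    bw = fs if erbw is None else min(2 * erbw, fs)  # ERBW omitted -> use fs
    return power_w / (2 ** enob * bw)

# Peak numbers reported for this design (illustrative use of the formula).
print(fom(140e-6, 9.57, 1.25e6))  # -> ~1.5e-13 J/step
```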
6. Conclusion

This paper proposed a 10-bit SAR ADC in which digital error correction is applied to the straightforward structure. The ADC was implemented in 0.18-μm CMOS technology. The power consumption was 140 μW under a 1.8-V supply, and the structure achieved 59.4-dB SNDR, 61.3-dB SNR and 65.2-dB SFDR (peak values) at a 1.25-MS/s conversion speed. The unit capacitor in the CDAC structure was 29 fF. At this conversion speed, the ENOB was above 9 bits for all input frequencies from DC to the Nyquist rate. The figure of merit (FOM) was 140 fJ/conversion-step.

Acknowledgement

This study was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2013R1A1A2010114). This work was supported by IDEC (IPC, EDA Tool, MPW).

References

[1] W. Liu, P. Huang and Y. Chiu, "A 12-bit, 45-MS/s, 3-mW Redundant Successive-Approximation-Register Analog-to-Digital Converter with Digital Calibration," IEEE Journal of Solid-State Circuits, vol. 46, no. 11, Nov. 2011.
[2] S-H. Cho, C-K. Lee, J-K. Kwon and S-T. Ryu, "A 550-uW 10-b 40-MS/s SAR ADC with Multistep Addition-Only Digital Error Correction," IEEE Journal of Solid-State Circuits, vol. 46, no. 8, Aug. 2011.
[3] B. Kim, L. Yan, J. Yoo, N. Cho and H-J. Yoo, "An Energy-Efficient Dual Sampling SAR ADC with Reduced Capacitive DAC," IEEE International Symposium on Circuits and Systems (ISCAS), 2009.
[4] V. Hariprasath, J. Guerber, S-H. Lee and U-K. Moon, "Merged capacitor switching based SAR ADC with highest switching energy-efficiency," Electronics Letters, 2010.
[5] Y. Chen, S. Tsukamoto and T. Kuroda, "A 9b 100MS/s 1.46mW SAR ADC in 65nm CMOS," IEEE Asian Solid-State Circuits Conference, Nov. 2009.
[6] Y. Zhu, C-H. Chan, U-F. Chio, S-W. Sin, Seng-Pan U, R. P. Martins and F. Maloberti, "A 10-bit 100-MS/s Reference-Free SAR ADC in 90 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 45, no. 6, Jun. 2010.
[7] P. W. Li, M. J. Chin, P. R. Gray and R. Castello, "A Ratio-Independent Algorithmic Analog-to-Digital Conversion Technique," IEEE Journal of Solid-State Circuits, vol. SC-19, no. 6, Dec. 1984.
[8] S-H. Cho, C-K. Lee, J-K. Kwon and S-T. Ryu, "A 550μW 10b 40MS/s SAR ADC with Multistep Addition-only Digital Error Correction," IEEE Custom Integrated Circuits Conference (CICC), 2010.
[9] C. Jun, R. Feng and M. H. Xu, "IC Design of 2Ms/s 10-bit SAR ADC with Low Power," High Density Packaging and Microsystem Integration, 2007.
[10] H-K. Kim, Y-J. Min, Y-H. Kim and S-W. Kim, "A Low Power Consumption 10-bit Rail-to-Rail SAR ADC Using a C-2C Capacitor Array," IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC), 2008.

Behnam Samadpoor Rikan was born in Urmia, Iran, in 1983. He received his B.S. degree in Electronic Engineering from Urmia University, Urmia, Iran, in 2008 and his M.S. degree in Electronic Engineering from Qazvin Azad University, Qazvin, Iran, in 2013. He is currently working toward a Ph.D. degree in the School of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea. His research interests include the design of high-performance data converters, bandgap reference voltages, low-dropout regulators and ultra-low-power CMOS RF transceivers.

Hamed Abbasizadeh was born in Suq, Iran, in 1984. He received his B.S. degree in Electronic Engineering from Bushehr Azad University, Bushehr, Iran, in 2008 and his M.S. degree in Electronic Engineering from Qazvin Azad University, Qazvin, Iran, in 2012. He is currently working toward a Ph.D. degree in the School of Information and Communication Engineering, Sungkyunkwan University. His research interests include the design of high-performance data converters, bandgap reference voltages, low-dropout regulators and ultra-low-power CMOS RF transceivers.

Sung-Han Do was born in Jinju, Korea, in 1988. He received his B.S. degree from the Department of Semiconductor Systems Engineering at Sungkyunkwan University, Suwon, Korea, in 2014. He is currently working toward an M.S. degree in Semiconductor Display Engineering at Sungkyunkwan University. His research interests include analog-to-digital converter and sensor interface design.
Dong-Soo Lee was born in Korea, in 1987. He received his B.S. degree from the Department of Electronic Engineering at Konkuk University, Seoul, Korea, in 2012, and his M.S. degree in electronic engineering from Sungkyunkwan University, Suwon, Korea, in 2014. He is currently working toward a Ph.D. degree in electronic engineering at Sungkyunkwan University. His research interests include CMOS RF ICs, phase-locked loops, silicon oscillators, and sensor interface design.

Kang-Yoon Lee was born in Jeongup, Korea, in 1972. He received his B.S., M.S. and Ph.D. degrees in the School of Electrical Engineering from Seoul National University, Seoul, Korea, in 1996, 1998, and 2003, respectively. From 2003 to 2005, he was with GCT Semiconductor Inc., San Jose, CA, where he was a Manager of the Analog Division and worked on the design of CMOS frequency synthesizers for CDMA/PCS/PDC and single-chip CMOS RF chip sets for W-CDMA, WLAN, and PHS. From 2005 to 2011, he was with the Department of Electronics Engineering, Konkuk University as an Associate Professor. Since 2012, he has been with the School of Information and Communication Engineering, Sungkyunkwan University, where he is currently an Associate Professor. His research interests include the implementation of power integrated circuits, CMOS RF transceivers, analog integrated circuits, and analog/digital mixed-mode VLSI systems.

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015 http://dx.doi.org/10.5573/IEIESPC.2015.4.1.059

Design of 10-bit 10MS/s Time-Interleaved Flash-SAR ADC Using Sharable MDAC

Sung-Han Do, Seong-Jin Oh, Dong-Hyeon Seo, Juri Lee, and Kang-Yoon Lee
Sungkyunkwan University
{dsh0308, geniejazz, dhblackbox, juri, [email protected]}
* Corresponding Author:
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper

Abstract: This paper presents a 10-bit 10 MS/s Time-Interleaved Flash-SAR ADC with a shared multiplying DAC (MDAC). Using the shared MDAC, the total capacitance in the SAR ADC is decreased by 93.75%. The proposed ADC consumes 2.28 mW under a 1.2 V supply and achieves an ENOB of 9.679 bits. The ADC was implemented in 0.13 μm CMOS technology, and the chip area is 760 × 280 μm2.

Keywords: Time-Interleaved, Flash-SAR ADC, Multiplying DAC

1. Introduction

Among the many types of ADCs (analog-to-digital converters), the SAR ADC is widely used because of its low power consumption and simplicity. However, the exponential increase in capacitance with increasing resolution restricts its use in applications requiring high conversion speeds. On the other hand, the flash ADC, which has the highest conversion speed among ADC types, suffers from high power consumption and a large chip area; flash ADCs are therefore generally limited to resolutions below 8 bits. A pipeline ADC can be a good alternative for fast conversion in a small area, but because an op-amp-based MDAC is needed at each stage, it requires a relatively large current. Therefore, there has been research on time-interleaved ADCs, such as the Flash-SAR ADC [1, 2] and the pipelined-SAR ADC [3].

This paper presents a 10-bit 10 MS/s Time-Interleaved Flash-SAR ADC architecture that combines the fast speed of a flash ADC with the low power of a SAR ADC. In addition, a sharable multiplying DAC (MDAC) is used to remove the unnecessary increase in capacitance. The remainder of this paper is organized as follows.
Section 2 presents the architecture of the proposed Time-Interleaved Flash-SAR ADC. Section 3 presents a detailed description of the building blocks. Section 4 shows the experimental results, and the conclusions are reported in Section 5.

2. Time-Interleaved Flash-SAR ADC Architecture

Fig. 1. (a) Conceptual block diagram of the Flash-SAR ADC (5-bit example), (b) Operation of the Flash-SAR ADC, (c) Timing diagram of the Flash-SAR ADC.

Fig. 1(a) shows a conceptual block diagram of the Flash-SAR ADC. In this architecture, the front-end flash ADC receives an analog input signal (VIN) and decides the MSBs. Using these bits, the MDAC conducts a residue process to generate the input of the back-end SAR ADC, where the remaining LSBs are finally obtained. Fig. 1(b) shows the operation of the ADC. Note that because MDACOUT is amplified by a factor of 4 (in this example), the resolution of the back-end SAR ADC can be decreased. As shown in Fig. 1(c), the flash ADC needs only 1 clock cycle for its MSB conversion before the SAR ADC can start its operation. Therefore, if only one SAR ADC channel is used, the front-end flash ADC stays idle during the SAR ADC's successive-approximation period. Using a multi-channel SAR ADC with the time-interleaving method, the idle time of the flash ADC can be removed, and a new analog input (VIN) can be processed in the flash ADC at every clock cycle.

On the other hand, using a multi-channel SAR ADC makes the architecture bulky. Fig. 2(a) shows a 10-bit Time-Interleaved Flash-SAR ADC that combines a 4-bit flash ADC with a 6-channel 10-bit SAR ADC using a thermometer code. Although this architecture has the advantage of a faster conversion speed than the conventional SAR ADC, the total capacitance, which dominates the area of a SAR ADC, increases 6 times compared to the conventional 10-bit SAR ADC.

Fig. 2. (a) Block diagram of the Time-Interleaved Flash-SAR ADC architecture using a thermometer CDAC, (b) Block diagram of the proposed Time-Interleaved Flash-SAR ADC.

In contrast, the proposed 10-bit Time-Interleaved Flash-SAR ADC has a sharable MDAC between the 4-bit flash ADC and the 6-bit SAR ADCs. By implementing the sharable MDAC, the (2^4 - 1)-element thermometer CDAC in the fine SAR ADC can be removed, so the size of the ADC decreases dramatically. Table 1 lists the total capacitance of each ADC type. Compared to the conventional 10-bit Flash-SAR ADC using the thermometer CDAC [1, 2], the proposed Flash-SAR ADC reduces the total capacitance by a factor of 2^4 (16 times), which leads directly to a reduction of the total circuit size.

Table 1. Total Capacitance of each ADC Type (@ 10 bit).

Type of ADC | Total Capacitance
10-bit SAR ADC | 2^10 C
10-bit Flash-SAR ADC (4b flash + 6 × 10b SAR, thermometer CDAC) [1, 2] | 2^10 C × 6
10-bit Flash-SAR ADC (4b flash + 6 × 6b SAR, proposed) | 2^6 C × 6
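The Table 1 entries and the 93.75% figure quoted in the abstract can be checked directly, in units of the unit capacitor C:

```python
# Total CDAC capacitance of each architecture, in units of C (Table 1).
conventional_sar = 2 ** 10      # single 10-bit SAR ADC
thermo_flash_sar = 2 ** 10 * 6  # 6 channels of 10-bit SAR, thermometer CDAC [1, 2]
proposed = 2 ** 6 * 6           # 6 channels of 6-bit SAR sharing one MDAC (this work)
print(thermo_flash_sar / proposed)      # -> 16.0, the 2^4 reduction
print(1 - proposed / thermo_flash_sar)  # -> 0.9375, the 93.75% saving
```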
Moreover, to minimize the area and current consumption of the SAR ADC in each channel, a dual-sampling SAR ADC structure was adopted, and an Adaptive Power Control (APC) mode was added to the comparator of the SAR ADC.

Fig. 3. Timing diagram of the proposed 10-bit 10MS/s Time-Interleaved Flash-SAR ADC.

Fig. 3 shows the timing diagram of the proposed 10-bit 10 MS/s Time-Interleaved Flash-SAR ADC. At a sample rate of 10 MS/s, the duties of the clock and the Start-of-Conversion (SOC) signal are 50% and 25%, respectively, and the sampling frequency is 10 MHz. When the SOC signal is high, the sample-and-hold circuit in the ADC samples the analog input signal VIN. When the SOC signal is low, the sample-and-hold circuit holds the sampled signal VIN, and the flash ADC and MDAC process it. The front-end flash ADC decides the 4 MSB bits, and these bits control the MDAC for the residue process. After the residue process, the back-end 6-bit SAR ADC uses the output voltage of the MDAC as its input voltage. Note that while the SAR ADC starts its MSB decision process on the previous sample, the flash ADC samples the new input voltage simultaneously. In this architecture, a 7-clock-cycle latency exists; therefore, the output code of the flash ADC should hold its value for 6 clock cycles before being merged with the back-end SAR ADC output code. For this purpose, flip-flops are designed as an output merge block; these flip-flops are triggered by the EOC (end of conversion) signal of each channel.

3. Building Blocks

To reduce the chip area, the dual-sampling SAR ADC structure was implemented [4]. Fig. 4 shows a block diagram of the implemented SAR ADC. In general, a 6-bit SAR ADC requires 2^6 C for its binary search algorithm. In this architecture, however, using a single sampling capacitor on the opposite side of the CDAC, the total capacitance of the SAR ADC is reduced to almost half. Therefore, the area of the 6-channel SAR ADC can be decreased.

Fig. 4. Block diagram of the 6-bit SAR ADC.

In an ADC, the comparator is one of the most important blocks because it greatly affects the accuracy and power consumption. Fig. 5 presents a schematic diagram of the comparator [5]. It is a combined structure of a preamplifier stage and a dynamic latch stage; the pre-amplifier at the input stage reduces the effect of kickback noise. In addition, using the Adaptive Power Control (APC) function, unnecessary power consumption can be reduced.

Fig. 5. Schematic of the comparator in the SAR ADC.

In this paper, a sharable MDAC was implemented to reduce the total capacitance in the SAR ADC. Fig. 6(a) describes the residue process in the MDAC. After the flash ADC decides the 4-bit MSB codes, these codes are used in the MDAC: the coefficient k is decided from 0 to 15/16 and combined with the analog input VIN. Finally, by amplifying the residue by 2^4, the output of the MDAC is generated as follows:

MDACOUT = 2^4 × (VIN - k × VREF) (1)

Fig. 6. (a) Residue process in the MDAC, (b) Schematic diagram of the MDAC.

After MDACOUT is generated, it is delivered to the input of the 6-channel SAR ADC and used to determine the remaining 6 LSB bits.
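A behavioral sketch of this signal chain is given below under idealized assumptions: ideal flash thresholds, an exact gain of 2^4 in the residue of Eq. (1), a unit full scale, and an ideal 6-bit back-end quantizer. The function name and normalization are illustrative, not from the paper.

```python
# Behavioral model of the flash + MDAC + SAR chain around Eq. (1).
# Idealized sketch: all stages are assumed error-free.
def flash_sar(vin, vref=1.0):
    msb = min(int(vin / vref * 16), 15)         # 4-bit flash decides k = msb/16
    residue = 2 ** 4 * (vin - msb / 16 * vref)  # MDACOUT = 2^4 * (VIN - k * VREF)
    lsb = min(int(residue / vref * 64), 63)     # back-end 6-bit SAR on the residue
    return (msb << 6) | lsb                     # assemble the 10-bit code

print(flash_sar(0.3))  # -> 307, i.e. floor(0.3 * 1024)
```

With ideal stages the 4 flash bits and the 6 SAR bits simply concatenate into the 10-bit output code.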
4. Experimental Results

The proposed ADC was designed in a 0.13 μm CMOS process and occupies 760 × 280 μm2 (Fig. 7). The sizes of the flash ADC, MDAC and SAR ADC are 170 × 280 μm2, 180 × 280 μm2 and 370 × 280 μm2, respectively.

Fig. 7. Layout of the proposed architecture.

Fig. 8(a) shows the simulation results for the differential nonlinearity (DNL) and integral nonlinearity (INL). The peak DNL and INL are -1.000/0.998 LSB and -0.955/0.937 LSB, respectively. Fig. 8(b) shows the FFT spectrum at a 10-MS/s sampling rate: the SNDR is 60.025 dB and the resulting ENOB is 9.679 bits. At a 1.2 V supply voltage, the proposed ADC consumes 2.28 mW.

Fig. 8. (a) INL/DNL results, (b) SNDR/ENOB results.

The figure of merit of the ADC was calculated using the following equation:

FOM = Power / (2^ENOB × fs) (2)

The FOM of the proposed Time-Interleaved Flash-SAR ADC is 278.14 fJ/conversion-step.

Table 2. Performance Summary.

Process | 0.13 μm
Supply | 1.2 V
Resolution | 10 bits
Sampling rate | 10 MS/s
Power | 2.28 mW
DNL | -1.000 ~ 0.998 LSB
INL | -0.955 ~ 0.937 LSB
SNDR | 60.025 dB
ENOB | 9.679 bits
Core area | 0.21 mm2
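The reported FOM follows directly from the Table 2 numbers and Eq. (2):

```python
# FOM of Eq. (2) evaluated with the Table 2 numbers.
power, enob, fs = 2.28e-3, 9.679, 10e6
print(power / (2 ** enob * fs))  # -> ~2.78e-13 J, i.e. ~278 fJ/conversion-step
```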
5. Conclusion

A Time-Interleaved Flash-SAR ADC using a sharable MDAC was presented. The proposed architecture achieves a faster conversion speed than the conventional SAR ADC and a smaller total capacitance than other time-interleaved ADC architectures. The proposed ADC achieves a 10 MS/s operation speed with a power consumption of 2.28 mW. In addition, it has a FOM of 278.14 fJ/conversion-step and occupies an active area of only 0.21 mm2.

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2013R1A1A2010114). This work was supported by IDEC (IPC, EDA Tool, MPW).

References

[1] Ba Ro Saim Sung, Sang-Hyun Cho, Chang-Kyo Lee, Jong-In Kim and Seung-Tak Ryu, "A Time-Interleaved Flash-SAR Architecture for High Speed A/D Conversion," IEEE International Symposium on Circuits and Systems (ISCAS), 2009.
[2] Ba Ro Saim Sung, Chang-Kyo Lee, Wan Kim, Jong-In Kim, Hyeok-Ki Hong, Ghil-geun Oh, Choong-Hoon Lee, Michael Choi, Ho-Jin Park and Seung-Tak Ryu, "A 6 bit 2 GS/s Flash-Assisted Time-Interleaved (FATI) SAR ADC with Background Offset Calibration," IEEE Asian Solid-State Circuits Conference (ASSCC), 2013.
[3] Lu Sun, Yuxiao Lu and Tingting Mo, "A 300MHz 10b Time-Interleaved Pipelined-SAR ADC," CARFIC, 2013.
[4] Binhee Kim, Long Yan, Jerald Yoo, Namjun Cho and Hoi-Jun Yoo, "An Energy-Efficient Dual Sampling SAR ADC with Reduced Capacitive DAC," IEEE International Symposium on Circuits and Systems (ISCAS), 2009.
[5] Song Lan, Chao Yuan, Yvonne Y. H. Lam and Liter Siek, "An Ultra Low-Power Rail-to-Rail Comparator for ADC Designs," MWSCAS, 2011.

Sung-Han Do was born in Jinju, Korea, in 1988. He received his B.S. degree from the Department of Semiconductor Systems Engineering at Sungkyunkwan University, Suwon, Korea, in 2014. He is currently working toward an M.S. degree in Semiconductor Display Engineering at Sungkyunkwan University. His research interests include analog-to-digital converter and sensor interface design.

Seong-Jin Oh was born in Seoul, Korea. He received his B.S. degree from the Department of Electronic Engineering at Sungkyunkwan University, Suwon, Korea, in 2014, where he is currently working toward a combined M.S. and Ph.D. degree in the School of Information and Communication Engineering. His research interests include CMOS RF transceivers, phase-locked loops, and push-push voltage-controlled oscillators.

Dong-Hyeon Seo was born in Incheon, Korea, in 1988. He received his B.S. degree from the Department of Electronics Engineering at Incheon National University, Incheon, Korea, in 2014. He is currently working toward an M.S. degree in Electrical and Computer Engineering at Sungkyunkwan University. His research interests include analog-to-digital converters and wireless power transfer.

Juri Lee received her B.S. degree from the Department of Electronic Engineering at Konkuk University, Seoul, Korea, in 2013. She is currently working toward a combined M.S. and Ph.D. degree in the School of Information and Communication Engineering, Sungkyunkwan University. Her research interests include VCSEL drivers and CMOS RF transceivers.

Kang-Yoon Lee was born in Jeongup, Korea, in 1972. He received his B.S., M.S. and Ph.D. degrees in the School of Electrical Engineering from Seoul National University, Seoul, Korea, in 1996, 1998 and 2003, respectively. From 2003 to 2005, he was with GCT Semiconductor Inc., San Jose, CA, where he was a Manager of the Analog Division and worked on the design of CMOS frequency synthesizers for CDMA/PCS/PDC and single-chip CMOS RF chip sets for W-CDMA, WLAN, and PHS. From 2005 to 2011, he was with the Department of Electronics Engineering, Konkuk University as an Associate Professor. Since 2012, he has been with the School of Information and Communication Engineering, Sungkyunkwan University, where he is currently an Associate Professor. His research interests include the implementation of power integrated circuits, CMOS RF transceivers, analog integrated circuits, and analog/digital mixed-mode VLSI systems.