Object Tracking using Adaptive Template Matching

Transcription

IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.001
Object Tracking using Adaptive Template Matching
Wisarut Chantara, Ji-Hun Mun, Dong-Won Shin, and Yo-Sung Ho
Gwangju Institute of Science and Technology (GIST), 123 Cheomdan-gwagiro, Buk-gu, Gwangju 500-712, Korea
{wisarut, jhm, dongwonshin, hoyo}@gist.ac.kr
* Corresponding Author: Yo-Sung Ho
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
Abstract: Template matching is used for many applications in image processing. One of the most
researched topics is object tracking. Normalized Cross Correlation (NCC) is the basic statistical
approach to match images. NCC is used for template matching or pattern recognition. A template
can be considered from a reference image, and an image from a scene can be considered as a source
image. The objective is to establish the correspondence between the reference and source images.
The matching gives a measure of the degree of similarity between the image and the template. A
problem with NCC is its high computational cost and occasional mismatching. To deal with this
problem, this paper presents an algorithm based on the Sum of Squared Difference (SSD) and an
adaptive template matching to enhance the quality of the template matching in object tracking. The
SSD provides low computational cost, while the adaptive template matching increases the matching accuracy. The experimental results showed that the proposed algorithm is quite efficient for image
matching. The effectiveness of this method is demonstrated by several situations in the results
section.
Keywords: Template matching, Object tracking, Normalized cross correlation, Sum of squared difference,
Pattern recognition
1. Introduction
Nowadays, video surveillance systems are being
installed worldwide in many different sites, such as
airports, hospitals, banks, railway stations, and even at
homes. Surveillance cameras help a supervisor to oversee
many different areas in a single room and to quickly focus
on abnormal events taking place in the controlled space.
Intelligent video monitoring represents a fairly large research direction that is applied in different fields. On the other hand, several questions arise, such as how the robustness of object tracking under occlusion can be improved. When an occlusion occurs, some objects are partially or totally invisible. As shown in Fig. 1, this phenomenon makes it difficult to localize the real target object accurately.
2. Object Tracking
Object tracking in a real-time environment is a challenging task in many computer vision applications. Object tracking consists of estimating the trajectory of moving objects in the sequence of frames generated from a video [1].

Fig. 1. Shading of the real target by other objects.
Most tracking algorithms assume that the object motion
is smooth with no abrupt changes to simplify tracking [2].
For example, the traditional color histogram Mean Shift
(MS) algorithm only considers the object’s color statistical
information, and does not contain the object's spatial information. So when the object color is close to the background color, or the object's illumination changes, the traditional MS algorithm easily produces inaccurate object tracking or loses the object entirely. Therefore, Mao et al. [3]
proposed an object tracking approach that integrates two methods, histogram-based template matching and the mean shift procedure, to estimate the object location.
Tracking methods are also applied in traffic surveillance systems. Choi et al. [4] proposed a vehicle
tracking scheme using template matching based on both
the scene and vehicle characteristics, including background
information, local position and size of a moving vehicle.
Alternative object tracking can be found in Refs. [1, 5, 6].
On the other hand, these basic tracking algorithms perform poorly when another object occludes the real target. Automating object tracking on a computer is a difficult task. Therefore, to solve these kinds of problems, this paper proposes a 'template matching' object tracking method. Before explaining the algorithm,
some brief concepts of ‘template matching’ are introduced.
3. Template Matching
Template matching is the process of finding the
location of a sub image, called a template, inside an image.
A number of methods for finding identical image regions can be used.
This section discusses the template matching application
for matching a small image, which is a part of a large
image. Once a number of corresponding templates are
found, their centers can be used as the corresponding
control points to determine the matching parameters.
Template matching involves comparing a given template
with windows of the same size in an image and identifying
the window that is most similar to the template.
The basic template matching algorithm consists of
calculating at each position of the image under
examination a distortion function that measures the degree
of similarity between the template and image. The
minimum distortion or maximum correlation position is then taken to locate the template within the examined image.
Typical distortion measures are the Sum of Absolute
Differences (SAD) [7, 8] and the Sum of Squared
Differences (SSD) [9]. On the other hand, as far as
template matching is concerned, the Normalized Cross
Correlation (NCC) [10] is often adopted for similarity
measure. Essannoun et al. [11] proposed a fast frequency
algorithm to speed up the process of SAD matching. They
used an approach to approximate the SAD metric by a
cosine series, which could be expressed in correlation
terms. Furthermore, Hel-Or and Hel-Or [12] proposed a
fast template matching method based on accumulating the
distortion on the Walsh-Hadamard domain in the order of
the associated frequency using SSD.
Fig. 2. (a) Template image, (b) Source image, (c) Window sliding and matching.

Fig. 3. Matching procedure.

The traditional NCC needs to compute both a numerator and a denominator, which is very time-consuming. Lewis [7] employed the sum table scheme to reduce the computation in the denominator. Although the sum table scheme could reduce the computation of the denominator in NCC, it was still essential to simplify the computation involved in the numerator of NCC. Wei and Lai [13]
proposed a fast pattern matching algorithm based on NCC
criterion by combining adaptive multilevel partition with
the winner update scheme to achieve a very efficient
search. Alternative similarity measures can be found in
Refs. [14-17].
In Fig. 2, to identify a matching area, the method needs
to compare a template image against a source image by
sliding it. By sliding the template, the process can measure
the similarity between the template image and a region in
the source image. At each location, a metric is calculated
so it represents how similar the patch is to that particular
area of the scene image. The process then finds the
maximum or the minimum location in the resulting image
depending on the measurement.
3.1 Matching Methods
Fig. 3 shows the matching procedure. Here, u and v
represent a horizontal and vertical position in the kernel,
respectively, and x and y correspond to the position of the kernel in the scene image. The procedure can calculate a result pixel using
various matching methods. Eqs. (1)-(4) show the matching
methods.
1. Sum of Squared Difference (SSD)

R(x, y) = Σ_{u,v} [T(u, v) − I(x + u, y + v)]²    (1)

2. Normalized Sum of Squared Difference (NSSD)

R(x, y) = Σ_{u,v} [T(u, v) − I(x + u, y + v)]² / sqrt( Σ_{u,v} T(u, v)² · Σ_{u,v} I(x + u, y + v)² )    (2)

3. Cross Correlation (CC)

R(x, y) = Σ_{u,v} ( T(u, v) · I(x + u, y + v) )    (3)

4. Normalized Cross Correlation (NCC)

R(x, y) = Σ_{u,v} ( T(u, v) · I(x + u, y + v) ) / sqrt( Σ_{u,v} T(u, v)² · Σ_{u,v} I(x + u, y + v)² )    (4)

For SSD or NSSD, the process needs to find the minimum value in the resulting image. Otherwise, if using CC or NCC, the process needs to find the maximum value in the result.
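To make the four measures concrete, here is a minimal numpy sketch (not the authors' implementation; function and variable names are illustrative) that slides a template over a source image and evaluates the SSD of Eq. (1) or the NCC of Eq. (4) at every offset:

```python
import numpy as np

def match_template(source, template, method="ssd"):
    """Exhaustive sliding-window matching; returns a score map R over all offsets."""
    src = source.astype(np.float64)
    tpl = template.astype(np.float64)
    m, n = tpl.shape
    p, q = src.shape
    R = np.zeros((p - m + 1, q - n + 1))
    for y in range(R.shape[0]):
        for x in range(R.shape[1]):
            window = src[y:y + m, x:x + n]
            if method == "ssd":                      # Eq. (1): best match = minimum
                R[y, x] = np.sum((tpl - window) ** 2)
            elif method == "ncc":                    # Eq. (4): best match = maximum
                num = np.sum(tpl * window)
                den = np.sqrt(np.sum(tpl ** 2) * np.sum(window ** 2))
                R[y, x] = num / den if den > 0 else 0.0
    return R

# Usage: the best SSD match is the minimum of R, the best NCC match the maximum.
# R = match_template(source_img, template_img, "ssd")
# best_y, best_x = np.unravel_index(np.argmin(R), R.shape)
```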
This paper introduces object tracking using an adaptive template matching technique. In this technique, SSD is used to measure the similarity between a template image and the source image. The best object location is selected in the source image. Finally, the template image is updated adaptively.
4. Proposed Methods
The objective of the proposed method is to reduce the computational cost and increase the matching accuracy. As a similarity measure, the SSD is one of the most popular and widely used for several applications. NCC is more robust against illumination changes than SSD; nevertheless, NCC is more time-consuming than SSD.
The following three key steps are involved in implementing the proposed method.
・Detection of interesting object,
・Tracking of object from frame to frame,
・Updating of suitable template.
4.1 Template Matching
The SSD is a simple algorithm for measuring the similarity between the template image (T) and sub-images in the source image (I). It works by taking the squared difference between each pixel in T and the corresponding pixel in the sub-image used for the comparison in I. These squared differences are summed to create a simple metric of similarity. Assume a 2-D m×n template T(x, y) is matched within the source image I(x, y) of size p×q, where p > m and q > n. For each pixel location (x, y) in the image, the SSD distance is calculated as follows:

R(x, y) = Σ_{u,v} [T(u, v) − I(x + u, y + v)]²    (5)

The smaller the SSD distance measured at a particular location, the more similar the local sub-image is to the searched template. If the SSD distance is zero, the local sub-image is identical to the template. The minimum distance provides the corresponding object location (CL) in the source image. Other small distances and their locations (SLs), which are less than a threshold value, can also be found. These values are used in the next subsection.

4.2 The best object location
This subsection proposes a method for locating the position of the object of interest in an image (OL). The aim of this process is to identify the appropriate coordinates of the object in the source image.
Assumption: The previous location of the object (PL) is stored in a memory buffer.
Condition 1:
If the CL is the nearest to the PL, then
set OL = CL, and assign the location flag (FL) = 1.
Condition 2:
If one of the SLs is the nearest to the PL, then
set OL = the nearest SL, and assign FL = 0.
Condition 3:
If neither Condition 1 nor Condition 2 is satisfied, then
set OL = PL, and assign FL = 0.
Fig. 4 shows a flowchart of the method for finding the best object location following the above-mentioned conditions.
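The decision logic of Conditions 1-3 can be sketched as follows; the distance measure, the nearness threshold, and all names here are illustrative assumptions rather than the authors' exact criterion:

```python
import numpy as np

def best_object_location(CL, SLs, PL, near_thresh=20.0):
    """Pick the object location OL and the location flag FL from the SSD results.

    CL  : location of the minimum SSD distance (corresponding location)
    SLs : other small-distance locations below the SSD threshold
    PL  : previous object location stored in the buffer
    """
    def dist(a, b):
        return np.hypot(a[0] - b[0], a[1] - b[1])

    # Condition 1: the minimum-distance location is the one closest to the previous one.
    if dist(CL, PL) <= near_thresh and all(dist(CL, PL) <= dist(s, PL) for s in SLs):
        return CL, 1
    # Condition 2: one of the other small-distance locations is closest to PL.
    near_sls = [s for s in SLs if dist(s, PL) <= near_thresh]
    if near_sls:
        return min(near_sls, key=lambda s: dist(s, PL)), 0
    # Condition 3: neither matches; keep the previous location.
    return PL, 0
```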
4.3 Adaptive Template Matching
In this subsection, the FL parameter from section 4.2 is
considered to determine an appropriate template image.
Location flag (FL) = 0 means that:
・The OL is one of the SLs or the PL.
・Assign the previous template with the current template.
・Update the current template with the current object template.
Location flag (FL) = 1 means that:
・The OL is the CL from the SSD method.
・Assign the previous template with the current template.
・Update the current template with the original template.
In the result image:
・The green box is the PL position in the source image.
・The red box is the OL position in the source image.
The overall actions are described in more detail in Fig. 5, and a code sketch of the update follows.
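A minimal sketch of the template update driven by FL, assuming `original_template` is the template cropped from the first frame and OL is given in (row, column) order (both assumptions are illustrative):

```python
def update_templates(FL, current_template, original_template, source_img, OL, tpl_shape):
    """Return (previous_template, current_template) after one frame."""
    h, w = tpl_shape
    y, x = OL
    crop = source_img[y:y + h, x:x + w]          # patch at the chosen object location

    previous_template = current_template          # keep the last template in both cases
    if FL == 0:
        # OL came from the SLs or the PL: adapt to the current object appearance.
        current_template = crop
    else:
        # OL is the CL from the SSD method: fall back to the original template.
        current_template = original_template
    return previous_template, current_template
```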
Fig. 4. Algorithm to find the best object location.

Fig. 5. Algorithm to update the template.

Fig. 6. (a) Source image, (b) Template image.
5. Experiment Results
This section shows the advantage of the proposed
method to full track an interested object. Experiments were
performed to examine the matching accuracy. A template
image of 27x67 pixels (Fig. 4(b)) was used to match in a
source image the sequences of size of 352x288 pixels, as
shown in Fig. 4(a).
Figs. 7-9 illustrate the three conditions explained in section 4.2. Panels (a), (b), and (c) in each figure show the result of the SSD method, the position of the involved object, and the updated template, respectively.

Fig. 7. Condition 1 (a) SSD result, (b) Object location, (c) Update the current template with the original template.

Fig. 8. Condition 2 (a) SSD result with the SLs, (b) Object location shows the PL in a green box and the current location in a red box, (c) Update the current template with the current object template.

Fig. 9. Condition 3 (a) SSD result with the SLs, (b) Object location shows the current location with the PL in a green box, (c) Update the current template with the current object template.

To illustrate the performance of the proposed algorithm, the template matching algorithms NSSD and NCC were compared. The result is shown in Fig. 11. Fig. 10 shows the x and y object position curves on consecutive image frames. The proposed method outperforms the two conventional methods.

Fig. 10. Object positions on a consecutive frame sequence.
In addition, other experiment samples consisting of three source images and their templates were tested with the proposed algorithm. These samples contain different sizes and different illuminations. Figs. 12-14 show the outcomes.

Fig. 11. Experiment result (a) NSSD, (b) NCC, (c) Ground truth [18], (d) Proposed method; the results are captured at the 1st, 10th, 35th, 64th, 177th, and 250th frames, respectively.

Fig. 12. Result related to condition 3 (a) Template image, (b) Object tracking.

Fig. 13. Result related to condition 2 (a) Template image, (b) Object tracking.

Fig. 14. Result related to condition 1 (a) Template image, (b) Object tracking.
6. Conclusion
This paper proposed an object tracking method using an adaptive template matching algorithm. The adaptive template matching provides the proper object location in a source image. The SSD requires a small number of operations for the similarity measurement. Therefore, the proposed method reduces the computational cost and increases the accuracy. Based on the experimental results, the proposed method is more efficient than the conventional methods, such as NSSD and NCC. Furthermore, a comparison of the identified object in the source image confirmed that the proposed method outperforms the conventional methods.
Acknowledgement
This work was supported by the National Research
Foundation of Korea (NRF) grant funded by the Korean
government (MSIP) (No. 2013-067321)
References
[1] N. Prabhakar, V. Vaithiyananthan, A.P. Sharma, A.
Singh, and P. Singhal, “Object Tracking Using Frame
Differencing and Template Matching,” Research
Journal of Applied Sciences, Engineering and
Technology, pp. 5497-5501, Dec. 2012. Article
(CrossRef Link)
[2] A.Yilmaz, O.Javed, and M.Shah, “Object Tracking: A
Survey,” ACM Computing Surveys, vol. 38(4), Article
13, pp.1–45, Dec. 2006. Article (CrossRef Link)
[3] D.Mao, Y.Y. Cao, J.H. Xu, and K. Li, “Object
tracking integrating template matching and mean
shift algorithm,” Multimedia Technology (ICMT),
2011 International Conference on, pp. 3583-3586,
Jul. 2011. Article (CrossRef Link)
[4] J.H. Choi, K.H. Lee, K.C. Cha, J.S. Kwon, D.W. Kim,
and H.K. Song, “Vehicle Tracking using Template
Matching based on Feature Points,” Information
Reuse and Integration, 2006 IEEE International
Conference on, pp. 573-577, Sep. 2006. Article
(CrossRef Link)
[5] S. Sahani, G. Adhikari, and B. Das, “A fast template
matching algorithm for aerial object tracking,” Image
Information Processing (ICIIP), 2011 International
Conference on, pp. 1–6, Nov. 2011. Article
(CrossRef Link)
[6] H.T. Nguyen, M. Worring, and R.V.D. Boomgaad,
“Occlusion robust adaptive template tracking,”
Proceedings. Eighth IEEE International conference
on Computer Vision (ICCV), pp. 678-683, 2001.
Article (CrossRef Link)
[7] J. P. Lewis, “Fast template matching,” Vis. Inf., pp.
120-123, 1995. Article (CrossRef Link)
[8] F. Alsaade and Y.M. Fouda, “Template Matching
based on SAD and Pyramid,” International Journal
of Computer Science and Information Security
(IJCSIS), vol. 10 no.4, pp. 11-16, Apr. 2012. Article
(CrossRef Link)
[9] J. Shi and C. Tomasi, "Good features to track," Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 593-600, Jun. 1994. Article (CrossRef Link)
[10] W. K. Pratt, “Correlation techniques of image
registration,” IEEE Trans. On Aerospace and
Electronic Systems, vol. AES-10, pp. 353-358, May
1974. Article (CrossRef Link)
[11] F. Essannouni, R. Oulad Haj Thami, D. Aboutajdine,
and A. Salam, “Adjustable SAD matching algorithm
using frequency domain” Journal of Real-Time
Image Processing, vol. 1, no. 4, pp. 257-265, Jul.
2007. Article (CrossRef Link)
[12] Y. Hel-Or and H. Hel-Or, “Real-time pattern
matching using projection kernels,” IEEE Trans.
PAMI, vol. 27, no. 9, pp. 1430-1445, Sep. 2002.
Article (CrossRef Link)
[13] S. Wei and S. Lai, “Fast template matching based on
normalized cross correlation with adaptive multilevel
winner update” IEEE Trans. Image processing, vol.
17, No. 11, pp. 2227-2235, Nov. 2008. Article
(CrossRef Link)
[14] N.P. Papanikolopoulos, “Selection of Features and
Evaluation of Visual Measurements During Robotic
Visual Servoing Tasks,” Journal of Intelligent and
Robotic System, vol.13 (3), pp. 279-304, Jul. 1995.
Article (CrossRef Link)
[15] P. Anandan, “A computational framework and an
algorithm for the measurement of visual motion,”
International Journal of Computer Vision, vol. 2 (3),
pp. 283-310, Jan. 1989. Article (CrossRef Link)
[16] A. Singh and P. Allen, “Image flow computation: an
estimation-theoretic framework and a unified
perspective,” Computer Vision Graphics and Image
Processing: Image Understanding, vol. 56 (2), pp.
152-177, Sep. 1992. Article (CrossRef Link)
[17] G. Hager and P. Belhumeur, “Real-time tracking of
image regions with changes in geometry and
illumination,” Proceedings of IEEE Computer
Society Conference on Computer Vision Pattern
Recognition, pp. 403-410, Jun. 1996. Article
(CrossRef Link)
[18] Y. Wu, J. Lim, and M.H. Yang, “Online Object
Tracking: A Benchmark,” Computer Vision and
Pattern Recognition (CVPR), IEEE Conference on,
pp. 2411-2418, Jun. 2013. Article (CrossRef Link)
Wisarut Chantara received the B.S.
and M.S. degrees in Computer Engineering from Prince of Songkla
University (PSU), HatYai, Thailand in
2004 and 2008, respectively. From
2008 to 2013, he worked as a lecturer
at Department of Computer Engineering, PSU, Phuket Campus, Phuket,
Thailand. He is currently a Ph.D. student in the School of
Information and Communications at Gwangju Institute of
Science and Technology (GIST), Korea. His research
interests are object tracking, computer vision, light field
camera, digital signal, and image processing.
Ji-Hun Mun received his B.S. degree
in Electric and Electrical Engineering
from Chonbuk National University,
Korea, in 2013. He is currently a M.S.
student in the School of Information
and Communications at Gwangju
Institute of Science and Technology
(GIST), Korea. His research interests
include digital signal and image processing.
Dong-Won Shin received his B.S.
degree in Computer Engineering from
Kumoh National Institute of Technology, Korea, in 2013. He is currently
a M.S. student in the School of
Information and Communications at
Gwangju Institute of Science and
Technology (GIST), Korea. His
research interests include 3D image processing and depth
map generation.
Yo-Sung Ho received the B.S. and
M.S. degrees in Electronic Engineering
from Seoul National University, Seoul,
Korea, in 1981 and 1983, respectively,
and Ph.D. degree in electrical and
computer engineering from the
University of California, Santa Barbara,
in 1990. He joined the Electronics and
Telecommunications Research Institute (ETRI), Daejeon,
Korea, in 1983. From 1990 to 1993, he was with Philips
Laboratories, Briarcliff Manor, NY, where he was
involved in development of the advanced digital high-definition television system. In 1993, he rejoined the
Technical Staff of ETRI and was involved in development
of the Korea direct broadcast satellite digital television and
high-definition television systems. Since 1995, he has been
with the Gwangju Institute of Science and Technology,
Gwangju, Korea, where he is currently a Professor in the
Department of Information and Communications. His
research interests include digital image and video coding,
image analysis and image restoration; advanced coding
techniques, digital video and audio broadcasting, 3-D
television, and realistic broadcasting.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.010
Accelerating Self-Similarity-Based Image Super-Resolution Using OpenCL
Jae-Hee Jun1, Ji-Hoon Choi1, Dae-Yeol Lee2, Seyoon Jeong2, Suk-Hee Cho2, Hui-Yong Kim2, and Jong-Ok Kim1
1 School of Electrical Engineering, Korea University / Seoul, South Korea {indra, risty, jokim}@korea.ac.kr
2 Electronics and Telecommunications Research Institute / Dae-jeon, South Korea {daelee711, jsy, shee, hykim5}@etri.re.kr
* Corresponding Author: Jong-Ok Kim
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
Abstract: This paper proposes the parallel implementation of a self-similarity based image SR
(super-resolution) algorithm using OpenCL. The SR algorithm requires tremendous computations
to search for a similar patch. This becomes a bottleneck for the real-time conversion from an FHD
image to UHD. Therefore, it is imperative to accelerate the processing speed of SR algorithms. For
parallelization, the SR process is divided into several kernels, and memory optimization is
performed. In addition, two GPUs are used for further acceleration. The experimental results show that a GPGPU implementation can speed up the process by over 140 times
Furthermore, it was confirmed experimentally that utilizing two GPUs can speed up the execution
time proportionally, up to 277 times.
Keywords: Super-resolution, OpenCL, Parallelization, Multiple GPUs
1. Introduction
As the complexity of image processing algorithms
increases, their running time on the CPU has been
multiplied accordingly. This has led to an increase in the
interest in GPGPUs (general-purpose computing on
graphics processing units). GPGPU is a technique to utilize
a GPU device for general computational applications, which are typically handled by a CPU.
Although a GPU has a much lower clock frequency
than a CPU, it has a very large number of cores, and can
accelerate complex computations significantly by
parallelization. Researchers have recently examined the
acceleration of image processing algorithms using GPGPU
techniques [8, 9, 11, 12]. The GPGPU technology is
categorized into two different techniques. One is CUDA,
which was made public in 2007 by Nvidia, and the other is
OpenCL, which was released by the Khronos group in
2008 [7]. Some studies have compared CUDA with
OpenCL, such as [10]. OpenCL is an open, general-purpose parallel framework that is renowned for its good portability.
This paper proposes the parallel implementation of the
self-similarity based image SR (super resolution)
algorithm using OpenCL. The SR algorithm, which was
proposed previously exploits the inherent self-similarity of
an image [1, 2], but it requires tremendous computations to
search for a similar patch. In addition, the SR technique is
applied primarily to the resolution conversion from FHD
(Full HD) to UHD (Ultra HD) which has just begun to be
commercialized popularly. The amount of the UHD image
data is 4 times that of FHD. Therefore, a conventional software implementation makes it difficult to work in a real-time manner. Accordingly, accelerating the processing speed of
SR algorithms is indispensable.
For parallelization, the SR process is divided into
several kernels, and memory optimization is performed. In
addition, two GPUs are utilized for further acceleration.
The experimental results show that GPGPU
implementation can speed up the SR process significantly
compared to a single-core CPU.
The remainder of this paper is organized as follows.
Section 2 introduces the super resolution algorithm that
was implemented with OpenCL. Section 3 explains the OpenCL implementation in detail, particularly focusing on memory optimization and support for multiple GPU
devices. Section 4 presents the experimental results.
Section 5 concludes the paper.
Fig. 1. Tasks of the self-similarity based SR algorithm.
2. Super-Resolution Algorithm
The goal of SR is to reconstruct a HR (high resolution)
image from one or more observed LR (low resolution)
images. Recently, a number of SR methods have been
proposed to exploit the property of self-similarity in an
image.
In natural images, small patches tend to recur redundantly across scales as well as within the same scale. This
observation is referred to as self-similarity, and has been
exploited popularly for image processing, such as denoising, edge detection and retargeting [3]. In addition, it
has been applied to learning-based image SR in recent
years [6].
SR methods based on self-similarity can be classified
into two categories, depending on the domain to which the
self-similarity assumption is applied: the spatial domain and the
LF (low frequency)-HF (high frequency) domain.
Self-similarity with the LF-HF domain is almost equal
to the learning based method, such as [4]. Unlike the
traditional example-based method, however, self-similarity
based SR on the LF-HF domain does not require any prior
databases. Therefore, this method can reduce the
computational complexity and memory consumption,
compared to the conventional learning based approach.
The self-similarity-based SR method consists of three
parts overall, as follows: the generation of image pyramids
for both LF and HF components, HF reconstruction based on self-similarity, and back projection after combining the
LF and HF components. First of all, an input LR image is
decomposed into LF and HF components, which are given
by
LF_LR = H(I_LR)    (1)

HF_LR = I_LR − LF_LR    (2)
where H is a blurring operator and LF_LR is the blurred image of the input image. After decomposition, an HR image is initially obtained by simply interpolating the input image. This HR image can be considered the LF component of the target HR image, which is denoted by LF_HR. Scaling is increased gradually to the target scale by a factor of 1.25. An incremental coarse-grained method using a small scale factor was reported to produce a more elaborate result [5].
For the reconstruction of HF_HR, for every patch p_i in the image LF_HR, the most similar patch p_j is searched on the LF_LR domain. The corresponding HF_LR patch is then copied to the query patch. By merging the LF_HR and HF_HR images, an HR image I_HR can be reproduced as follows.

HF_HR(p_i) = HF_LR(p_j)    (3)

I_HR = LF_HR + HF_HR    (4)
In this way, hierarchical HF reconstructions in the
image pyramid are repeated until the target HR is achieved.
Fig. 1 presents the overall processes of the SR algorithm.
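The following numpy/scipy sketch illustrates one incremental step of Eqs. (1)-(4) under simplified assumptions (Gaussian blur for H, bilinear interpolation, and a small local search window); it is meant to clarify the data flow, not to reproduce the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def sr_step(I_LR, scale=1.25, patch=5, search=6):
    """One incremental self-similarity SR step following Eqs. (1)-(4)."""
    I_LR = I_LR.astype(np.float64)
    LF_LR = gaussian_filter(I_LR, sigma=1.0)          # Eq. (1): LF_LR = H(I_LR)
    HF_LR = I_LR - LF_LR                              # Eq. (2): HF_LR = I_LR - LF_LR
    LF_HR = zoom(I_LR, scale, order=1)                # initial HR = interpolated input
    HF_HR = np.zeros_like(LF_HR)
    r = patch // 2
    for y in range(r, LF_HR.shape[0] - r, patch):
        for x in range(r, LF_HR.shape[1] - r, patch):
            q = LF_HR[y - r:y + r + 1, x - r:x + r + 1]          # query patch p_i
            # search the most similar patch p_j in LF_LR around the co-located point
            cy, cx = int(y / scale), int(x / scale)
            best, best_pos = np.inf, None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    sy, sx = cy + dy, cx + dx
                    if r <= sy < LF_LR.shape[0] - r and r <= sx < LF_LR.shape[1] - r:
                        cand = LF_LR[sy - r:sy + r + 1, sx - r:sx + r + 1]
                        d = np.sum((q - cand) ** 2)
                        if d < best:
                            best, best_pos = d, (sy, sx)
            if best_pos is not None:                             # Eq. (3): copy HF patch
                sy, sx = best_pos
                HF_HR[y - r:y + r + 1, x - r:x + r + 1] = HF_LR[sy - r:sy + r + 1, sx - r:sx + r + 1]
    return LF_HR + HF_HR                                          # Eq. (4): I_HR = LF_HR + HF_HR
```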
3. Implementation Using OpenCL
This section presents a parallel implementation of the
SR algorithm. First, the SR process is divided into a couple
of kernels, and the problem of the race condition, which
actually occurs in searching similar patches, is then
addressed. Next, memory optimization is performed based
on the memory model. Finally, we deal with how to handle
multiple GPU devices for further acceleration.
A kernel is a unit of code that represents a single executing instance in a GPU. This single kernel instance corresponds to a work-item in OpenCL. When the work-items and work-groups are designed, both the GPU specification and the image resolution should be considered simultaneously. Here, 32×32 work-items are designated as a work-group. The kernels for the SR algorithm are designed as listed in Table 1, according to tasks and synchronization points. For further optimization of the OpenCL implementation, the individual kernel running times are measured using a performance profiler, as listed in Table 1.
The kernel used to find the best matching patch is the most time-consuming execution in the SR process. Hence, special optimization of this kernel is essential for better speed acceleration. This will be discussed in Section 3.2.
Table 1. Kernel design and computational time.

Procedure            Kernel name         Time (μs)
LF/HF subtract       Conv & Sub            6,237
Interpolation        imgScaler             3,611
Find similar patch   SSSR & Setzero      353,347
Patch mapping        Add & Div             2,897
LF/HF merge          Add                   1,179
Back Projection      Conv & Add & Sub      4,834
3.1 Race Condition
When the ‘Find similar patch’ kernel is running, two or
more work-items may reference the address value of a
single pixel at the same time. At this time, the race
condition occurs, which indicates the situation where multiple work-items simultaneously access a certain pixel declared in the global memory. To overcome this conflict, a serialization method is adopted using the atomic commands provided by OpenCL. Note that the use of the atomic command ensures the correctness of memory accesses by serialization.
Fig. 2 illustrates the occurrence of the race condition
and its solution.
Fig. 2. Race condition and its solution.

3.2 Memory Optimization
Next, the memory optimization step is described. OpenCL defines a hierarchical memory model, as shown in Fig. 3. Local memory is implemented on-chip, and it is much faster than global memory. In addition, OpenCL provides image memory that is specialized for accessing image data. Image memory requires fewer GPU cycles when data is read, so its access speed is much faster than global memory. In this paper, speed acceleration is optimized based on two memory models, namely the image buffer and local memory.
At the ‘Find similar patch’ kernel, similar patches are
searched in the surrounding local area of a target patch,
and all pixel data in the search area are loaded into the
local memory at one time. This can accelerate the memory
access speed better than global memory.
The local memory is divided into two memory banks.
To achieve high memory bandwidth (or high speed), all
work-items should be split evenly into each bank to reduce
the number of work-items per bank as much as possible. If
many work-items are assigned to a certain bank unevenly,
it takes more time due to serialization. This phenomenon is the so-called bank conflict, and it should be avoided when local memory is used.
Fig. 3. OpenCL memory model.
3.3 Support to Multiple GPUs
To utilize multiple GPU devices, a command queue
should be prepared for each GPU individually. On the
other hand, context can be shared among multiple
command queues or each queue can have its own context.
The former is a single context-based implementation. This
is a typical way to work with multiple devices. When
multiple devices use the same context, most of the
OpenCL objects are shared (e.g., kernel, program, and
memory objects). As mentioned above, one command
queue should be generated per GPU; hence, multiple
command queues are needed within the same context. This
approach is suitable for the algorithm that uses less
memory.
Another implementation approach is to define the
different contexts per device separately. This is commonly
used in computing on a heterogeneous device system (e.g.,
CPU and GPU). Data is handled independently among
multiple contexts, and system crashes can be avoided.
Fig. 4 illustrates the actual implementation of the
above-mentioned two approaches in the case of two GPUs.
Suppose that video frames are super-resolved on two
GPUs. As shown in Fig. 4(a), the image data is split between the two devices at the frame level. In other words, the odd frames are processed on one GPU, and the even frames are processed on the other. When a single context is adopted, a video frame is split equally into two segments, and each segment is processed independently on each GPU, as shown in Fig. 4(b).
Fig. 4. Two ways to utilize multiple devices in OpenCL: (a) multiple contexts, (b) single context.
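A minimal host-side sketch of the two approaches in Fig. 4, written here with pyopencl for brevity rather than the C/C++ host code actually used; the device selection and the frame-dispatch policy are illustrative assumptions:

```python
import pyopencl as cl

platform = cl.get_platforms()[0]
gpus = platform.get_devices(device_type=cl.device_type.GPU)[:2]  # assumes two GPUs

# (a) Multiple contexts: one context and one command queue per device.
#     Frames are dispatched alternately (odd frames to GPU 0, even frames to GPU 1).
contexts = [cl.Context(devices=[d]) for d in gpus]
queues_a = [cl.CommandQueue(ctx) for ctx in contexts]

def dispatch_frames(frames, process_frame):
    for i, frame in enumerate(frames):
        gpu = i % 2
        process_frame(queues_a[gpu], frame)   # enqueue the SR kernels for this frame

# (b) Single context shared by both devices, one command queue per device.
#     Each frame is split into two segments and each half is processed on one GPU;
#     the halves need extra border rows so the patch search near the split still works.
shared_ctx = cl.Context(devices=gpus)
queues_b = [cl.CommandQueue(shared_ctx, device=d) for d in gpus]
```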
4. Experimental Results
The experiments are performed on an Nvidia GeForce
GTX-780 device. The host machine is an Intel® Core™ i5-4590 CPU @ 3.30 GHz, 64-bit compatible. The OpenCL programs were developed using Visual Studio 2012 and OpenCL version 1.1. The test application was to super-resolve a 1920×1080 FHD image to 3840×2160 UHD using the self-similarity based SR
method in Section 2. The performance was measured by
the total execution time of all kernel functions in a GPU.
The execution time was measured by Nsight, a profiling
tool provided by Nvidia.
The parallel implementation in GPU speeds up the
running time, as confirmed in Table 2. In particular, if the
local memory based optimization is performed, the algorithm running speed can be accelerated further by approximately 2 times compared to the non-memory-optimized version. When each GPU uses its own context
separately, the execution speed can be made approximately
277 times faster than that of a single core CPU.
In the single context approach, an image is partitioned equally into two segments, and half of the image is processed on each GPU. When super-resolving pixels near the border line between the two image segments, the pixel data of the other image segment is needed to find a similar patch. Therefore, for the single context approach, more than half an image is actually transferred to each individual GPU. This leads to more transfer time between the CPU and GPU.

Table 2. Running time comparisons.

# of devices   Implementation method    Time (s)
-              CPU (1 core)             25.209
1              GPU                       0.388
1              GPU (image buffer)        0.273
1              GPU (local memory)        0.169
2              2 GPU (2 contexts)        0.091
2              2 GPU (1 context)         0.098
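As a quick arithmetic check, the speed-up factors quoted in the abstract follow directly from the Table 2 timings:

```python
cpu = 25.209                       # single-core CPU time (s)
one_gpu_local = 0.169              # 1 GPU with local-memory optimization
two_gpu_two_ctx = 0.091            # 2 GPUs, one context per device

print(cpu / one_gpu_local)         # ~149x, i.e. "more than 140 times"
print(cpu / two_gpu_two_ctx)       # ~277x with two GPUs
```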
5. Conclusion
This paper proposed a parallel implementation of a
complex image SR algorithm using OpenCL for speed
acceleration. The global memory synchronization problem caused by the race condition was solved by serialization with atomic commands. In the SR algorithm, finding a
similar patch was very time-consuming; its speed could be
accelerated using the local memory. In addition, two GPUs
were utilized for further acceleration. The experimental
results showed that GPGPU implementation can speed up
the process by more than 140 times compared to that using
a single-core CPU. Furthermore, it was confirmed
experimentally that the use of two GPUs can accelerate the
execution time proportionally, by up to 277 times.
Acknowledgement
This work was supported by the ICT R&D program of
MSIP/IITP. [13-912-02-002, Development of Cloud
Computing Based Realistic Media Production technology]
References
[1] S.J. Park, O.Y. Lee, and J.O. Kim, “Self-similarity
based image super-resolution on frequency domain,”
Proc. of APSIPA ASC 2013, pp. 1-4, Nov. 2013.
Article (CrossRef Link)
[2] J.H. Choi, S.J. Park, D.Y. Lee, S.C. Lim, J.S. Choi,
and J.O. Kim, “Adaptive self-similarity based image
super-resolution using non local means,” Proc. Of
APSIPA ASC 2014, Dec. 2014. Article (CrossRef
Link)
[3] A. Buades, B. Coll and J.M Morel, “A non-local
algorithm for image denoising”, IEEE Int. Conf. on
Computer Vision and Pattern Recognition, vol. 2, pp.
60 – 65, June. 2005. Article (CrossRef Link)
[4] W.T. Freeman, T.R. Jones, and E.C. Pasztor,
“Example-based super-resolution,” IEEE Computer
Graphics and Applications, vol. 22, pp. 56-65, Mar.
2002. Article (CrossRef Link)
[5] D. Glasner, S. Bagon, and M.Irani, “Super-resolution
from a single image,” 12th IEEE Int. Conf. on
Computer Vision, pp. 349 – 356, Sep. 2009. Article
(CrossRef Link)
[6] G. Freedman and R. Fattal, "Image and video
upscaling from local self-examples,” ACM Trans. on
Graphics, vol. 30, Article No. 12, Apr. 2011. Article
(CrossRef Link)
[7] Khronos OpenCL Working Group. The OpenCL
Specification Version 1.2. Khronos Group, 2012.
http://www.khronos.org/opencl
[8] J. Zhang, J. F. Nezan, and J.-G. Cousin,
“Implementation of motion estimation based on
heterogeneous parallel computing system with
OpenCL,” in Proc. IEEE Int. Conf. High
Performance Computing and Commun., 2012, pp.
41–45. Article (CrossRef Link)
[9] N.-M. Cheung, X. Fan, O. C. Au, and M.-C. Kung,
“Video coding on multicore graphics processors,”
IEEE Signal Processing Magazine, vol. 27, no. 2, pp.
79–89, 2010. Article (CrossRef Link)
[10] J. Fang, A. L. Varbanescu, and H. Sips, “A
Comprehensive Performance Comparison of CUDA
and OpenCL,” in Proceedings of the International
Conference on Parallel Processing, ICPP'11, Sep. 2011. Article (CrossRef Link)
[11] G. Wang, Y. Xiong, J. Yun, and J. R. Cavallaro, "Accelerating computer vision algorithms using OpenCL framework on the mobile GPU - a case study," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013. Article (CrossRef Link)
[12] J.H. Jun, J.H. Choi, D.Y. Lee, S.Y. Jeong, J.S. Choi, and J.O. Kim, "Implementation Acceleration of Self-Similarity Based Image Super Resolution Using OpenCL," in International Workshop on Advanced Image Technology (IWAIT) & International Forum on Medical Imaging in Asia (IFMIA), Jan. 2015.

Jae-Hee Jun received his B.S. degree in electronic engineering from Korea University, Seoul, Korea, in 2013. He is currently pursuing the M.S. degree in electronic engineering at Korea University, Seoul, Korea. His research interests are image signal processing and multimedia communications.
Ji-Hoon Choi received his B. S.
degree in electronic engineering from
Korea University, Seoul, Korea, in
2011, He is currently pursuing a Ph.D.
degree in electronic engineering at
Korea University, Seoul, Korea. His
research interests are image signal
processing and multimedia communications.
Dae-Yeol Lee received his BS and
MEng degrees in electrical and
computer engineering from Cornell
University, United States, in 2012 and
2013 respectively. In August 2013, he
joined Broadcasting and Telecommunications Media Research Department,
ETRI, Korea, where he is currently a
researcher. His research interests include image/video
processing, video coding and computer vision.
Seyoon Jeong received his BS and MS
degrees in electronics engineering
from Inha University, Korea, in 1995
and 1997 respectively, and received his PhD degree in electrical
engineering from Korea Advanced
Institute of Science and Technology
(KAIST) in 2014. He joined ETRI,
Daejeon, Rep. of Korea in 1996 as a senior member of the
research staff and is now a principal researcher. His
current research interests include video coding, video
transmission and UHDTV.
Sukhee Cho received her BS and MS
in computer science from Pukyong
National University, Busan, Rep. of
Korea, in 1993 and 1995, and her PhD
in electrical and computer engineering
from Yokohama National University,
Yokohama, Kanagawa, Japan, in 1999.
Since 1999, she has been with the
Broadcasting and Telecommunications Media Research
Department at ETRI, Daejeon, Rep. of Korea. Her current
research interests include realistic media processing
technologies, such as video coding and the systems of
3DTV and UHDTV.
Hui Yong Kim received his BS, MS,
and Ph.D degrees from Korea
Advanced Institute of Science and
Technology (KAIST), Korea, in 1994,
1998, and 2004, respectively. From
2003 to 2005, he was the leader of
Multimedia Research Team of AddPac
Technology Co. Ltd. In November
2005, he joined the Broadcasting and Telecommunications
Media Research Laboratory of Electronics and
Telecommunications Research Institute (ETRI), Korea,
and currently serves as the director of Visual Media
Research Section. From 2006 to 2010, he was also an
affiliate professor in University of Science and Technology
(UST), Korea. From September 2013 to August 2014, he
was a visiting scholar at Media Communications Lab of
University of Southern California (USC), USA. He has
made many contributions to the development of
international standards, such as MPEG Multimedia
Application Format (MAF) and JCT-VC High Efficiency
Video Coding (HEVC), as an active technology
contributor, editor, and Ad-hoc Group co-chair in many
areas. His current research interests include image and
video signal processing and compression for realistic
media applications, such as UHDTV and 3DTV.
Jong-Ok Kim (S’05-M’06) received
his B.S. and M.S. degrees in electronic
engineering from Korea University,
Seoul, Korea, in 1994 and 2000,
respectively, and Ph.D. degree in
information networking from Osaka
University, Osaka, Japan, in 2006.
From 1995 to 1998, he served as an
officer in the Korean Air Force. From 2000 to 2003, he
was with SK Telecom R&D Center and Mcubeworks Inc.
in Korea, where he was involved in research and
development on mobile multimedia systems. From 2006 to
2009, he was a researcher in ATR (Advanced
Telecommunication Research Institute International),
Kyoto, Japan. He joined Korea University, Seoul, Korea in
2009, and is currently an associate professor. His current
research interests are in the areas of image processing and
visual communications. Dr. Kim was a recipient of a
Japanese Government Scholarship during 2003-2006.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.016
Whole Frame Error Concealment with an Adaptive
PU-based Motion Vector Extrapolation for HEVC
Seounghwi Kim, Dongkyu Lee, and Seoung-Jun Oh
Department of Electronic Engineering, Kwangwoon University / Seoul, Korea {k2090535, dongkyu, sjoh}@media.kw.ac.kr
* Corresponding Author: Seounghwi Kim
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
Abstract: Most video services are transmitted over wireless networks. In such a network environment, video packets are likely to be lost during transmission. For this reason, numerous error concealment (EC) algorithms have been proposed to combat channel errors. On the other hand,
most existing algorithms cannot conceal the whole missing frame effectively. To resolve this
problem, this paper proposes a new Adaptive Prediction Unit-based Motion Vector Extrapolation
(APMVE) algorithm to restore the entire missing frame encoded by High Efficiency Video Coding
(HEVC). In each missing HEVC frame, it uses the prediction unit (PU) information of the previous
frame to adaptively decide the size of a basic unit for error concealment and to provide a more
accurate estimation for the motion vector in that basic unit than can be achieved by any other
conventional method. The simulation results showed that it is highly effective and significantly
outperforms other existing frame recovery methods in terms of both objective and subjective
quality.
Keywords: Wireless networks, Error concealment, Motion vector extrapolation, HEVC, Prediction unit
1. Introduction
Many users are provided with video services over wireless networks. As the size of video data increases gradually, high compression is essential to transfer it over communication channels. Because highly compressed video data is more sensitive to channel errors than non-compressed video data, errors occurring in wireless transmission will degrade the entire video quality significantly. To reduce this effect, many error
concealment algorithms have been proposed in various
ways.
The HEVC standard is the newest joint video project of
the ITU-T Video Coding Expert Group (VCEG) and the
ISO/IEC Moving Picture Experts Group (MPEG)
standardization organizations, working together in a partnership
known as the Joint Collaborative Team on Video Coding
(JCT-VC) [1]. The main goal of HEVC is a bit-rate
reduction of 50% with similar video quality to H.264/AVC.
With the considerably higher compression efficiency of HEVC, packet loss will potentially have a larger impact on the users of portable devices, such as
tablets and smartphones. In HEVC, each slice is encoded
in a single network abstraction layer (NAL) unit. Each
NAL unit of the HEVC stream is encapsulated in a single
real-time transport protocol (RTP) packet using the RTP
payload format for HEVC organized by Schierl et al. [2].
Multiple packet loss due to channel errors is more likely to occur than single packet loss in wireless networks. In this situation, multiple packet loss results in the loss of an entire frame. Some conventional block-based concealment algorithms cannot handle whole frame concealment satisfactorily [3-5]. Therefore, it is essential to conceal the
entire frame error.
Error concealment (EC) is an essential process to
recover channel errors and several EC algorithms have
been proposed. Some algorithms proposed a spatial EC
using interpolation [3, 6-10], others exploited a temporal
error concealment using MVE [11-15]. Spatial EC
algorithms are inappropriate for concealing a whole frame if
the channel errors are spread to the entire frame. Although
conventional MVE methods are suitable to conceal the
entire frame, these methods have a drawback in their
computational complexity or subjective quality.
Unlike existing video coding standards, such as
H.264/AVC, one of the characteristics of HEVC is the
more flexible block partitioning. The block structure of
HEVC supports coding tree units (CTUs); a CTU is split into a quad-tree structure with four square coding units
(CUs). Within the CU level, a prediction unit (PU) and a
transform unit (TU) are composed to predict and transform,
respectively. The PU decides whether a corresponding CU is coded in intra or inter prediction mode. In inter-prediction, motion estimation and compensation are performed for a PU, meaning that the same object in a
reference frame is likely to exist within the current PU.
This paper proposes an Adaptive PU-based Motion
Vector Extrapolation (APMVE) method to increase the EC
performance. The proposed approach consists of two
stages: 1) Error Concealment Basic Unit (ECBU) decision,
and 2) ECBU-based motion vector extrapolation for error
concealment. In the first stage, an MV of every PU in the
correctly received previous frame is extrapolated to a lost
frame. To decide the ECBU size of the lost CTU, the
relationship between the extrapolated PUs and the lost
CTU is then considered. After the ECBU decision stage,
the APMVE method is repeated to estimate the best MVs
of the ECBU blocks in the lost CTU. Finally, the MV of a non-overlapped block (NOB) is duplicated from the MV of a co-located PU in the correctly received frame.
The remainder of this paper is organized as follows.
Section 2 briefly introduces the related works. Section 3
describes the proposed APMVE algorithm. Finally, the
simulation results and conclusion are reported in sections 4
and 5, respectively.
Fig. 1. Hybrid MV extrapolation (HMVE).
2. Related Work
Some EC algorithms have been proposed to recover the
entire frame loss during transmission. Chen et al. proposed
a Pixel-based MVE (PMVE) method to recover a whole
missing frame [15]. For each pixel in the lost frame,
Chen’s algorithm extrapolates an MV from the MVs of the
previous reconstructed frame and the next frame. First, the
pixels in the missing frame can be categorized into two
parts: 1) For a pixel that is overlapped by at least an
extrapolated macroblock (MB), its MV is estimated by
averaging the MVs of all overlapped MBs. 2) For a pixel
that is not covered by any of the extrapolated MBs, its MV
is duplicated from the MV of the same pixel in the
previous frame. On the other hand, the drawback of this
algorithm is that the computational complexity is high
because of the pixel-based MV determination.
Yan et al. proposed a Hybrid Motion Vector
Extrapolation (HMVE) method based on PMVE, which
uses the extrapolated MVs of the pixels and blocks [11].
This method first classifies the pixels in the missing frame
into three parts: A, B, and C, as shown in Fig. 1. After
unreliable MV candidates for each pixel are rejected by a threshold, a final MV is then determined.
The abovementioned methods were implemented in
H.263 and H.264/AVC, respectively. Alternatively, Lin et
al. proposed a Motion Vector Extrapolation with Partition
(MVEP) information that considers the partition decision
information from the previous frame in HEVC [12]. The
extrapolated blocks (EBs) were selected to a PU size, because this can to some extent indicate the object information. To obtain an MV in a missing frame, Lin's algorithm looks for both a larger overlapped area and a larger partition size to allocate higher priority to the MVs associated with a larger object.

3. The Proposed Algorithm
Although the MVEP algorithm takes advantage of the
characteristics in HEVC, it still needs to overcome a
drawback; the MVEP algorithm can degrade the subjective
quality of a concealment frame because its error
concealment basic unit (ECBU) is CTU, which is too
rough to present a motion of video contents. An APMVE
scheme, which uses the PU information of the previous
frame for both the EB and ECBU decision, is proposed to
overcome this drawback. This proposed algorithm can
decide the ECBU of the lost CTU using the relationship
between the extrapolated PUs and the lost CTU in the
corrupted frame and assign an MV to each ECBU within
the CTU.
3.1 ECBU Decision
In this algorithm, a PU size is exploited as an ECBU
size. The PU is suitable for an EB because the same object
likely exists in a PU that contains its motion information.
To decide an ECBU, MVE is performed for each PU of
a previous frame. Fig. 2 gives an example of the PU-based
MVE. The shaded CTU is being recovered. PU Ni −1 denotes
the ith PU in frame N–1, where i is 1 to K, and K is the total
number of PUs in that frame. MV ( PU Ni −1 ) denotes the
MV of PU Ni −1 . The extrapolated block of PU Ni −1 and its
MV
are
denoted
by
EPU Ni
and
MV ( EPU Ni −1 ) ,
respectively. MV ( EPU Ni −1 ) can be obtained using Eq. (2).
18
Kim et al.: Whole Frame Error Concealment with an Adaptive PU-based Motion Vector Extrapolation for HEVC
MV(PUNa−1)
MV(PUNb −1)
MV(PUNc −1)
PUNa−1 MV(EPUa )
N
MV(EPUNb )
PUNb −1
PUNc −1
EPUNb
EPUNc
EPU Na
EPUNc
frame N
frame N
Fig. 2. Example of the PU-based MVE.
⎧⎪1 p ∈ EPU Ni
f Ni ( p) = ⎨
,
i
⎪⎩0 p ∉ EPU N
2
ECBU 1N ECBU N
ECBU N3 ECBU N4
(4)
p is a pixel in the ECBU Nj . The best MV ( ECBU Nj ) is
then obtained by selecting the MV of the EPU with the
maximum weight wNi . This is obtained as follows:
frame N
frame N
frame N
Fig. 4. Example of the MV estimation for the ECBU Nj .
b
N
EPU Nc
MV(ECBUN4 )
MV(ECBUN3 )
EPUNb
MV(EPUNc )
frame N -1
frame N -2
EPU
MV(ECBU1,2
N )
EPUNa
EPUNa
MV ( ECBU Nj ) = MV ( EPU Ni* ) ,
(5)
i* = arg max{wNi } ,
(6)
Fig. 3. Example of the ECBU decision in the lost CUT.
where
MV ( EPU Ni ) = MV ( PU Ni −1 ) ,
(2)
where i is 1 to K. Some EPUs can be overlapped with the
lost CTU. The MVs in the previous frame can be extended
linearly to the current frame in MVE. Therefore, the lost
CTU can be overlapped with various sized EPUs. In this
paper, one of the overlapped EPUs is used to select the
ECBU size of the lost CTU. To achieve the best quality,
the smallest EPU size is selected as the ECBU size. On the
other hand, if the ECBU size is very small, it is possible to
cause block-artifacts in the concealed frame. The smallest
ECBU size was restricted to M×M to overcome this
problem. In other words, if the ECBU size is smaller than
M×M, it is changed to M×M. Fig. 3 gives an example of
the ECBU decision in the lost CTU. The lost CTU is
divided into the size of EPU Na . Each block is denoted by
ECBU Nj , where j is 1 to L and L is the total number of
j
N
ECBU .
3.2 ECBU-based Motion Vector
Extrapolation for Error Concealment
To decide the MV of ECBU Nj , which is defined by
MV ( ECBU Nj ) , we take advantage of the degree of overlap
(the number of overlapped pixels) wNi between EPU Ni and
ECBU Nj . wNi is given by
wNi =
∑
p∈ECBU Nj
where
f Ni ( p), i = 1, ", K and j = 1, ", L ,
(3)
Fig. 4 gives an example of the MV ( ECBU Nj )
estimation. Because EPU Na are overlapped with ECBU 1,2
N
a
the most, MV ( ECBU 1,2
N ) are set to MV ( EPU N ) . In the
same manner, MV ( ECBU N3 ) and MV ( ECBU N4 ) are
applied to MV ( EPU Nb ) and MV ( EPU Nc ) , respectively.
If ECBU Nj has not been overlapped by any EPUs, the
MV of this block is duplicated from an MV of a co-located
PU in the corrected received frame.
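To clarify Eqs. (3)-(6), the following simplified Python sketch assigns an MV to every ECBU of a lost CTU from the extrapolated PUs; PUs are reduced to axis-aligned rectangles and all names are illustrative, not taken from the HM reference software:

```python
import numpy as np

def rect_overlap(a, b):
    """Number of overlapped pixels between two rectangles (x, y, w, h); Eqs. (3)-(4)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ox = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    oy = max(0, min(ay + ah, by + bh) - max(ay, by))
    return ox * oy

def conceal_ctu(ecbus, epus, colocated_mv):
    """Assign an MV to every ECBU of a lost CTU.

    ecbus        : list of ECBU rectangles inside the lost CTU
    epus         : list of (rect, mv) for PUs extrapolated from frame N-1 (Eq. (2))
    colocated_mv : fallback MV of the co-located PU in the correctly received frame
    """
    mvs = []
    for ecbu in ecbus:
        weights = [rect_overlap(ecbu, rect) for rect, _ in epus]   # w_N^i
        if max(weights, default=0) > 0:
            i_star = int(np.argmax(weights))                        # Eq. (6)
            mvs.append(epus[i_star][1])                             # Eq. (5)
        else:
            mvs.append(colocated_mv)   # non-overlapped block: duplicate co-located MV
    return mvs
```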
4. Simulation Results
The proposed APMVE method was simulated using the HEVC test model (HM) 11.0 and evaluated under the low-delay P configuration under a common test condition with some changes: the number of reference frames was 1 and asymmetric motion partitioning (AMP) was off. Two
standard video sequences “BasketballDrill” and “BQMall”
were used to evaluate the performance of the proposed
algorithm. The frame rates were 50 frames/s and 60 frames/s, respectively. The frame size of both "BasketballDrill" and "BQMall" is 832×480, and the quantization parameter (QP) was 32. In addition, the intra mode in P frames was disabled when evaluating the accuracy of the results of the APMVE method.
To decide the minimum ECBU size of the proposed
method, an experiment was performed to select the optimal
size among 4×4, 8×8 and 16×16. Fig. 5 shows the average
PSNR values between the original sequence and the
concealed sequence according to the minimum ECBU sizes for the "BQMall" sequence. Whole frame loss occurs in every 15 frames in the original sequence. MN denotes the minimum ECBU size of N. Although M16 produces performance similar to both M4 and M8 in the "BasketballDrill" sequence, M16 produces better performance than M4 and M8 in the "BQMall" sequence, as shown in Fig. 5, while providing similar results in the other sequences. Therefore, M16 is used as the minimum ECBU size.

Fig. 5. PSNR value for the "BQMall" sequence versus ECBU size.

Fig. 6. PSNR quality of the EC algorithms for the "BasketballDrill" (BBD) and "BQMall" sequences.
To explain the effectiveness of the proposed method,
the performance of the proposed APMVE algorithm was
compared with that of the MVEP, frame copy (FC) and
motion compensation (MC). FC means copying the
previous frame for concealment. MC means that original
MVs are received correctly, but residual information has
been lost. As it is much more difficult to estimate the MVs for whole-frame concealment than for block concealment, MC could be regarded as an upper bound. Fig. 6 shows the
simulation results with PSNR for different test sequences.
The proposed APMVE algorithm significantly outperformed FC and MVEP.
Figs. 7 and 8 show the results of one frame extracted
from different sequences, where (a) is the error-free frame,
and (b) to (d) are the images reconstructed using MC,
MVEP and the proposed APMVE algorithm, respectively.
Owing to the PU-based ECBU decision, the proposed
APMVE algorithm can show good subjective quality,
particularly around the edges of the image objects, as highlighted by the solid circles. In the "BasketballDrill" sequence, the APMVE method conceals the basketball well, whereas MVEP generates a distorted sequence, resulting from an excessively rough ECBU size of the CTU.

Fig. 7. Restored 130th frame of the "BasketballDrill" sequence with QP of 32 (a) error-free frame, (b) MC frame, (c) concealed with MVEP, (d) concealed with APMVE.
Fig. 8. Restored 34th frame of “BQMall” sequence with QP of 32 (a) error-free frame, (b) MC frame, (c) concealed with MVEP, (d) concealed with APMVE.
Although a PU represents an object’s shape, the APMVE method can estimate a wrong MV, as shown by the dotted circles. This results from using a simple criterion, the degree of overlap, in the MV estimation, which is sensitive to outliers. This limitation can be compensated for by other methods that focus on how to find the best MV [11, 14]. In addition, the MV of a non-overlapped block duplicated from a co-located MV can be improper. This block can be compensated for by a post-process, such as a boundary matching algorithm or a spatial-based error concealment algorithm.
5. Conclusion
This paper proposes a new EC algorithm, the APMVE method, to conceal a whole missing frame encoded by HEVC. The PU information was used to adaptively decide the ECBU under the assumption that the same object is likely to exist within a PU. The MV of every PU in the correctly received previous frame is extrapolated to the lost frame. The temporal correlation between the extrapolated PUs and the lost CTU was used to decide the ECBU. After the ECBU decision, the best MV of each ECBU in the lost CTU was estimated using the degree of overlap and used for the lost frame concealment. The simulation results showed that the APMVE method provides better objective and subjective quality than the frame copy and MVEP methods, particularly around the edges of the image objects.
Acknowledgement
The present research was conducted with the support of the Research Grant of Kwangwoon University in 2014.
References
[1] High Efficiency Video Coding (HEVC), document
Rec. ITU-T H.265 and ISO/IEC 23008-2, Jan. 2013.
Article (CrossRef Link)
[2] T. Schierl, S. Wenger, Y.-K. Wang, and M. M. Hannuksela, “RTP payload format for high efficiency video coding,” IETF Internet Draft, Sept. 2013. Article (CrossRef Link)
[3] J. W. Park and S. U. Lee, “Recovery of corrupted image data based on the NURBS interpolation,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 10, pp. 1003-1008, Oct. 1999. Article (CrossRef Link)
[4] W.-M. Lam, A. R. Reibman, and B. Liu, “Recovery of lost or erroneously received motion vectors,” Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. 417-420, Apr. 1993. Article (CrossRef Link)
[5] C. Cai and J. Zhang, “Improved error concealment method with weighted boundary matching algorithm,” ICMP, pp. 384-386, Jul. 2011. Article (CrossRef Link)
[6] P. Salama, N. B. Shroff, E. J. Coyle, and E. J. Delp, “Error concealment techniques for encoded video streams,” in Proc. IEEE Int. Conf. Image Processing, 1995, pp. 9-12. Article (CrossRef Link)
[7] J. W. Park, J. W. Kim, and S. U. Lee, “DCT
coefficients recovery based error concealment
technique and its application to the MPEG-2 bit
stream error,” IEEE Trans. Circuits Syst. Video
Technol., vol. 7, no. 12, pp. 845-854, Dec. 1997. Article (CrossRef Link)
[8] H. Sun and W. Kwok, “Concealment of damaged block transform coded images using projections onto convex sets,” IEEE Trans. Image Process., vol. 4, no. 4, pp. 470-477, Apr. 1995. Article (CrossRef Link)
[9] J. Baek, P. S. Fisher, and M. Jo, “An enhancement of mSCTP handover with an adaptive primary path switching scheme,” in Proc. of IEEE VTC Fall, pp. 1-5, Sep. 2010. Article (CrossRef Link)
[10] H. Gharavi and S. Gao, “Spatial interpolation algorithm for error concealment,” in Proc. IEEE ICASSP, pp. 1153-1156, Apr. 2008. Article (CrossRef Link)
[11] B. Yan and H. Gharavi, “A hybrid frame concealment algorithm for H.264/AVC,” IEEE Transactions on Image Processing, vol. 19, pp. 98-107, Jan. 2010. Article (CrossRef Link)
[12] T. L. Lin, N. C. Yang, R. H. Syu, C. C. Liao, and W. L. Tsai, “Error concealment algorithm for HEVC coded video using block partition decisions,” IEEE ICSPCC, pp. 1-5, 2013. Article (CrossRef Link)
[13] Q. Peng, T. Yang, and C. Zhu, “Block-based temporal error concealment for video packet using motion vector extrapolation,” in Proc. IEEE Int. Conf. Communications, Circuits and Systems and West Sino Expositions, vol. 1, pp. 10-14, Jun. 2002. Article (CrossRef Link)
[14] Y. Zhang, H. Cui, and K. Tang, “Whole frame error concealment with refined motion extrapolation and prescription at encoder for H.264/AVC,” Tsinghua Science and Technology, vol. 14, pp. 691-697, Dec. 2009. Article (CrossRef Link)
[15] Y. Chen, K. Yu, J. Li, and S. Li, “An error concealment algorithm for entire frame loss in video transmission,” presented at the IEEE Picture Coding Symp., 2004. Article (CrossRef Link)
Seounghwi Kim was born in Seoul,
Korea. He received his B.S. degree in
Electronic Engineering from Kwangwoon University, Seoul, Korea, in
2014. He is a Master student at the
Kwangwoon University. His research
interests are video processing, video
compression, and video coding.
Dongkyu Lee was born in Seoul,
Korea. He received his B.S. and M.S.
degrees in Electronic Engineering
from Kwangwoon University, Seoul,
Korea, in 2012 and 2014, respectively.
He is a PhD student at the Kwangwoon University. His research interests
are image and video processing, video
compression, and video coding.
Seoung-Jun Oh was born in Seoul,
Korea, in 1957. He received both the
B.S. and the M.S. degrees in
Electronic Engineering from Seoul
National University, Seoul, in 1980
and 1982, respectively, and the Ph.D.
degree in Electrical and Computer
Engineering from Syracuse University,
New York, in 1988. In 1988, he joined ETRI, Daejeon,
Korea, as a Senior Research Member. Later he was
promoted to a Program of Multimedia Research Session in
ETRI. Since 1992, he has been a Professor of Department
of Electronic Engineering at Kwangwoon University,
Seoul, Korea. He has been a Chairman of SC29-Korea
since 2001. His research interests include image and video
processing, video compression, and video coding.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.022
22
IEIE Transactions on Smart Processing and Computing
Analysis of the JND-Suppression Effect in Quantization
Perspective for HEVC-based Perceptual Video Coding
Jaeil Kim and Munchurl Kim
Department of Information and Communications, KAIST, South Korea {jaeil1203, mkimee}@kaist.ac.kr
* Corresponding Author: Jaeil Kim, Munchurl Kim
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
Abstract: Transform-domain JND (Just Noticeable Difference)-based suppression for PVC (Perceptual Video Coding) is often performed in the quantization process to effectively remove perceptual redundancy. This study examined the JND-suppression effects on the quantized transform coefficients in HEVC (High Efficiency Video Coding). To reveal the JND-suppression effect in quantization, the properties of the floor function were used to model the quantized coefficients, and a JND-adjustment process in an HEVC-compliant PVC scheme was used to tune the JND values by analyzing the JND-suppression effect. In the experimental results, the bitrate reduction decreases slightly, but the PSNR and perceptual quality are improved significantly when the proposed JND-adjustment process is applied.
Keywords: Perceptual video coding, Quantization, HEVC
1. Introduction
Recently, the standardization of next-generation video
coding, called High Efficiency Video Coding (HEVC) [1],
has been finalized with the goal of improving its coding
efficiency more than 2 times compared to that of
H.264/AVC [2]. To improve the coding performance, the
following advanced techniques have newly been adopted
for HEVC: a flexible structure of the Coding Unit (CU)
with skip modes; squared and asymmetric prediction units
(PUs) with advanced motion vector prediction and merge
modes; a residual quad-tree transform structure with 4×4, 8×8, 16×16, and 32×32 transforms; 33 directional intra prediction modes with planar and DC modes; Transform Skipping Mode (TSM) for the 4×4 inter and intra prediction modes; and additional in-loop filtering with sample adaptive offset [1]. Fig. 1 shows
the CU, PU and TU partition structures of HEVC. On the
other hand, such efforts toward improving the coding
efficiency have led to dramatic increases in the encoder
complexity, and further improvement of the coding efficiency becomes increasingly difficult in a rate-distortion (MSE: Mean Squared Error) sense, even with more elaborate coding tools at the price of complexity increases.
Another axis toward coding efficiency improvement is
perceptual video coding (PVC), where perceptual
distortion can be incorporated instead of MSE. Therefore,
considerable effort has been made in this direction in the
sense of PVC [3-7]. In PVC, how to effectively
incorporate visual perception models into the coding
process is one of the most important issues in reducing the
perceptual redundancy. Among the techniques for reducing
the perceptual redundancy, JND is an efficient model to
assess the perceptual redundancy of human visual systems
(HVS) [3-7]. In recent years, several methods have
focused on PVC based on HVS [3-7]. In [3], a JND-adaptive residue pre-processor in the image domain was
proposed for a motion-compensated residue signal with a
nonlinear additive model to improve the compression
performance. Chen et al. [5] introduced a foveation JND
model to improve the existing spatial and temporal JND
models. The proposed JND model is used for macroblock
quantization adjustment in H.264/AVC.
Naccari et al. [6] reported a decoder-side JND model
estimation to avoid additionally required signal bits for the
adjustment of the quantization matrices. Based on this,
they proposed a H.264/AVC-based perceptual video
coding architecture. In [7], a perceptual visual model was
applied for the earliest reference software version of
HEVC, called Test Model under Consideration (TMuC), to
improve the coding performance. On the other hand, all
previous methods are only applied for fixed and small-sized transforms (4×4 and 8×8). Luo et al. [8] proposed a PVC scheme determining the optimal quantization levels limited by the JND error. Nevertheless, to find the optimal quantization levels, this PVC scheme required a recursive structure that involved significant computational complexity. A previous study [9] proposed a JND-based HEVC-compliant PVC scheme with low computational complexity. On the other hand, the effects of JND suppression and quantization errors were not investigated for human perception. Therefore, this study analyzed the JND suppression with a quantization error limited by the JND and modified the proposed JND value.
The remainder of this paper is organized as follows: Section 2 briefly describes the previous JND-based HEVC-compliant PVC scheme [9] and proposes a new PVC scheme with a JND-suppression adjustment that takes the effects of JND suppression and quantization into account. Section 3 presents the performance evaluations and a discussion of the effect of the proposed scheme. Finally, this paper is concluded in Section 4.
Fig. 1. CU, PU and TU Partition Structures.
2. The proposed HEVC-Compliant PVC scheme
To consider the quantization and JND-suppression effects, a simple model of the quantization process was defined as follows:

l_{n,i,j} = \lfloor w_{n,i,j} / q + f \rfloor   (1)

where w_{n,i,j} and l_{n,i,j} are the transformed and quantized transform coefficients at (i, j) in block n, respectively, q and f are the quantization step size and rounding offset value for the quantization process, respectively, and \lfloor \cdot \rfloor is the floor function (integer operation). HEVC and H.264/AVC use a shift operation for the quantization process, which differs from Eq. (1), to avoid the division operation, which can cause a mismatch between the quantization processes of the encoder and decoder and has heavy computational complexity. In this paper, the simple quantization model in Eq. (1) was used to analyze the effects of the quantization process and JND suppression for simple derivation.
2.1 HEVC-Compliant Transform-domain JND suppression
The PVC schemes can be categorized into two:
standard-incompliant PVC schemes and standard-compliant PVC schemes. The standard-incompliant PVC
schemes have been applied for quantization and inverse
quantization in encoder processing and inverse
quantization in decoder processing. Therefore, the related
works for standard-incompliant PVC schemes require the
modification of encoders and decoders to reduce the
perceptual redundancy in quantization. On the other hand,
the related works for standard-compliant PVC schemes
have only been applied for quantization in encoder
processing. The standard-incompliant PVC schemes have more freedom to reduce the bitrate, but are not applicable in real applications of the video coding standards.
The standard-compliant PVC schemes can be applicable in
actual video coding encoders in a compliant manner with
video coding standards, such as H.264/AVC and HEVC.
Among the related works, Luo's PVC scheme [8] proposed a standard-compliant JND-suppression scheme by modifying the quantization process only on the encoder side. In this scheme, the quantized coefficients, l'_{n,i,j}, were suppressed in the direction given by the JND as

l'_{n,i,j} = l_{n,i,j} - h_{n,i,j}   (2)
where the tuning value, h_{n,i,j}, is derived from the following equation:

h_{n,i,j} = \max h \quad \text{s.t.} \ 0 \le h \le l_{n,i,j}, \ h \in \mathbb{Z}, \ e_{n,i,j}(h)/T_{n,i,j} \le JND_{n,i,j}   (3)

In Eq. (3), T_{n,i,j}(h) is the probability of detecting distortion based on a visual saliency map, and e_{n,i,j}(h) is the error function between the transformed coefficient and the JND-suppressed reconstruction coefficient in terms of the tuning value, h, for the quantized coefficient. This means that Luo's PVC scheme reduces the quantized coefficient until the error from quantization and JND suppression is the lowest, and it can keep the error below the JND value. On the other hand, Luo's PVC scheme needs to find a
quantized JND-suppression level recursively, requiring high computational complexity.
To solve this problem, a previous work [9] proposed a JND-suppression method as follows:

l'_{n,i,j} = \lfloor (w_{n,i,j} - JND^{scale}_{n,i,j}) / q + f \rfloor, \ \text{if} \ w_{n,i,j} > JND^{scale}_{n,i,j}; \quad l'_{n,i,j} = 0, \ \text{otherwise}   (4)

where JND^{scale}_{n,i,j} is a scaled JND value to consider a normalization factor for the HEVC transform in [9]. In Eq. (4), a subtraction operation is also needed for the PVC-based quantization process. Nevertheless, perceptual distortion can be detected when the error from quantization and JND suppression is larger than a JND value. Therefore, these effects are considered in the following section.
2.2 Analysis of JND suppression in terms of the quantization effects
To consider the error effect from quantization and JND suppression and to improve the previous PVC scheme, this section presents the proposed HEVC-compliant JND-suppression PVC scheme, which takes the JND-suppression effect together with the quantization error into account, as shown in Fig. 2.
Fig. 2. Proposed PVC scheme considering JND suppression effect.
The proposed JND adjustment process for a transform-domain JND model tunes the quantized level according to its JND value, quantization step size and rounding offset. Firstly, a lower bound is defined as follows:

\hat{w}'_{n,i,j} \le \hat{w}_{n,i,j} - JND_{n,i,j}   (5)

where \hat{w}_{n,i,j} and \hat{w}'_{n,i,j} are the reconstruction coefficient and the JND-suppressed reconstruction coefficient with JND adjustment, respectively. On the right-hand side, the reconstruction coefficient is used rather than the transformed coefficient because the purpose of the proposed PVC scheme is to reduce the difference between the JND-suppressed reconstruction coefficient and the reconstruction coefficient to a level virtually undetectable by the human visual system. From Eq. (1), Eq. (5) can be redefined as follows:

q \lfloor (w_{n,i,j} - (JND_{n,i,j} - \alpha)) / q + f \rfloor \le \hat{w}_{n,i,j} - JND_{n,i,j}   (6)

where \alpha is the tuning value to adjust the JND value, JND_{n,i,j}. To consider the rounding offset, f, for the transformed coefficient, w_{n,i,j}, and the JND value, JND_{n,i,j}, in Eq. (6), Eq. (7) can be defined as

q \lfloor (w_{n,i,j} - (JND_{n,i,j} - \alpha')) / q + 2f \rfloor \le \hat{w}_{n,i,j} - JND_{n,i,j}   (7)

where \alpha' can be defined in

\alpha' = \alpha - f \cdot q   (8)

To find the optimal \alpha, the floor function property in [10] can be used as follows:

\lfloor x + y + z \rfloor \le \lfloor x \rfloor + \lfloor y \rfloor + \lfloor z \rfloor + 2   (9)

for real numbers x, y, and z. From Eq. (9), Eq. (7) can be rewritten as follows:
Fig. 3. Proposed JND adjustment process.

q \lfloor (w_{n,i,j} - (JND_{n,i,j} - \alpha')) / q + 2f \rfloor \le q \lfloor w_{n,i,j}/q + f \rfloor - q \lfloor JND_{n,i,j}/q + f \rfloor + q \lfloor \alpha'/q \rfloor + 2q \le \hat{w}_{n,i,j} - JND_{n,i,j}   (10)

and

\hat{w}_{n,i,j} - \hat{JND}_{n,i,j} + q \lfloor \alpha'/q \rfloor + 2q \le \hat{w}_{n,i,j} - JND_{n,i,j}   (11)

-\hat{JND}_{n,i,j} + q \lfloor \alpha'/q \rfloor + 2q \le -JND_{n,i,j}   (12)

In Eqs. (11) and (12), \hat{JND}_{n,i,j} is the reconstructed JND value. Finally,

k \le \hat{JND}_{n,i,j}/q - JND_{n,i,j}/q - 2q   (13)

where k is defined in

k = \lfloor \alpha'/q \rfloor   (14)

From Eqs. (8) and (14), the range of \alpha can be derived from the definition of the floor function in [10] as follows:

k \le \alpha'/q < k + 1,   (15)

and

qk + qf \le \alpha < kq + q + fq.   (16)

From Eq. (13), the optimal k can be found first as

k = \lfloor \hat{JND}_{n,i,j}/q - JND_{n,i,j}/q - 2q \rfloor   (17)

From Eq. (16), \alpha can be found as the smallest value as

\alpha = \lceil qk + qf \rceil   (18)

where \lceil \cdot \rceil is the ceiling function. As shown in Fig. 3, the proposed JND adjustment process uses the JND value, JND_{n,i,j}, the quantization step size, q, and the rounding offset, f, to derive \alpha in Eq. (15) for tuning the JND value.
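To make the adjustment concrete, the following sketch (an illustration under the simplified quantization model of Eq. (1), not the HM integration used in the paper) quantizes a coefficient, applies the JND suppression of Eq. (4), and derives the tuning value α from Eqs. (17)-(18) as transcribed above; all numeric inputs, including the scaled and reconstructed JND values, are arbitrary assumptions.

```python
import math

def quantize(w, q, f):
    # Simplified quantization model of Eq. (1): l = floor(w/q + f).
    return math.floor(w / q + f)

def jnd_suppressed_level(w, q, f, jnd_scale):
    # JND suppression of Eq. (4): subtract the scaled JND value before quantization,
    # and clip the level to zero when the coefficient does not exceed it.
    if w > jnd_scale:
        return math.floor((w - jnd_scale) / q + f)
    return 0

def alpha_adjustment(jnd_rec, jnd, q, f):
    # JND adjustment of Eqs. (17)-(18): pick the integer k, then the smallest
    # admissible alpha from the range of Eq. (16).
    k = math.floor(jnd_rec / q - jnd / q - 2 * q)   # Eq. (17) as transcribed
    return math.ceil(q * k + q * f)                 # Eq. (18)

# Toy numbers (assumptions, not values from the paper).
q, f = 2.0, 0.5
w, jnd_scale = 73.0, 12.0
jnd, jnd_rec = 12.0, 30.0            # JND value and reconstructed JND value (assumed)
alpha = alpha_adjustment(jnd_rec, jnd, q, f)
print(quantize(w, q, f), jnd_suppressed_level(w, q, f, jnd_scale), alpha)   # -> 37 31 11
```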
3. Experimental Results
To evaluate the performance improvement achieved by
the proposed JND-suppression adjustment scheme, the
proposed model in Eqs. (14) and (15) was implemented in the HM 11.0 [11], and the JND-based HEVC-compliant PVC scheme with the JND-suppression adjustment process was compared with the original HM 11.0 and the JND-based HEVC-compliant PVC without a JND-suppression adjustment in the TU blocks. For the
experiments, four FHD (Full High Definition –
1920×1080) test sequences of the ParkScene (PkS), Cactus
(Ca), BQTerrace (BQT) and BasketballDrive (BDv) and
four FWVGA (Full Wide Video Graphics Array –
832×480) test sequences of RaceHorses (RH), BQMall
(BQM), PartyScene (PtS) and BasketballDrill (BDl) were
used, which are all in 4:2:0 color format. As the test
conditions, the default Low-Delay (LD) main
configuration [11] and the first 100 frames were used for
all experiments.
The original HM11.0, PVC (PVC without the JND-suppression adjustment process) and PVCAP (PVC with the JND-suppression adjustment process) were compared with respect to PSNR, bitrate reduction and encoding time. The encoding times were measured on a PC with an Intel 2.5 GHz hexa-core processor with 32 GB RAM. The bitrate
reduction (ΔBitrate), the encoding time (ΔTime) and the
PSNR difference (ΔPSNR) were calculated as
ΔBitrate = (Bitrate_{pro} - Bitrate_{ref}) / Bitrate_{ref} \times 100   (16)

ΔTime = (Time_{pro} - Time_{ref}) / Time_{ref} \times 100   (17)

ΔPSNR = PSNR_{pro} - PSNR_{ref}   (18)
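As a quick illustration of how the three comparison metrics above are computed (toy numbers only, not values from Table 1):

```python
def delta_bitrate(bitrate_pro, bitrate_ref):
    return (bitrate_pro - bitrate_ref) / bitrate_ref * 100.0   # percent

def delta_time(time_pro, time_ref):
    return (time_pro - time_ref) / time_ref * 100.0            # percent

def delta_psnr(psnr_pro, psnr_ref):
    return psnr_pro - psnr_ref                                  # dB

# Toy numbers: bitrate drops from 1000 kbps to 800 kbps, encoding time grows
# from 100 s to 125 s, and PSNR drops from 36.5 dB to 36.1 dB.
print(delta_bitrate(800, 1000), delta_time(125, 100), delta_psnr(36.1, 36.5))
# -> -20.0 25.0 -0.4 (approximately)
```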
Table 1 presents the performance comparison for
PSNR and bitrate reduction between PVC and PVCAP
under the LD configuration. As shown in Table 1, the
average bitrate reduction was 23.66% for PVC and 18.53%
for PVCAP, compared to the original HM 11.0. This
suggests that the PVC outperforms PVCAP with 5.13%
more bitrate reduction but PVCAP outperforms PVC with a
0.39 dB PSNR increase. As shown in Table 1, PVCAP increases the encoder complexity by 29.9% in execution time compared to the HM 11.0, whereas PVC increases it by 22.7%, which is a similar computational complexity.
In terms of perceptual verification, however, PVCAP
can improve the visual quality significantly, as shown in
Figs. 4 and 5. Fig. 4(a) shows the 51st frame of the original
RaceHorses sequence with an image region (some grass)
in the bounded box. Figs. 4(b)-(d) show some image
regions of the reconstructed frames by the original HM
11.0, PVC and PVCAP, respectively. From a comparison of
the visual quality, the proposed PVCAP produces fewer
visual artifacts in the reconstructed region of some grass
compared to the PVC. Also, Fig. 5(a) shows the 30th frame
of the original BQMall sequence with an image region
(back of one’s head) in the bounded box. Figs. 5(b)-(d)
show some image regions of the reconstructed frames by
Table 1. Experimental Results.
Seq.  QP   ΔBitrate(%)         ΔPSNRY(dB)          ΔTime(%)
           PVC       PVCAP     PVC       PVCAP     PVC     PVCAP
PkS   22   -33.73    -32.75    -2.75     -2.62     23.0    26.9
      27   -23.12    -16.33    -1.73     -1.36     32.6    36.9
      32   -9.00     -7.08     -0.74     -0.58     38.1    40.8
Ca    22   -46.77    -48.37    -2.00     -1.89     13.0    14.9
      27   -19.74    -13.63    -1.32     -0.93     30.3    32.1
      32   -5.76     -5.76     -0.71     -0.47     36.5    37.4
BQT   22   -65.33    -64.13    -2.82     -2.73     7.7     8.5
      27   -40.86    -18.78    -1.12     -0.64     29.2    29.4
      32   -12.09    -7.59     -0.45     -0.32     40.1    37.4
BDv   22   -36.06    -38.09    -1.97     -1.88     23.0    22.6
      27   -15.57    -12.16    -1.46     -1.07     32.2    34.1
      32   -6.57     -3.65     -0.98     -0.56     38.3    38.9
RH    22   -35.84    -34.71    -4.27     -4.01     8.6     19.8
      27   -25.86    -2.67     -2.65     -1.06     16.9    30.8
      32   -5.92     -1.68     -0.91     -0.60     22.7    36.2
BQM   22   -28.34    -27.50    -3.54     -3.32     10.7    22.1
      27   -20.42    -8.99     -2.46     -1.67     18.8    31.8
      32   -7.82     -2.97     -1.30     -0.84     22.8    36.4
PtS   22   -32.45    -30.29    -4.61     -4.31     5.3     17.4
      27   -25.33    -11.08    -2.71     -1.72     16.0    28.2
      32   -12.29    -3.49     -1.13     -0.61     23.6    35.5
BDl   22   -17.40    -23.63    -2.44     -2.31     12.6    26.3
      27   -18.00    -10.96    -1.40     -0.97     19.1    34.9
      32   -18.50    -3.54     -0.55     -0.29     23.9    39.5
Avg        -23.66    -18.53    -1.98     -1.59     22.7    29.9
Fig. 5. Comparisons of subjective qualities at QP 27 (a) original 30th frame of the BQMall sequence, (b) the reconstructed region of the back of one's head by the original HM 11.0, (c) the reconstructed region of the back of one's head by the PVC, (d) the reconstructed region of the back of one's head by the PVCAP.
the original HM 11.0, PVC and PVCAP, respectively. From
a comparison of the visual quality, the proposed PVCAP
produces fewer ringing artifacts in the reconstructed region
of the back of one’s head compared to the PVC.
4. Conclusion
This study examined the effects of JND suppression in
terms of the quantization effect in the transform domain,
and proposed a JND value adjustment method for a JND-based HEVC-compliant PVC scheme. The bitrate
reduction decreased in some QP ranges but the PSNR and
visual quality were improved significantly.
Acknowledgement
This study was supported by the IT R&D program of
MSIP/IITP (10039199, A Study on Core Technologies of
Perceptual Quality based Scalable 3D Video Codecs),
Korea.
Fig. 4. Comparisons of subjective qualities at QP 27 (a) original 51st frame of the RaceHorses sequence, (b) the reconstructed region of the grass by the original HM 11.0, (c) the reconstructed region of the grass by the PVC, (d) the reconstructed region of the grass by the PVCAP.
References
[1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand,
“Overview of the High Efficiency Video Coding
(HEVC) Standard.” IEEE Trans. Circuits Syst. Video
Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
Article (CrossRef Link)
[2] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, T.
Wiegand, “Comparison of the Coding Efficiency of
Video Coding Standards – Including High Efficiency
Video Coding (HEVC),” IEEE Trans. Circuits Syst.
Video Technol., vol. 22, no. 12, pp. 1669-1684, Dec.
2012. Article (CrossRef Link)
[3] Y. Xiaokang, et al., “Motion-compensated residue
preprocessing in video coding based on justnoticeable-distortion profile,” IEEE Trans. Circuits
Syst. Video Technol., vol.15, no. 6, pp. 742-752, Jun.
2005. Article (CrossRef Link)
[4] W. Zhenyu and K. N. Ngan, “Spatio-Temporal Just
Noticeable Distortion Profile for Grey Scale
Image/Video in DCT Domain,” IEEE Trans. Circuits
Syst. Video Technol., vol.19, no. 3, pp. 337-346, Mar.
2009. Article (CrossRef Link)
[5] Z. Chen and C. Guillemot, “Perceptually-Friendly
H.264/AVC Video Coding Based on Foveated JustNoticeable-Distortion Model,” IEEE Trans. Circuits
Syst. Video Technol., vol. 20, no. 6, pp. 806-819, Jun.
2010. Article (CrossRef Link)
[6] M. Naccari and F. Pereira, “Advanced H.264/AVCBased Perceptual Video Coding: Architecture, Tools,
and Assessment,” IEEE Trans. Circuits Syst. Video
Technol., vol. 21, no. 6, pp. 766-782, Jun. 2011.
Article (CrossRef Link)
[7] M. Naccari and F. Pereira, “Integrating a spatial just
noticeable distortion model in the under development
HEVC codec,” Proc IEEE Int. Conf. on Acoustics,
Speech, and Signal Processing, pp. 817-820, Prague,
Czech Republic, May 2011. Article (CrossRef Link)
[8] Z. Luo, L. Song, S. Zheng, and N. Ling, “H.264/Advanced Video Control Perceptual Optimization Coding Based on JND-Directed Coefficient Suppression,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 6, pp. 935-948, Jun. 2013. Article (CrossRef Link)
[9] J. Kim, M. Kim, “An HEVC-Compliant Perceptual
Video Coding Scheme based on JND Models for
Variable Block-sized Transform Kernels”, IEEE
Trans. Circuits Syst. Video Technol., to be published,
2015. Article (CrossRef Link)
[10] B. M. Stephen, R. Anthony, Discrete Algorithmic
Mathematics, Third Edition, CRC Press, 2005.
Article (CrossRef Link)
[11] HM reference software 11.0. Article (CrossRef Link)
Jaeil Kim received his B.S. degree in
electrical engineering from the Korea
University of Technology and
Education, Cheonan, Korea, in 2006.
He received a Ph.D. degree in
Department of Information and
Communications
Engineering
of
Korea Advanced Institute of Science
and Technology (KAIST), Daejeon, Korea. His current
research interests include perceptual video coding,
hardware friendly video codec design, transform coding
for video, image/video processing, and pattern recognition.
Munchurl Kim received his B.E.
degree in electronics from Kyungpook
National University, Daegu, Korea, in
1989, and his M.E. and Ph.D. degrees
in electrical and computer engineering
from the University of Florida,
Gainesville, in 1992 and 1996,
respectively. After his graduation, he
joined the Electronics and Telecommunications Research
Institute, Daejeon, Korea, as a Senior Research Staff
Member, where he led the Realistic Broadcasting Media
Research Team. In 2001, he was an Assistant Professor
with School of Engineering, Information and
Communications University, Daejeon, Korea. Since 2009,
he has been an Associate Professor and is now a Full
Professor of Department of Electrical Engineering, KAIST,
Daejeon, Korea. His current research interests include
2D/3D perceptual video coding, visual perception
modeling on 2D/3D video, high dynamic range video
processing, machine learning, and pattern recognition.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.028
28
IEIE Transactions on Smart Processing and Computing
Performance Analysis of HEVC Parallelization Methods
for High-Resolution Videos
Hochan Ryu, Yong-Jo Ahn, Jung-Soo Mok, and Donggyu Sim
Department of Computer Engineering, Kwangwoon University / Seoul, South Korea
{jolie, yongjoahn, jsmok, dgsim}@kw.ac.kr
* Corresponding Author: Donggyu Sim
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
Abstract: Several parallelization methods that can be applied to High Efficiency Video Coding
(HEVC) decoders are evaluated. The market requirements of high-resolution videos, such as Full
HD and UHD, have been increasing. To satisfy the market requirements, several parallelization
methods for HEVC decoders have been studied. Understanding these parallelization methods and
objective comparisons of these methods are crucial to the real-time decoding of high-resolution
videos. This paper introduces the parallelization methods that can be used in HEVC decoders and
evaluates the parallelization methods comparatively. The experimental results show that the
average speed-up factors of tile-level parallelism, wavefront parallel processing (WPP), frame-level
parallelism, and 2D-wavefront parallelism are observed up to 4.59, 4.00, 2.20, and 3.16,
respectively.
Keywords: HEVC decoder, Parallelism, Tile, WPP, Wavefront
1. Introduction
Video services have evolved rapidly with the demands
of high-quality videos. Higher resolution and higher
bitrates are required with the evolution of video services.
In the current market, there are many video streaming
services and the market requires high-resolution video,
such as Full HD and 4K-UHD. To correspond to the
requirements, the Joint Collaborative Team on Video Coding (JCT-VC) of the ISO/IEC Moving Picture Experts Group
(MPEG) and the ITU-T Video Coding Experts Group
(VCEG) developed High Efficiency Video Coding
(HEVC), which is the next generation video coding
standard [1]. HEVC is based on quad-tree based coding
tree units (CTUs) and coding units (CUs), which can be
larger than macroblocks of the previous video coding
standards. In addition, the prediction unit (PU) and
transform unit (TU) that can be split from CU are used. In
addition, 8-tap filters are used to interpolate the fractional
samples, compared to 6-tap filtering of H.264/AVC [2]. 35
modes are then used in the intra prediction [3] and sample
adaptive offset (SAO) is used in the in-loop filtering
process [4]. These new features and additional tools
compared to the previous video coding standards increase
the decoder complexity [5]. To solve these problems,
many studies have focused on fast HEVC encoders and
decoders [6, 7]. However, it is difficult to make real-time
codecs without parallelization. Therefore, parallelization of
the video decoder could be essential for the real time
decoding of high resolution videos. Currently, the multicore system can be used for the parallelization of a video
decoder. There are several conventional parallelization
methods for video decoders [8, 9]. For efficient
parallelization, an objective comparison of the parallel
methods on the multi-core system is very important.
In this paper, several parallelization methods of HEVC
were analyzed and evaluated comparatively. In the process
of establishing a HEVC standard, several techniques were
proposed for the parallelism. Finally, the tile and
wavefront parallel processing (WPP) were adopted for the
HEVC standard. The tile is a rectangular region of coding
tree blocks (CTBs) within a particular tile column and a
particular tile row in a picture [10]. There are no data
dependencies among the partitioned regions. Therefore,
the partitioned regions can be decoded in parallel [11].
WPP is a parallelization technique that uses a separate
CABAC context for each CTU line. Based on this method,
entropy decoding of each CTU line can be conducted
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
29
Fig. 1. Standardization flow of HEVC.
simultaneously [12]. Another parallelization method for an HEVC decoder is frame-level parallelization. Using frame-level parallelization, several pictures can be decoded in parallel. On the other hand, for the parallel decoding of each picture, consideration of the structure of the reference pictures is necessary. The parallelization methods introduced above are encoder-dependent methods. On the other hand, the 2D-wavefront method is not an encoder-dependent method. For this reason, the 2D-wavefront method is the most widely used parallelization method in video decoders. Using 2D-wavefront parallelization, pixel
decoding of the entropy decoded CTUs can be conducted
simultaneously in the CTU line level. This paper compares
several parallelization methods and evaluates parallel
performance of all the parallelization methods objectively.
This paper is organized as follows. Section 2 reviews
the HEVC standard. In Section 3, the parallelization
methods of HEVC decoder are introduced. The advantages
and drawbacks of the methods are then compared. In
Section 4, all the parallelization methods are evaluated in terms of parallel and RD performance. Finally, Section 5 concludes the paper.
2. HEVC Overview
HEVC is the newest video coding standard jointly
developed by the ISO/IEC MPEG and ITU-T VCEG
focused on significantly better coding efficiency. The first
version of the HEVC standard was completed in January 2013. The second version of the HEVC standard was finalized in June 2014, including range extensions, scalable coding extensions, and multi-view extensions [14].
In addition, the screen content coding that is designed for
computer generated images and videos is being developed
[15].
HEVC has new features for the efficient coding of high-resolution and high-quality videos. First, HEVC employs hierarchical variable blocks of CU, PU, and TU, whereas the previous video codecs use the fixed-size macroblock [16]. Because the blocks of HEVC can be larger than the previous macroblock, HEVC can compress high-resolution videos more efficiently. Fig. 2 presents one of
Fig. 2. Block partition of HEVC.
Fig. 3. Block diagram of the HEVC decoder.
the block partition examples based on HEVC. A picture can be partitioned into multiple CTUs, and each CTU is split into multiple CUs whose size is variable. In addition, PUs and TUs are split from each CU.
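As a rough illustration of this hierarchical block structure (a toy recursive split, not the HEVC partitioning logic; the splitting criterion is an assumed placeholder):

```python
def split_ctu(x, y, size, should_split, min_cu=8):
    """Recursively split a CTU at (x, y) into CUs, mimicking the quad-tree idea.
    `should_split(x, y, size)` is an assumed, caller-supplied decision function
    (a real encoder would use rate-distortion optimization here)."""
    if size > min_cu and should_split(x, y, size):
        half = size // 2
        cus = []
        for dy in (0, half):
            for dx in (0, half):
                cus += split_ctu(x + dx, y + dy, half, should_split, min_cu)
        return cus
    return [(x, y, size)]   # leaf CU; PUs and TUs would be derived from it

# Example: split a 64x64 CTU whenever the block is larger than 32x32.
print(split_ctu(0, 0, 64, lambda x, y, s: s > 32))
# -> [(0, 0, 32), (32, 0, 32), (0, 32, 32), (32, 32, 32)]
```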
Secondly, a more accurate intra prediction can be
performed with 35 intra prediction modes. In the case of
inter prediction, the precision of motion compensation is
increased using the 8-tap interpolation filter. Merge and
advanced motion vector prediction (AMVP) are used for
efficient syntax in representing the inter prediction. Third,
HEVC uses an additional in-loop filter called SAO to
improve the subjective quality and coding efficiency at the
same time. On the other hand, these additional features
make the encoder and decoder more complex [17]. The
main sources of the HEVC decoder complexity are entropy
decoding and pixel decoding parts, as shown in Fig. 3. For
high-resolution and high-bitrate videos, the complexity of the entropy decoding part is even higher [18].
Several parallelization methods can be used to reduce the
complexity of entropy decoding and pixel decoding parts.
Fig. 4. Example of tile partitioning in a picture (a) no tile splits, (b) use of 4 tile splits.
Fig. 5. CABAC context model initialization positions and dependencies among CTUs.
Table 1. BD rates according to the number of tiles.
Sequence          BD-rate (4 tiles)   BD-rate (8 tiles)
BasketballDrill   2.03                3.09
BQMall            0.85                1.75
PartyScene        0.60                0.80
Racehorse         0.99                1.34
Avg.              1.12                1.75
3. Parallelization Methods of HEVC
Decoder
The HEVC standard is designed to be suitable for
parallel processing. Tile and WPP were adopted for
better parallel processing. In addition, HEVC supports
parallel-friendly in-loop filter structures [13]. In this section, four parallelization methods that can be used in HEVC decoders for the real-time decoding of high-resolution videos are introduced and compared.
Fig. 6. Example of the GOP structure for the frame-level
parallelism.
3.1 Tile-level Parallelism
Tile-level parallelism can be used for parallel decoding within a picture. A picture can be partitioned into a number of rectangular blocks composed of CTBs. Fig. 4 gives an example of tile partitioning. No data dependencies exist among the partitioned regions. In addition, the scan order of the CTUs is changed to the tile scan order. Each tile can be decoded independently by assigning it to one of multiple cores or threads. Therefore, this method is suitable for multi-core-based parallelization. The parallel performance increases with an increasing number of tile splits; however, increasing the number of tile splits degrades the compression performance. Table 1 lists the compression performance degradation according to the number of tiles. The BD-rate losses of 4 tiles and 8 tiles are 1.12% and 1.75% on average.
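The independence of tiles makes the mapping to worker threads straightforward; the sketch below (an illustration of the scheduling idea with a placeholder decode_tile function, not an HM-based decoder) simply hands each tile of a picture to its own worker:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_tile(tile):
    # Placeholder for real tile decoding (entropy decoding + pixel reconstruction).
    # Tiles share no data dependencies, so no synchronization is needed within a picture.
    return f"decoded tile {tile['id']}"

def decode_picture_tiles(tiles, num_threads=4):
    # Each tile is an independent work item; in-loop filtering across tile
    # boundaries would still have to be handled after this step.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(decode_tile, tiles))

print(decode_picture_tiles([{'id': i} for i in range(4)]))
```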
3.2 Wavefront Parallel Processing (WPP)
WPP is an adopted parallelization method of the HEVC standard. Using this method, the entropy decoding of each CTU line can be conducted in parallel. At the first CTU of each CTU line, the CABAC context model is copied from the updated CABAC context located at the top-right CTU. If WPP is used, the efficiency of CABAC can be degraded slightly. In addition, the dependence among the CTUs should be considered. Fig. 5 shows the CABAC context model initialization positions and the dependencies of the WPP method. Because the CABAC contexts should be initialized at the first CTU of each CTU line, the entropy decoding of each CTU line cannot begin at the same time. In addition, the dependencies among the CTUs should be checked in the pixel decoding process.
3.3 Frame-level Parallelism
Frame-level parallelism is suitable for special-purpose applications. For frame-level parallelism, a special GOP structure should be considered in the encoder. Fig. 6 gives an example of the GOP structure for frame-level parallelism. In Fig. 6, all the B-pictures in a GOP refer only to the previous I-picture. Using this structure, the following B-pictures can be decoded in parallel. On the other hand, the limited reference pictures can lead to a rapid decrease in compression performance. If the number of B-pictures that refer to the previous I-picture is increased, good parallel performance can be achieved in the decoding of the B-pictures. On the other hand, the complexity of the I-picture is much higher than that of a B-picture. Therefore, this parallel method cannot beat the decoding time of the I-picture, which is decoded sequentially.
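A toy sketch of this scheduling (placeholder decoders, not a real HEVC decoder) under the GOP structure of Fig. 6: the I-picture must be decoded first, after which the B-pictures that reference only that I-picture can be decoded concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_i_picture(idx):
    return f"I{idx}"             # placeholder for sequential I-picture decoding

def decode_b_picture(idx, ref):
    return f"B{idx}(ref={ref})"  # placeholder; only the previous I-picture is referenced

def decode_gop(i_idx, b_indices, num_threads=4):
    ref = decode_i_picture(i_idx)                 # sequential bottleneck noted in the text
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return [ref] + list(pool.map(lambda b: decode_b_picture(b, ref), b_indices))

print(decode_gop(0, [1, 2, 3, 4, 5, 6, 7]))
```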
3.4 2D-wavefront Parallelism
2D-wavefront parallelism is the most widely used parallel method in video decoders. This method is an encoder-independent method, in contrast to the above three methods. Using 2D-wavefront parallelism, the pixel
Fig. 7. Dependencies among the neighboring CTUs in
pixel decoding.
Fig. 9. Distribution of the CTUs that can be decoded at
the same time.
decoding process can be conducted simultaneously at the CTU line level. Like the WPP method, the dependencies among the CTUs should be considered. Fig. 7 presents the neighboring CTUs that should be decoded before decoding the current CTU.
Fig. 8. Problem of 2D-wavefront parallelism.
Fig. 8 shows the problem of 2D-wavefront parallelism. Because of the dependencies on the neighboring CTUs, idle threads can occur in the prologue and epilogue parts of a picture. The maximum number of CTUs (P_CTU) that can conduct pixel decoding in parallel is calculated by Eq. (1). nCTU_h is the number of CTUs in the picture height and nCTU_w is the number of CTUs in the picture width.

P_{CTU} = \min(nCTU_h, (nCTU_w + 1)/2) \ \text{if} \ nCTU_w \ \text{is odd}; \quad P_{CTU} = \min(nCTU_h, nCTU_w/2) \ \text{otherwise}   (1)

In Full HD resolution, there are 30×17 CTUs of size 64×64 in a picture. In this case, the number of CTUs that can be decoded simultaneously is represented in Fig. 9. In the prologue and epilogue parts of a picture, the CTUs that can be decoded at the same time are insufficient. For this reason, the threads can be idle in these parts.
3.5 Comparisons of parallelism
Table 2. Comparisons of the four parallel methods.
Parallelization method      Compression performance degradation   Encoder dependency
Tile-level parallelism      ○                                     ○
WPP                         ○                                     ○
Frame-level parallelism     ○                                     ○
2D-wavefront parallelism    None                                  None
As introduced above, all the parallelization methods have advantages and disadvantages. Table 2 compares the four parallel methods. Note that tile-level parallelism, WPP and frame-level parallelism suffer compression performance degradation, and these methods are encoder-dependent. On the other hand, the 2D-wavefront parallelization method has no degradation of compression performance and it is not dependent on the encoders.
Table 3. Experimental conditions.
Processor       Intel Xeon CPU E5-2687W v2 3.40GHz
OS              Windows Server 2008 R2 Standard
Compiler        Intel C++ Compiler XE 13.0
Parallel tool   OpenMP 2.0
Resolution      FHD (1920×1080), 4K-UHD (3840×2160)
Coding mode     Low Delay
QP              22, 27, 32, 37
4. Performance Evaluation
For a performance evaluation of the four parallelism
methods, all the parallelization methods were implemented on an ANSI C-based private decoder developed based on the
HM12.0 decoder [19]. For the parallel implementation,
OpenMP was used with Microsoft Visual Studio 2012 [20].
Table 3 lists the experimental conditions.
The parallelism performance was measured by Eq. (2). Decoding time_ref represents the decoding time of the HM12.0 decoder without any parallelism methods. Decoding time_prop denotes the decoding time of the parallelized private decoder with a parallelization method.

Speed-up = Decoding time_ref / Decoding time_prop   (2)
Table 4 lists the speed-up factors of tile-level
Table 4. Speed-up factors of tile-level parallelism.
Resolution   Sequence          Speed-up according to the number of tiles
                               2      4      6
FHD          BasketballDrive   2.81   3.61   4.45
             BQTerrace         2.37   4.08   4.80
             Cactus            3.04   3.68   4.26
             Kimono1           3.00   4.14   4.65
             ParkScene         3.12   3.84   4.78
             Avg.              2.87   3.87   4.59
4K-UHD       Beauty            2.89   4.15   4.56
             Jockey            3.06   4.02   4.87
             ShakeNDry         2.97   3.64   4.26
             Avg.              2.97   3.94   4.56

Table 5. Speed-up factors of WPP.
Resolution   Sequence          Speed-up according to the number of threads
                               2      4      6
FHD          BasketballDrive   2.47   3.07   3.40
             BQTerrace         2.09   2.93   3.07
             Cactus            2.49   2.95   3.73
             Kimono1           2.10   2.80   2.82
             ParkScene         1.98   2.88   2.85
             Avg.              2.23   2.93   3.17
4K-UHD       Beauty            2.71   3.58   4.60
             Jockey            2.67   3.66   4.11
             ShakeNDry         2.11   2.63   3.28
             Avg.              2.50   3.29   4.00

Table 6. Speed-up factors of frame-level parallelism.
Resolution   Sequence          Speed-up according to GOP size
                               4      6      8
FHD          BasketballDrive   2.12   2.21   2.33
             BQTerrace         2.01   2.13   2.37
             Cactus            1.96   2.06   2.28
             Kimono1           1.71   1.83   1.93
             ParkScene         1.82   1.94   2.07
             Avg.              1.92   2.03   2.20
4K-UHD       Beauty            1.68   1.85   1.89
             Jockey            1.71   1.83   1.85
             ShakeNDry         1.67   1.84   1.91
             Avg.              1.69   1.84   1.88

Table 7. Speed-up factors of 2D-wavefront parallelism.
Resolution   Sequence          Speed-up according to the number of threads
                               2      4      6
FHD          BasketballDrive   2.75   3.43   3.47
             BQTerrace         3.00   3.61   3.67
             Cactus            2.13   2.50   2.47
             Kimono1           2.81   3.50   3.57
             ParkScene         2.17   2.62   2.63
             Avg.              2.57   3.13   3.16
4K-UHD       Beauty            2.04   1.94   2.03
             Jockey            2.36   3.30   2.93
             ShakeNDry         2.03   2.39   2.35
             Avg.              2.14   2.54   2.44
parallelism. The speed-up factors increase with an increasing number of tiles. The results show that the average speed-up factors are 4.59 and 4.56 for FHD and 4K-UHD with 6 tiles, respectively.
Table 5 lists the speed-up factors using the WPP. The
results show that the speed-up factors for the 4K-UHD
sequences are higher than those for FHD sequences. The
speed-up factors are influenced by the entropy decoding
complexity between the FHD and 4K-UHD sequences.
The entropy decoding part of the 4K-UHD sequences is
more complex than the entropy decoding of the FHD
sequences. Therefore, the parallel performance of the WPP
for parallel entropy decoding showed reasonably good
parallel performance for the 4K-UHD sequences.
Table 6 presents the results of frame-level parallelism.
The speed-up factors are not large. These results are due to
the decoding overhead of the I-picture. Table 7 lists the
speed-up factors of the 2D-wavefront parallelism. The
mean speed-up factors for the FHD videos were higher
than those for the 4K-UHD videos. This means that the
parallel performance of 2D-wavefront parallelism is
affected by the entropy decoding part that is not a parallel
region based on the 2D-wavefront parallelism.
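This prologue/epilogue behavior can be reproduced with a few lines of code; the sketch below (an illustration only, not the decoder used in these experiments) enumerates the wavefront steps for the Full HD CTU grid mentioned in Section 3.4, showing both the peak parallelism predicted by Eq. (1) and the small number of decodable CTUs at the start and end of a picture.

```python
def wavefront_schedule(n_ctu_w, n_ctu_h):
    """For each wavefront step, list the CTUs whose left and upper-right
    neighbours are already decoded (the usual 2-CTU lag between CTU lines).
    Returns a list of lists of (x, y) CTU coordinates per step."""
    steps = {}
    for y in range(n_ctu_h):
        for x in range(n_ctu_w):
            steps.setdefault(x + 2 * y, []).append((x, y))
    return [steps[t] for t in sorted(steps)]

# Full HD example from the text: 30x17 CTUs of size 64x64.
sched = wavefront_schedule(30, 17)
print(max(len(s) for s in sched))       # peak parallelism, cf. Eq. (1): min(17, 30/2) = 15
print([len(s) for s in sched[:6]])      # prologue: only 1..3 CTUs ready -> idle threads
```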
5. Conclusion
This paper introduced several parallel methods for
HEVC decoders and evaluated the real-time decoding of
high-resolution videos. Tile-level parallelism, WPP,
frame-level parallelism, and 2D-wavefront parallelism
were implemented on the ANSI C-based private decoder.
The parallel performance of the several parallel methods was compared. For the FHD test sequences, the mean speed-up factors of tile-level parallelism, WPP, frame-level parallelism, and 2D-wavefront parallelism were evaluated and found to be up to 4.59, 3.17, 2.20, and 3.16, respectively. In addition, the speed-up factors for the 4K-UHD sequences were observed to be up to 4.56, 4.00, 1.88, and 2.54 for tile-level parallelism, WPP, frame-level parallelism, and 2D-wavefront parallelism, respectively.
Acknowledgement
This research was supported by Basic Science Research
Program through the National Research Foundation of
Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A2A1A11052210).
References
[1] High Efficiency Video Coding, Rec. ITU-T H.265
and ISO/IEC 23008-2, Jan 2013. Article (CrossRef
Link)
[2] J. Lainema, F. Bossen, W. Han, J. Min, and K. Ugur,
“Intra Coding of the HEVC Standard,” IEEE Trans.
Circuits Syst. Video Technol., vol. 22, no. 12,
pp.1792-1801, Dec. 2012. Article (CrossRef Link)
[3] G.J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand,
“Overview of the High Efficiency Video Coding
(HEVC) standard,” IEEE Trans. Circuits Syst. Video
Technol., vol. 22, no. 12, pp.1648-1667, Dec. 2012.
Article (CrossRef Link)
[4] C. Fu, E. Alshina, A. Alshin, Y. Huang, C. Chen, and
C. Tsai, C. Hsu, S. Lei, J. Park, and W. Han, “Sample
adaptive offset in the HEVC standard,” IEEE trans.
on Circuits and Systems for Video Technol., vol. 22,
no. 12, pp. 1755-1764, Dec. 2012. Article (CrossRef
Link)
[5] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T.K. Tan, and
T. Wiegan, “Comparison of the coding efficiency of
video coding standards-Including High Efficiency
Video Coding (HEVC),” IEEE Trans. Circuits Syst.
Video technol., vol. 22, no. 12, pp. 1668-1683, Dec.
2012. Article (CrossRef Link)
[6] Y. Ahn, T. Hwang, D. Sim, and W. Han,
“Implementation of fast HEVC encoder based on
SIMD and data-level parallelism,” EURASIP Journal
on Image and Video Processing, March. 2014.
Article (CrossRef Link)
[7] E. Ryu, J. Nam, S. Lee, H. Jo, and D. Sim, "Sample
adaptive offset parallelism in HEVC," International
Conference on Green and Human Information
Technology (ICGHIT 2013), Hanoi, Vietnam, Feb. 27
~ Mar. 1, 2013. Article (CrossRef Link)
[8] H. Jo and D. Sim, “Hybrid Parallelization for HEVC
Decoder,” Image and Signal Processing (CISP)., vol.
1, pp. 170-175, Dec. 2012. Article (CrossRef Link)
[9] E. Baaklini, H. Sbeity, and S. Niar, “H.264
Macroblock Line Level Parallel Video Decoding on
Embedded Multicore Processors,” Digital System
Design (DSD), pp. 902-906, Sep. 2012. Article
(CrossRef Link)
[10] B. Bross, W.-J, Han, G.J. Sullivan, J.-R. Ohm, and T.
Wiegand, “High Efficiency Video Coding (HEVC)
Text Specification Draft 10 (for FDIS & Last
Call,”document JCTVC-L1003_v34, Geneva, CH,
Jan. 2013. Article (CrossRef Link)
[11] K. Misra, A. Segall, M. Horowitz, S. Xu, A. Fuldseth, and M. Zhou, “An Overview of Tiles in HEVC,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, pp. 969-977, Nov. 2013. Article (CrossRef Link)
[12] C. Chi, M. Alvarez-Mesa, B. Juurlink, G. Clare, F. Henry, S. Pateux, and T. Schierl, “Parallel Scalability and Efficiency of HEVC Parallelization Approaches,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1827-1838, Dec. 2012.
Article (CrossRef Link)
[13] A. Norkin, G. Bjøntegaard, A. Fuldseth, M.
Narroschke, M. Ikeda, K. Andersson, M. Zhou, and
G. Auwera, “HEVC deblocking filter,” IEEE Trans. on Circuits and Systems for Video Technol., vol. 22, no. 12, pp. 1746-1754, Dec. 2012. Article (CrossRef Link)
[14] J. Boyce, et al., “Draft high efficiency video coding (HEVC) version 2, combined format range extensions (RExt), scalability (SHVC), and multi-view (MV-HEVC) extensions,” JCTVC-R1013, 18th JCT-VC meeting, Sapporo, JP, June 2014. Article (CrossRef Link)
[15] ITU-T Q6/16 and ISO/IEC JTC1/SC29/WG11 document N14175, “Joint Call for Proposals for Coding of Screen Content,” San Jose, USA, Jan. 2014. Article (CrossRef Link)
[16] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC Video Coding Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, Jul. 2003. Article (CrossRef Link)
[17] F. Bossen, B. Bross, K. Sühring, and D. Flynn, “HEVC complexity and implementation analysis,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1685-1696, Dec. 2012. Article (CrossRef Link)
[18] H. Jo and D. Sim, “Bitstream decoding processor for fast entropy decoding of variable length coding-based multiformat videos,” Optical Engineering on Imaging Components, Systems, and Processing, vol. 53, no. 6, June 2014. Article (CrossRef Link)
[19] HM-12.0, https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-12.0 Article (CrossRef Link)
[20] L. Dagum and R. Menon, “OpenMP: an industry standard API for shared-memory programming,” IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46-55, Jan. 1998. Article (CrossRef Link)
Hochan Ryu received his B.S. degree
in Computer Engineering from Kwangwoon University, Seoul, Republic of
Korea, in 2013. Currently, he is
working toward a M.S degree in
computer engineering at the same
university. His current research
interests include video coding and
computer vision.
Yong-Jo Ahn was born in Jeju
Province, Republic of Korea, in 1984.
He received his B.S. and M.S. degree
in Computer Engineering from Kwangwoon University, Seoul, Korea, in
2010 and 2012, respectively. Now, he
is working toward a Ph.D. degree at
the same university. His current
research interests include high-efficiency video compression
techniques, parallel processing for video coding and
multimedia system.
Jung-Soo Mok received his B.S.
degree in Computer Engineering from
Kwangwoon
University,
Seoul,
Republic of Korea, in 2013. Currently,
he is working towards a M.S degree in
computer engineering at the same
university. His current research
interests include video coding and
computer vision.
Donggyu Sim received his B.S. and
M.S. degrees in Electronic Engineering from Sogang University,
Korea, in 1993 and 1995, respectively.
He also received the Ph.D. degree at
the same University, in 1999. He was
with Hyundai Electronics Co., Ltd. from 1999 to 2000, where he was involved in MPEG-7 standardization. He was a senior research
engineer at Varo Vision Co., Ltd, working on MPEG-4
wireless applications from 2000 to 2002. He worked for
the Image Computing Systems Lab. (ICSL) at the
University of Washington as a senior research engineer
from 2002 to 2005. He researched ultrasound image
analysis and parametric video coding. Since 2005, he has
been with the Department of Computer Engineering at
Kwangwoon University, Seoul, Korea. In 2011, he joined
Simon Fraser University as a visiting scholar. He was
elevated to an IEEE Senior Member in 2004. He is one of
main inventors in many essential patents licensed to
MPEG-LA for HEVC standard. His current research
interests include video coding, video processing, computer
vision, and video communication.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.035
35
IEIE Transactions on Smart Processing and Computing
Deep Convolution Neural Networks in Computer Vision:
a Review
Hyeon-Joong Yoo
Department of IT Engineering, Sangmyung University / Chonan, Choongnam-Do 330-720 Korea [email protected]
* Corresponding Author:
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
* Review Paper: This paper reviews the recent progress possibly including previous works in a particular research topic, and
has been accepted by the editorial board through the regular reviewing process.
Abstract: Over the past couple of years, tremendous progress has been made in applying deep
learning (DL) techniques to computer vision. Especially, deep convolutional neural networks
(DCNNs) have achieved state-of-the-art performance on standard recognition datasets and tasks
such as the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). Among them, the GoogLeNet network, which is a radically redesigned DCNN based on the Hebbian principle and scale invariance, set the new state of the art for classification and detection in the ILSVRC 2014. Since
there exist various deep learning techniques, this review paper is focusing on techniques directly
related to DCNNs, especially those needed to understand the architecture and techniques employed
in GoogLeNet network.
Keywords: Deep learning, Convolutional neural network, ImageNet, Computer vision, GoogLeNet
1. Introduction
Over the last couple of years, deep learning techniques
have made tremendous progress in computer vision,
especially in the field of object recognition. Deep learning
is a family of methods that uses deep architectures to learn
high-level feature representation. (Several much lengthier
definitions can be found in [1].) When a network has more
than one hidden layer, it is of deep architecture. The
essence of deep learning is to compute hierarchical
features or representations of the observational data, where
the higher-level features or factors are defined from lower-level ones.
The family of deep learning methods have been
growing increasingly richer, encompassing those of neural
networks, hierarchical probabilistic models, and a variety
of unsupervised and supervised feature learning algorithms.
Active researchers in this area include those at
University of Toronto, New York University, University of
Montreal, Stanford University, Google, Baidu, Microsoft
Research, Facebook, just to name a few. These researchers
have demonstrated empirical successes of deep learning in
diverse applications of computer vision, phonetic
recognition, voice search, conversational speech
recognition, speech and image feature coding, robotics,
and so on.
In this paper we focus on the architecture and training
methods of deep networks, specifically deep convolutional
neural networks.
1.1 History of Neural Networks
Historically, the concept of deep learning originated
from artificial neural network research. The history of
artificial neural network is filled with individuals from
many different fields, including psychologists and
physicists. The early model of an artificial neuron is
introduced by McCulloch and Pitts in 1943 [2]. Their work
is often acknowledged as the origin of the neural network
field. They showed that networks of artificial neurons
could, in principle, compute any arithmetic or logical
function.
In 1949, Hebb [3] proposed a mechanism for learning
in biological neurons, and introduced the Hebb learning
rule. The Hebb rule is often paraphrased as "Neurons that
fire together wire together." If the set of input patterns
used in training are mutually orthogonal, the association
can be learned by a two-layer pattern associator using
Hebbian learning. However, if the set of input patterns are
not mutually orthogonal, interference may occur and the
network may not be able to learn associations, resulting in
a low absolute capacity of the Hebb rule.
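A tiny sketch of such a two-layer Hebbian pattern associator (a textbook illustration, not from the reviewed papers; the patterns are made-up): with mutually orthogonal input patterns, the stored associations are recalled exactly, which is precisely the condition stated above.

```python
import numpy as np

# Hebbian learning for a two-layer pattern associator: W accumulates outer
# products of (target, input) pairs, and recall is a matrix-vector product.
inputs  = np.array([[1, 1, -1, -1],      # two mutually orthogonal input patterns
                    [1, -1, 1, -1]], dtype=float)
targets = np.array([[1, -1],
                    [-1, 1]], dtype=float)

W = sum(np.outer(t, x) for x, t in zip(inputs, targets))   # Hebb rule: dW = t x^T

for x, t in zip(inputs, targets):
    recalled = np.sign(W @ x)
    print(recalled, "target:", t)   # orthogonal inputs -> exact recall, no interference
```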
The first practical application of artificial neural
network came with the invention of the perceptron
network and associated learning rule by Rosenblatt in the
late 1950s [4]. With the perceptron network, he
demonstrated its ability to perform pattern recognition. At
about the same time, Widrow and Hoff [5] introduced a
new learning algorithm called the Widrow-Hoff learning
rule, and used it to train adaptive linear neural networks,
which were similar to Rosenblatt’s perceptron.
Unfortunately, both of the networks suffered from the
same inherent limitations, which were widely publicized in
a book by Minsky and Papert [6]. They discovered two key
issues with the computational machines that processed
neural networks. The first issue was that single-layer
neural networks were incapable of solving the exclusive-or
(XOR) problem. The second significant issue was that
computers were not sophisticated enough to effectively
handle the long run time required by large neural networks.
Many people, influenced by Minsky and Papert, believed
that further research on neural networks was a dead end.
The book caused many researchers to leave the field and
nearly killed neural net research for more than a decade.
However, some important work continued during the
1970s. Kohonen [7] and Anderson [8] independently and
separately developed new neural networks that could act as
memories. Grossberg [9] was also very active during this
period in the investigation of self-organizing networks.
During the 1980s research in neural networks increased
dramatically. The impediments in the 1960s were
overcome, and, in addition, important new concepts were
introduced and responsible for the rebirth of neural
networks. Firstly, in 1982, physicist Hopfield used
statistical mechanics to explain the operation of a certain
class of recurrent network, which could be used as an
associative memory [10]. Another key development was
the backpropagation algorithm, a generalized form of the
delta rule, for training multilayer perceptron networks.
Although the derivation procedure had previously been
published by Werbos in [11], the most influential
publication of the backpropagation algorithm was by
Rumelhart and McClelland [12]. They showed that this
method works for the class of semilinear activation
functions (non-decreasing and differentiable). It was the
answer to the criticisms Minsky and Papert had made in
1969. These new developments reinvigorated the field of
neural networks.
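The generalized delta rule itself is compact. Below is a minimal NumPy sketch of backpropagation for a two-layer network with sigmoid (semilinear) units, applied to the XOR problem that single-layer networks cannot solve; the architecture, learning rate, and iteration count are illustrative choices, not values from [12]:

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR data: the problem single-layer perceptrons cannot solve.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # hidden layer, 4 sigmoid units
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # output layer
lr = 1.0

for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backward pass: generalized delta rule for sigmoid (semilinear) units
    delta_out = (y - T) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ delta_out;  b2 -= lr * delta_out.sum(0)
    W1 -= lr * X.T @ delta_hid;  b1 -= lr * delta_hid.sum(0)

print(np.round(y.ravel(), 2))   # typically approaches [0, 1, 1, 0] as the loss decreases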
One long-term goal of machine learning research is to
produce methods that are applicable to highly complex
tasks, such as perception (vision, audition), reasoning,
intelligent control, and other artificially intelligent
behaviors. In order to progress toward this goal, algorithms
that can learn highly complex functions with minimal need
for prior knowledge, and with minimal human intervention
must be discovered. Shallow architectures, however, can be
very inefficient in terms of the required number of
computational elements and training examples. Deep
architectures, on the other hand, have their own
backpropagation-specific problems, such as slow
learning (the further the weights are from the output
layer, the slower backpropagation learns) and overfitting.
Researchers tried using stochastic gradient descent and
backpropagation to train deep networks. Unfortunately,
except for a few special architectures, they didn't have
much luck. The networks would learn, but very slowly,
and in practice often too slowly to be useful.
Building on Rumelhart et al. [12], LeCun et al. [13]
showed that stochastic gradient descent via
backpropagation was effective for training convolutional
neural networks (CNNs). CNNs saw commercial use with
[14], but then fell out of fashion with the rise of support
vector machines and other, much simpler methods such as
linear classifiers.
Promising new methods like Hinton et al.’s [15] have
been developed that enable learning in deep neural nets,
and in 2012, Krizhevsky et al. [16] rekindled interest in
CNNs by showing substantially higher image classification
accuracy on the ImageNet Large Scale Visual Recognition
Challenge (ILSVRC, http://www.image-net.org/). These
techniques have enabled much deeper (and larger)
networks to be trained - people now routinely train
networks with many hidden layers. And, it turns out that
these perform far better on many problems than shallow
neural networks [16-18].
The rest of this paper is organized as follows. The next
section describes in chronological order important
historical events and techniques needed to understand the
architecture and training methods of current state of the art
deep convolutional neural network, GoogLeNet. Then the
GoogLeNet network is investigated deeper in Section 3.
Finally, we conclude this paper in Section 4.
2. Deep Convolution Neural Networks
Shallow architectures have been shown effective in
solving simple or well-constrained problems, but their
limited modeling and representational power can cause
difficulties when dealing with more complicated real-world applications involving natural signals such as natural
images and visual scenes. Human information processing
mechanisms suggest the need for deep architectures for
extracting complex structure and building internal
representations from rich sensory inputs. However, training
deep neural networks is hard, as backpropagated gradients
quickly vanish exponentially in the number of layers. A set
of techniques has been developed that enable learning in
deep neural networks. People now routinely train networks
with many hidden layers. And, it turns out that these
perform far better on many problems than shallow neural
networks. Deep neural networks have finally attracted
wide-spread attention. In this section, we describe some
important techniques and events necessary to understand
the deep convolutional neural network called GoogLeNet.
Convolutional networks are an attempt to solve the
dilemma between small networks that cannot learn the
training set, and large networks that seem overparameterized.
Fig. 1. Hierarchical structure of the neocognitron
network (Adapted from [19]).
Fig. 2. Architecture of CNN in 1989 (Adapted from [13]).
2.1 Neocognitron
A hierarchical multilayered artificial neural network
called neocognitron was proposed by Fukushima in 1980
[19]. Local features in the input are integrated gradually
and classified in the higher layers. Fig. 1 shows its
architecture. Later, in 1989, the idea of local feature
integration is adapted in LeCun et al.’s convolutional
neural networks [13].
The neocognitron was specifically proposed to address
the problem of handwritten character recognition. It is a
hierarchical network with many pairs of layers
corresponding to simple (S layer) and complex (C layer)
cells with a very sparse and localized pattern of
connectivity between layers.
The number of planes of simple cells and of complex
cells within a pair of S and C layers being the same, these
planes are paired, and the complex plane cells process the
outputs of the simple plane cells. The simple cells are
trained so that the response of a simple cell corresponds to a
specific portion of the input image. If the same part of the
image occurs with some distortion, in terms of scaling or
rotation, a different set of simple cells responds to it. The
complex cells fire to indicate that some simple cell they
correspond to has fired. While simple cells respond to what is
in a contiguous region in the image, complex cells respond
on the basis of a larger region. As the process continues to
the output layer, the C-layer component of the output layer
responds, corresponding to the entire image presented in the
beginning at the input layer. The neocognitron, however,
lacked a supervised training algorithm.
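The S/C pairing can be caricatured in a few lines of NumPy (an illustrative toy, not Fukushima's actual model): a simple-cell plane responds where a small template matches the input, and the corresponding complex-cell plane reports whether any simple cell fired within a larger neighborhood, giving tolerance to small shifts:

import numpy as np

def s_layer(image, template, threshold=0.9):
    """Simple cells: correlate a local template over the image."""
    th, tw = template.shape
    H, W = image.shape
    out = np.zeros((H - th + 1, W - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + th, j:j + tw]
            out[i, j] = float(np.sum(patch * template) >= threshold * np.sum(template * template))
    return out

def c_layer(s_map, pool=2):
    """Complex cells: fire if any simple cell in a local region fired."""
    H, W = s_map.shape
    out = np.zeros((H // pool, W // pool))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = s_map[i * pool:(i + 1) * pool, j * pool:(j + 1) * pool].max()
    return out

img = np.zeros((8, 8)); img[2, 2:5] = 1.0   # a short horizontal bar
template = np.array([[1.0, 1.0, 1.0]])      # horizontal-bar detector
print(c_layer(s_layer(img, template)))      # the C-plane response tolerates small shifts of the bar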
2.2 CNN
Building on Rumelhart et al. [12], LeCun et al. [13]
showed that stochastic gradient descent via
backpropagation was effective for training convolutional
neural networks, a class of models that extend the
neocognitron. LeCun et al. presented a convolutional
neural network consisting of 3 hidden layers including 2
convolutional layers, and 64,660 connections. They
applied it to handwritten zip code recognition. Fig. 2
shows its architecture. Unlike previous works, the network
was directly trained on a low-level representation of data
that had minimal preprocessing (as opposed to elaborate
feature extraction), thus demonstrating the ability of
backpropagation networks to deal with large amounts of
low-level information.
The first hidden layer is composed of several planes
called feature maps. Distinctive features of an object can
appear at various locations on the input image. It seems
judicious to have all units in a plane share the same set of
weights, thereby detecting the same feature at different
locations. Thus the detection of a particular feature at any
location on the input can be easily done using the "weight
sharing" technique. Since the exact position of the feature
is not important, the feature maps need not have as many
units as the input. However, units do not share biases. Due
to the weight sharing, the network has few free parameters.
For example, though layer H1 has 19,968 connections, it
has only 1,068 free parameters (768 biases plus 25 times
12 feature kernels).
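The arithmetic behind these counts is easy to verify; in the sketch below, the 8x8 size of each feature map is inferred from the 768 units quoted above rather than stated in the text:

# Weight sharing in LeCun et al.'s 1989 layer H1 (sizes inferred from the
# counts quoted in the text: 12 feature maps, 5x5 kernels, 768 units in total).
feature_maps = 12
kernel = 5 * 5
units_per_map = 768 // feature_maps          # 64 units, i.e., an 8x8 map
units = feature_maps * units_per_map         # 768 units -> 768 biases

connections = units * (kernel + 1)           # every unit: 25 inputs + 1 bias
free_parameters = feature_maps * kernel + units   # shared kernels + per-unit biases

print(connections)      # 19968 connections
print(free_parameters)  # 1068 free parameters (25 x 12 kernels + 768 biases)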
The connection scheme between H1 and H2 is slightly
more complicated: Each unit in H2 combines local
information coming from 8 of the 12 different feature
maps in H1. Its receptive field is composed of eight 5 by 5
neighborhoods centered on units that are at identical
positions within each of the eight maps. Once again, all
units in a given map are constrained to have identical
weight vectors. Layer H3 has 30 units, and is fully
connected to H2. The output layer is also fully connected
to H3. In summary, the network has 1,256 units, 64,660
connections, and 9,760 independent parameters.
The target values for the output units were chosen
within the quasilinear range of the sigmoid. This prevents
the weights from growing indefinitely and prevents the
output units from operating in the flat spot of the sigmoid.
During each learning experiment, the patterns were
repeatedly presented in a constant order. The weights were
updated according to the stochastic gradient or "on-line"
procedure. From empirical study (supported by theoretical
arguments), the stochastic gradient was found to converge
much faster than the true gradient, especially on large,
redundant databases. It also finds solutions that are more
robust.
2.3 LeNet-5
Convolutional networks combine three architectural
ideas to ensure some degree of shift, scale, and distortion
invariance: local receptive fields, shared weights (or
weight replication), and spatial or temporal sub-sampling.
Fig. 3 shows the architecture of LeNet-5, a convolutional
neural network proposed by LeCun et al. in 1998 [14],
which was used commercially for reading bank checks.
A convolutional layer is composed of several feature
maps (with different weight vectors), so that multiple
features can be extracted at each location. The receptive
fields of contiguous units in a feature map are centered on
correspondingly contiguous units in the previous layer.
Therefore receptive fields of neighboring units overlap. All
the units in a feature map share the same set of weights
and the same bias so they detect the same feature at all
possible locations on the input.
A sequential implementation of a feature map would
scan the input image with a single unit that has a local
receptive field, and store the states of this unit at
corresponding locations in the feature map. This operation
is equivalent to a convolution, followed by an additive bias
and squashing function, hence the name convolutional
network.
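That sentence translates almost directly into code. A minimal NumPy sketch (illustrative sizes; tanh stands in for the squashing function):

import numpy as np

def feature_map(image, kernel, bias):
    """Scan one unit with a local receptive field over the image:
    cross-correlation, then an additive bias and a squashing function."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return np.tanh(out)   # squashing function

img = np.random.default_rng(0).normal(size=(16, 16))
k = np.ones((5, 5)) / 25.0                   # one shared 5x5 kernel defines the whole map
print(feature_map(img, k, bias=0.1).shape)   # (12, 12): one feature map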
An interesting property of convolutional layers is that if
the input image is shifted, the feature map output will be
shifted by the same amount, but will be left unchanged
otherwise. This property is at the basis of the robustness of
convolutional networks to shifts and distortions of the
input. A large degree of invariance to geometric
transformations of the input can be achieved with this
progressive reduction of spatial resolution compensated by
a progressive increase of the richness of the representation
(the number of feature maps).
The convolution/subsampling combination, inspired by
Hubel and Wiesel’s notions of "simple" and "complex"
cells [20], was implemented in Fukushima's Neocognitron
[19], though no globally supervised learning procedure
such as backpropagation was available then.
Starting with LeNet-5 [14], convolutional neural
networks (CNN) have typically had a standard structure –
stacked convolutional layers (optionally followed by
contrast normalization and maxpooling) are followed by
one or more fully-connected layers. All the weights are
learned with backpropagation. Variants of this basic design
are prevalent in the image classification literature and have
yielded the best results to-date on MNIST, CIFAR and
most notably on the ImageNet classification challenge [16,
18, 21]. For larger datasets such as Imagenet, the recent
trend has been to increase the number of layers [22] and
layer size [23, 24], while using dropout [25] to address the
problem of overfitting.
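A hedged PyTorch sketch of this standard structure, in the spirit of LeNet-5 but only approximating its published layer sizes, assuming a 32x32 single-channel input:

import torch
import torch.nn as nn

# Stacked convolution + subsampling layers followed by fully connected layers.
lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28, 6 feature maps
    nn.Tanh(),
    nn.AvgPool2d(2),                  # subsampling: 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10, 16 feature maps
    nn.Tanh(),
    nn.AvgPool2d(2),                  # 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),
    nn.Tanh(),
    nn.Linear(120, 84),
    nn.Tanh(),
    nn.Linear(84, 10),                # 10 output classes (e.g., digits)
)

x = torch.randn(1, 1, 32, 32)         # one dummy 32x32 image
print(lenet_like(x).shape)            # torch.Size([1, 10])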
2.4 Multi-Scale ConvNet
Sermanet et al. [26] modified the traditional ConvNet
(convolutional networks) architecture by feeding 1st stage
features in addition to 2nd stage features to the classifier as
shown in Fig. 4. Each stage is composed of a
(convolutional) filter bank layer, a non-linear transform
layer, and a spatial feature pooling layer. ConvNets are
Fig. 3. Architecture of LeNet-5, a Convolutional Neural Network, here for characters recognition. Each plane is a
feature map, i.e., a set of units whose weights are constrained to be identical (Adapted from [14]).
Fig. 4. A 2-stage ConvNet architecture. The input is processed in a feedforward manner through two stages of
convolutions and subsampling, and finally classified with a linear classifier. The output of the 1st stage is also fed
directly to the classifier as higher-resolution features. (Adapted from [26]).
generally composed of one to three stages, capped by a
classifier composed of one or two additional layers.
Contrary to Fan et al.’s [27], they used the output of the
first stage after pooling/subsampling rather than before.
Additionally, applying a second subsampling stage on the
branched output yielded higher accuracies than with just
one. Feeding the outputs of all the stages to the classifier
allows the classifier to use not only high-level features,
which tend to be global and invariant but carry few precise
details, but also pooled low-level features, which tend to
be more local and less invariant and which encode
local motifs more accurately. The motivation for
combining representation from multiple stages in the
classifier is to provide different scales of receptive fields to
the classifier.
When applied to the task of traffic sign classification as
part of the GTSRB competition, experiments produced a
new record of 99.17%, above the human performance of
98.81%.
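A minimal PyTorch sketch of the multi-scale idea (layer sizes and the 43-class output are illustrative assumptions, not the configuration of [26]): the pooled stage-1 output is subsampled once more, flattened, and concatenated with the stage-2 features before the linear classifier.

import torch
import torch.nn as nn

class MultiScaleNet(nn.Module):
    def __init__(self, n_classes=43):                 # e.g., 43 traffic-sign classes
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 5), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2))
        self.branch_pool = nn.MaxPool2d(2)             # extra subsampling on the branched output
        # Feature sizes below assume a 32x32 RGB input.
        self.classifier = nn.Linear(16 * 7 * 7 + 32 * 5 * 5, n_classes)

    def forward(self, x):
        f1 = self.stage1(x)                            # 16 x 14 x 14 (low level, local)
        f2 = self.stage2(f1)                           # 32 x 5 x 5   (high level, global)
        f1_branch = self.branch_pool(f1)               # 16 x 7 x 7, fed directly to the classifier
        feats = torch.cat([f1_branch.flatten(1), f2.flatten(1)], dim=1)
        return self.classifier(feats)

print(MultiScaleNet()(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 43])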
2.5 ReLU and Dropout
Sparsity was
first introduced in computational
neuroscience in the context of sparse coding in the visual
system [28]. The neuroscience literature [29, 30] indicates
that cortical neurons are rarely in their maximum
saturation regime, and suggests that their activation
function can be approximated by a rectifier.
In 2011, Glorot et al. [31] showed that using a
rectifying non-linearity gives rise to real zeros of
activations and thus truly sparse representations. They
showed that rectifying neurons are an even better model of
biological neurons and yield equal or better performance
than hyperbolic tangent networks in spite of the hard nonlinearity and non-differentiability at zero, creating sparse
representations with true zeros, which seem remarkably
suitable for naturally sparse data. Their experiments on
image and text data indicated that training proceeds better
when the artificial neurons are either off or operating
mostly in a linear regime. Rectifying activation allowed
deep networks to achieve their best performance without
unsupervised pre-training on purely supervised tasks with
large labeled datasets.
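The difference is easy to see numerically; a small NumPy sketch (illustrative, not an experiment from [31]) shows that the rectifier yields exact zeros, hence truly sparse representations, while tanh merely yields small values:

import numpy as np

rng = np.random.default_rng(0)
pre_activation = rng.normal(size=(1000,))     # pre-activations of a hidden layer

relu_out = np.maximum(pre_activation, 0.0)    # rectifier: max(x, 0)
tanh_out = np.tanh(pre_activation)

print(np.mean(relu_out == 0.0))   # about 0.5: exact zeros, a truly sparse representation
print(np.mean(tanh_out == 0.0))   # 0.0: small values, but never exactly zero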
In 2012, Krizhevsky et al. [16] rekindled interest in
CNNs by showing substantially higher image classification
accuracy on the ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) with their model achieving an error
rate of 16.4%, compared to the 2nd place result of 26.1%.
They used purely supervised learning, suggesting that
classical backpropagation does well even without
unsupervised pre-training. Their success resulted from
training a large, deep CNN, shown in Fig. 5, on 1.2 million
labeled images, together with a few twists on LeCun’s
CNN, i.e., max(x, 0) rectifying non-linearities and
“dropout” regularization.
In terms of training time with gradient descent, the
saturating nonlinearities f(x) = tanh(x) or f(x) = (1 + e^-x)^-1
are much slower than the non-saturating nonlinearity f(x) =
max(x, 0). Following Nair and Hinton [32], they refer to
neurons with this nonlinearity as Rectified Linear Units
(ReLUs). Deep convolutional neural networks with ReLUs
train several times faster than their equivalents with tanh
units. Faster learning has a great influence on the
performance of large models trained on large datasets.
When a large feedforward neural network is trained on
a small training set, it typically performs poorly on held-out test data due to its high capacity. To prevent this
“overfitting”, they employed a regularization approach
called “dropout” that stochastically sets half the activations
within a hidden layer to zero for each training sample
during training. They used dropout in the first two fully-connected layers (see Fig. 6). Thereby a hidden unit
cannot rely on other hidden units being present. This
prevents complex co-adaptations on the training data. It
has been shown to deliver significant gains in performance
across a wide range of problems.
Another way to view the dropout procedure is as a very
efficient way of performing model averaging with neural
networks. A good way to reduce the error on the test set is
to average the predictions produced by a very large
number of different networks. The standard way to do this
is to train many separate networks and then to apply each
of these networks to the test data, but this is
computationally expensive during both training and testing.
Random dropout makes it possible to train a huge number
of different networks in a reasonable time. There is almost
certainly a different network for each presentation of each
training case but all of these networks share the same
weights for the hidden units that are present.
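A minimal NumPy sketch of the procedure (this uses the "inverted" scaling at training time, one common way to implement dropout; Krizhevsky et al. instead halve the outgoing activations at test time):

import numpy as np

def dropout(activations, p_drop=0.5, rng=np.random.default_rng(0)):
    """Zero each hidden activation independently with probability p_drop.
    Scaling by 1/(1 - p_drop) keeps the expected activation unchanged,
    so no rescaling is needed at test time (inverted dropout)."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones((4, 8))               # a batch of hidden-layer activations
print(dropout(h))                 # roughly half the units are zeroed for each example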
Dropout does not seem to have the same benefits for
convolutional layers, which are common in many networks
designed for vision tasks. Dropout roughly doubles the
number of iterations required to converge. Without
dropout, their network exhibited substantial overfitting.
They achieved record-breaking results on a highly
challenging dataset using purely supervised learning. They
found that the depth (five convolutional and three fully-connected layers) really was important for achieving their
results, by observing that removing a convolutional layer
(each of which contains no more than 1% of the model's
parameters) resulted in inferior performance (a loss of
about 2% in the top-1 performance of the network).
Fig. 5. Architecture of Krizhevsky et al.'s DCNN (Adapted from [16]).
Fig. 6. Sparse propagation of activations and gradients (Adapted from [31]).
2.6 ConvNet Visualization
Though large ConvNets (convolutional networks) have
recently demonstrated impressive classification
performance, there is no clear understanding of why they
perform so well, or how they might be improved, which is
deeply unsatisfactory from a scientific standpoint. Without
a clear understanding of how and why they work, the
development of better models is reduced to trial-and-error.
In 2013, Zeiler et al. addressed both issues [24]. They
introduced a visualization technique that gives insight into
the function of intermediate feature layers and the
operation of the classifier. The visualization technique uses
a multi-layered deconvolutional network (deConvNet), as
proposed by Zeiler et al. [33], to project the feature
activations back to the input pixel space. DeConvNets are
not used in any learning capacity, just as a probe of an
already trained ConvNet. They also performed an ablation
study to discover the performance contribution from
different model layers. This enabled them to find model
architectures that outperformed Krizhevsky et al. [16] on
the ImageNet classification benchmark.
2.7 Sparsely Connected Architecture
Bigger size typically means two drawbacks: a larger
number of parameters, which makes the enlarged network
more prone to overfitting, especially if the number of
labeled examples in the training set is limited; and the
dramatically increased use of computational resources,
which are always finite, in a deep vision network where
convolutional layers are chained.
The fundamental way of solving both issues would be
to ultimately move from fully connected to sparsely
connected architectures, even inside the convolutions.
Besides mimicking biological systems, this would also
have the advantage of firmer theoretical underpinnings due
to the work of Arora et al. [34]: if the probability
distribution of the dataset is representable by a large, very
sparse deep neural network, then the optimal network
structure can be learned layer-wise by analyzing the
correlation statistics of the activations of the last layer and
clustering neurons with highly correlated outputs. Szegedy
et al. [18] took inspiration and guidance from this
theoretical work, and their GoogLeNet DCNN set the new
state of the art in the ILSVRC 2014 classification and
detection challenges.
3. A Deeper and Wider CNN
In this section the important factors in the architecture and
training methods of GoogLeNet's classification and
detection submissions are described. Most of the content
in this section is from Szegedy et al.'s [18].
Szegedy et al.’s GoogLeNet won the classification and
object recognition challenges in the ILSVRC 2014 by
setting the new state of the art. GoogLeNet, a radically
redesigned DCNN, used a new variant of convolutional
neural network called “Inception” for classification, and
the R-CNN [17] for detection. GoogLeNet submission to
ILSVRC 2014 actually uses 12x fewer parameters than the
winning architecture of Krizhevsky et al. [16] from two
years ago, while being significantly more accurate.
Compared to the 2013 result, the detection accuracy has
almost doubled from 22.6% to 43.9%. The ILSVRC
detection task is to produce bounding boxes around objects
in images among 200 possible classes.
GoogLeNet uses a significantly deeper and wider
convolutional neural network architecture than traditional
DCNNs at the cost of a modest growth in evaluation time.
Fig. 7 shows a schematic view of GoogLeNet network
which includes 9 repeated layers of so-called Inception
module. Fig. 8 shows Inception module with dimension
reduction. Inspired by a neuroscience model of primate
visual cortex, the Inception model uses a series of trainable
filters of different sizes in order to handle multiple scales.
The main idea of the Inception architecture is based on
finding out how an optimal local sparse structure in a
convolutional vision network can be approximated and
covered by readily available dense components. An
alternative parallel pooling path in the module is added in
each stage since pooling operations have been essential for
the success in current state of the art convolutional
networks.
Since even a modest number of 5x5 convolutions can
be prohibitively expensive on top of a convolutional layer
with a large number of filters, 1x1 convolutions are placed
before the expensive 3x3 and 5x5 convolutions as
dimension reduction module. This allows for not just
increasing the depth, but also the width of the networks
without significant performance penalty. The resultant
architecture leads to over 10x reduction in the number of
parameters compared to most state of the art vision
networks, which reduces overfitting during training and
allows the system to perform inference with low memory
footprint.
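A hedged PyTorch sketch of one Inception module with dimension reduction (the filter counts are illustrative, not GoogLeNet's actual configuration): 1x1 convolutions reduce the channel depth before the 3x3 and 5x5 convolutions, a parallel pooling path is projected by another 1x1 convolution, and the four branches are concatenated along the channel dimension.

import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, c_in, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        relu = nn.ReLU(inplace=True)
        self.b1 = nn.Sequential(nn.Conv2d(c_in, c1, 1), relu)                   # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c3r, 1), relu,                  # 1x1 reduction
                                nn.Conv2d(c3r, c3, 3, padding=1), relu)         # then 3x3
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c5r, 1), relu,                  # 1x1 reduction
                                nn.Conv2d(c5r, c5, 5, padding=2), relu)         # then 5x5
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),           # pooling path
                                nn.Conv2d(c_in, cp, 1), relu)                   # 1x1 projection

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

m = Inception(192, 64, 96, 128, 16, 32, 32)        # illustrative channel counts
print(m(torch.randn(1, 192, 28, 28)).shape)         # torch.Size([1, 256, 28, 28])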
All the convolutions, including those inside the
Inception modules, use rectified linear activation. However,
Fig. 7. A schematic view of GoogLeNet network (Adapted from [18]).
Fig. 8. Inception module with dimension reduction (Adapted from [18]).
given the relatively large depth of the network, the ability
to propagate gradients back through all the layers in an
effective manner was a concern. To solve this issue,
inspired by [34] in which a layer-by-layer construction is
suggested, they add auxiliary classifiers connected to the
layers in the middle of the network during training,
expecting to encourage discrimination in the lower stages
in the classifier, increase the gradient signal that gets
propagated back, and provide additional regularization.
However, the biggest gains in object detection have
come from the synergy of deep architectures and classical
computer vision, like the R-CNN algorithm by Girshick et
al. [17]. For the classification challenge entry, several ideas
from the work of [35] were incorporated and evaluated,
specifically as they relate to image sampling during
training and evaluation. Their approach yielded solid
evidence that moving to sparser architectures is a feasible
and useful idea in general.
4. Conclusions
Deep learning techniques have made tremendous
progress in computer vision, and are significantly
outperforming other techniques, and even humans, in
certain limited recognition tests. Most of this progress is
not just the result of more powerful hardware, larger
datasets, and bigger models, but mainly a consequence of
new ideas, algorithms, and improved network architectures.
Although the tremendous progress of this new
technology is very impressive and encouraging, it is still
difficult to predict the future success of deep neural
networks. Remembering that we still know very little about
the brain's mechanisms and have many orders of magnitude
to go in order to match the infero-temporal pathway of the
primate visual system, the most important advances in
deep neural networks certainly lie in the future.
Acknowledgment
This material is based upon work supported by
Sangmyung University, Korea. Comments by Prof.
Dongsun Park at Chonbuk National University, Korea,
greatly helped to improve an earlier version of this
manuscript.
References
[1] L. Deng and D. Yu, Deep Learning Methods and
Applications, now Publishers Inc., 2014.
[2] W. McCulloch and W. Pitts, “A Logical calculus of
the ideas immanent in nervous activity”, Bulletin of
Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[3] Donald O. Hebb, The Organization of Behavior: A
Neuropsychological Theory, Wiley, June 1949.
[4] F. Rosenblatt, “The perceptron: A probabilistic model
for information storage and organization in the
brain,” Psychological Review, vol. 65, pp. 386-408,
1958. Article (CrossRef Link).
[5] B. Widrow and M.E. Hoff, Jr., “Adaptive Switching
Circuits,” IRE WESCON Convention Record, Part 4,
pp. 96-104, August 1960.
[6] M. Minsky and Seymour Papert, Perceptrons,
Cambridge, MIT Press, 1969.
[7] T. Kohonen, "Correlation Matrix Memories," IEEE
Transactions on Computers, vol. 21, pp. 353-359,
April 1972. Article (CrossRef Link)
[8] James A. Anderson, “A simple neural network
generating an interactive memory,” Mathematical
Biosciences, vol. 14, pp. 197–220, 1972. Article
(CrossRef Link)
[9] S. Grossberg, "Adaptive pattern classification and
universal recoding: I. Parallel development and
coding of neural feature detectors," Biological
Cybernetics, vol. 23, pp. 121-134, 1976. Article
(CrossRef Link)
[10] J. J. Hopfield, “Neural Networks and Physical
Systems with Emergent Collective Computational
Abilities,” Proceedings of the National Academy of
Sciences of the United States of America (PNAS), vol.
79, pp. 2554–2558, 1982. Article (CrossRef Link)
[11] P. J. Werbos, Beyond Regression: New Tools for
Prediction and Analysis in the Behavioral Sciences,
PhD thesis, Harvard University, 1974.
[12] D. E. Rumelhart, G. E. Hinton, and R. J. Williams,
“Learning internal representations by error
propagation,” Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, D. E.
Rumelhart and J. L. McClelland, eds., vol. I, pp. 318-362, MIT, Cambridge, 1986.
[13] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R.
E. Howard, W. Hubbard, and L. D. Jackel,
“Backpropagation applied to handwritten zip code
recognition,” Neural Computation, vol. 1, pp. 541–
551, 1989. Article (CrossRef Link)
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,
“Gradient based learning applied to document
recognition,” Proceedings of the IEEE, vol. 86, pp.
2278–2324, 1998. Article (CrossRef Link)
[15] Geoffrey E. Hinton and Simon Osindero, “A fast
learning algorithm for deep belief nets,” Neural
Computation, vol. 18, pp. 1527-1554, 2006. Article
(CrossRef Link)
[16] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton,
“Imagenet classification with deep convolutional
neural networks,” Advances in Neural Information
Processing Systems 25, pp. 1106–1114, 2012.
[17] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and
Jitendra Malik, “Rich feature hierarchies for accurate
object detection and semantic segmentation,” CVPR
2014, IEEE Conference on, 2014.
[18] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre
Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, Andrew Rabinovich,
“Going Deeper with Convolutions,” CoRR, 2014.
[19] K. Fukushima, “Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,” Biol. Cybernetics, vol. 36, pp. 193-202, 1980. Article (CrossRef Link)
[20] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex,” Journal of Physiology (London), vol. 160, pp. 106-154, 1962. Article (CrossRef Link)
[21] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” ECCV 2014 (Honorable Mention for Best Paper Award), arXiv 1311.2901, Nov. 28, 2013.
[22] Min Lin, Qiang Chen, and Shuicheng Yan, “Network in network,” CoRR, abs/1312.4400, 2013.
[23] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” CoRR, abs/1312.6229, 2013.
[24] Matthew D. Zeiler, Hierarchical Convolutional Deep Learning in Computer Vision, Ph.D. Thesis, Nov. 8, 2013.
[25] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, abs/1207.0580, 2012.
[26] Pierre Sermanet and Yann LeCun, “Traffic Sign Recognition with Multi-Scale Convolutional Networks,” IJCNN, 2011.
[27] J. Fan, W. Xu, Y. Wu, and Y. Gong, “Human tracking using convolutional neural networks,” Neural Networks, IEEE Transactions on, vol. 21, pp. 1610-1623, 2010.
[28] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: a strategy employed by V1?,” Vision Research, vol. 37, pp. 3311-3325, 1997.
[29] P. C. Bush and T. J. Sejnowski, The Cortical Neuron, Oxford University Press, 1995.
[30] R. Douglas and K. Martin, “Recurrent excitation in neocortical circuits,” Science, vol. 269, pp. 981-985, 1995.
[31] Xavier Glorot, Antoine Bordes, and Yoshua Bengio, “Deep Sparse Rectifier Neural Networks,” Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA, vol. 15 of JMLR: W&CP 15, pp. 315-323, 2011.
[32] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” Proc. 27th International Conference on Machine Learning, 2010.
[33] M. Zeiler, G. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” ICCV, 2011.
[34] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma, “Provable bounds for learning some deep representations,” CoRR, abs/1310.6343, 2013.
[35] Andrew G. Howard, “Some improvements on deep convolutional neural network based image classification,” CoRR, abs/1312.5402, 2013.
Hyeon-Joong Yoo is Professor of IT
Engineering Department at Sangmyung University, Chonan, Korea. He
received his B.S. degree in Electronics
Engineering from Sogang University,
Korea, in 1982 and his M.S. and Ph.D.
in Electrical and Computer Engineering from the University of
Missouri in 1991 and 1996, respectively. His current
research interests include computer vision, pattern
recognition, artificial neural network, and image
processing.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.044
44
IEIE Transactions on Smart Processing and Computing
A Single Inductor Dual Output Synchronous High Speed
DC-DC Boost Converter using Type-III Compensation for
Low Power Applications
Abbas Syed Hayder, Hyun-Gu Park, Hongin Kim, Dong-Soo Lee, Hamed Abbasizadeh, and Kang-Yoon Lee
College of Information and Communications Engineering, Sungkyunkwan University / Suwon, South Korea
{hayder,blacklds,squarejj,cbmass85, hamed, klee}@skku.edu
* Corresponding Author: Kang-Yoon Lee
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
* Extended from a Conference: Preliminary results of this paper were presented at the IEEE ISCIT2014.
Abstract: This paper presents a high speed synchronous single inductor dual output boost converter
using Type-III compensation for power management in smart devices. Maintaining multiple
outputs from a single inductor is becoming very important because of inductor sizes. By using a
high switching frequency, the inductor and capacitor sizes are reduced. Owing to synchronous
rectification, this kind of converter is suitable for SoC. The phase is controlled in a time-sharing
manner for each output. The controller used here is Type-III, which ensures a quick settling time and
high stability. The outputs are stable within 58 µs. The simulation results show that the proposed
scheme achieves a better overall performance. The input voltage is 1.8 V, the switching frequency is
5 MHz, and the inductor used is 600 nH. The output voltages and powers are 2.6 V & 3.3 V and
147 mW & 230 mW, respectively.
Keywords: DC converter, Switch mode power supplies, SIMO converters
1. Introduction
As the technology advances, the desire for small area
devices is increasing. Power management is the heart of all
electronic devices, which makes the other blocks alive by
supplying proper voltage levels. DC-DC converters
provide a regulated supply for the rest of the circuit
including the DSP chip in portable devices, with a
different output level from the unregulated input supply.
Voltage regulation for SoC (System-on-Chip) applications
is required to fulfil a variety of functions [2]. These are
classified into two categories, linear regulators and
switching regulators. Conventional power management
techniques consume high power under static conditions,
but switching regulators have high efficiency compared to
conventional techniques. Linear regulators waste most of
their energy in the form of heat, requiring extra heat sinks.
On the other hand switching regulators or choppers,
dissipate a negligible amount of power during conversion.
Chopper circuit requires at least two switches, one
switching source, control circuit, one energy storage
element. The input voltage is chopped off at a specific
frequency. As the switching frequency is increased the
charge and discharge region of the inductor shrinks and
demands very careful calculations, timing diagrams and
circuit design to generate proper signals for gate drives.
Power loss at the power stage increases roughly with the
switching frequency f_s [4]. These losses include conduction losses and switching
losses. The switching speed of MOSFET devices is determined
by the time required to charge or discharge the gate capacitance.
Therefore, multiple MOSFETs in parallel can be used to
achieve a high switching speed. Bootstrap, ø-inverter, and 3rd-generation
gate drive techniques [5, 6] are ideal candidates
for switching the gate voltage at higher speed. In this paper,
a small amount of efficiency is sacrificed to achieve high
speed and lower volume. In addition, efficiency is an
inverse function of the difference between the input and
output voltages. Conventional dc-dc converters suffer
power losses due to their bulky nature and require as many
inductors as outputs. The popular control scheme for
Fig. 1. Classification of DC-DC converters: linear regulators (which waste energy by dissipating excess power as heat) versus switching regulators, the latter including SISO (single inductor, single output) and SIMO (single inductor, multiple outputs) buck/boost topologies, synchronous or asynchronous.
Fig. 2. (a) The circuit of an ideal boost DC-DC converter, (b) The equivalent circuit of the inductor charge phase, (c)
the inductor discharge phase.
switching converters is pulse-width modulation (PWM).
The control uses a constant switching frequency and varies
the duty cycle as the load current varies. This scheme
achieves good regulation, low noise spectrum, and high
efficiency. A DC-DC converter, which is an essential
block in PMIC (Power Management IC), provides a
regulated output voltage from a linearly discharging
battery in portable devices [7]. An LED driver is
essentially a current source (or sink) that maintains a
constant current required for achieving the desired color
and luminous flux from an array of LEDs [8]. In the
proposed converter design the inductor and capacitor sizes
are reduced to 40% of the lowest average value published
thus far. In addition a smooth handover block in the
feedback loop was designed to assist the loop at start up
and avoid indefinite unpredictable behavior.
2. Classification of Converters
Initially, before the evolution of SIMO (Single Inductor
Multiple Output) converters, there were only two categories of dc-dc
converters: linear regulators and switching regulators. Fig. 1
shows the present-day classification of dc-dc converters. Topologies in
each category of switching regulators are divided into four
classes: the buck converter, the boost converter, the inverting
topology, and the transformer flyback topology. The inductor is
charged in the first phase, and then the inductor energy, together with the input
supply, is pushed to the output node, like repeatedly filling and throwing
energy to the output.
The inductor charge phase of a boost converter is
shown in Fig. 2. The equivalent circuit for Φ1 is shown in
Fig. 2(b); it is obtained by closing SW1 and opening
SW2 for a certain on-time t_on. During Φ1 the inductor L is
Fig. 4. Type-III Compensator.
Fig. 3. Block Diagram of the Proposed Converter in [9].
charged by the voltage source Uin, causing the inductor
current iL(t) to increase from its minimum value IL_min to
its maximum value IL_max. Simultaneously, the output
capacitor C is discharged through the load RL. For the
inductor discharge phase Φ2, the equivalent circuit is shown
in Fig. 2(c); it is obtained by opening SW1 and
closing SW2 for a certain off-time t_off. During Φ2, L is
discharged into C and RL, causing iL(t) to decrease. As a
result, iL(t) is divided over C and RL, thereby charging C
and providing power to RL. Because L is discharged in
series with Uin, it can be intuitively seen that the output
voltage Uout will always be higher than Uin [9].
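For orientation, the ideal lossless CCM relations can be written down in a few lines (a textbook approximation using the paper's quoted operating values, not the authors' design equations): Uout = Uin/(1 - D), and the peak-to-peak inductor current ripple during the charge phase is Uin*D/(L*fs).

# Ideal CCM boost relations (lossless, textbook approximation).
def boost_duty(u_in, u_out):
    """Duty cycle needed for a given conversion ratio: Uout = Uin / (1 - D)."""
    return 1.0 - u_in / u_out

def inductor_ripple(u_in, duty, L, f_sw):
    """Peak-to-peak inductor current ripple during the charge phase."""
    return u_in * duty / (L * f_sw)

u_in, L, f_sw = 1.8, 600e-9, 5e6          # values quoted in this paper
for u_out in (2.6, 3.3):
    d = boost_duty(u_in, u_out)
    print(u_out, round(d, 3), round(inductor_ripple(u_in, d, L, f_sw), 3))
# roughly 0.308 duty and 0.185 A ripple for 2.6 V; 0.455 duty and 0.273 A for 3.3 V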
3. Related Work
The initial work with a single output has already been
presented in our previous work [10] and [11]. A 3.7 V high
speed dc-dc synchronous boost converter with smooth loop
handover was discussed there; the switching frequency was 5 MHz,
the inductor size was 600 nH, and an output of 3.7 V was regulated from
a 1.8 V supply. The three main blocks of the dc-dc converter
were discussed, and PWM control was used in the feedback loop.
In [11], on the other hand, we used Type-III
compensation for a single output converter to achieve a quick
settling time. In neither paper was a dual output with Type-III
compensation addressed.
The architecture shown in Fig. 3 has three main
blocks: the power stage, the control stage, and the gate drive stage.
The control stage has two further sub-blocks: PWM
generation and SLHO (Smooth Loop Hand Over). The
current circulating in the power stage is much higher than in
the control stage. The output voltage is sensed, scaled, and
compensated using an error amplifier; it is compared with a
carrier signal to generate a PWM signal proportional to the
output voltage. The PWM signal generated from the feedback loop
and a constant duty cycle signal are fed to the Smooth Loop
Mux (SLM). At startup the KHF is engaged, and after the
predetermined time set in the DB (Delay Block) the
PWM loop is engaged for the rest of the operation. The Delay
Block is triggered 20 µs after power-on of the
converter. At the same time, depending on the load,
Cod is connected to and disconnected from the output using the
control signals Rc and Rcb to reduce the ripples. Ds1 and Ds2
are two non-overlapped signals generated to avoid any loss
or conflict in the inductor current. This architecture works
in CCM (continuous conduction mode) operation. Sm
and Sa are used with anti-parallel diodes, as the MOSFET
channel has to carry the positive current, e.g., from an n-channel
MOSFET's drain to source. If the load is inductive,
current flows in the opposite direction during the interval in
which the switch is turned on, and this current flows through the
diode. Without a diode, the induced current generates
peaks of high voltage, causing the current to stop. At heavy
loads conduction losses dominate, while switching losses
dominate at light loads. We used four stages of cascaded
inverters as gate drives [10].
4. Control Techniques and Efficiency Model
There are many control techniques and compensation
methods in use in dc/dc converters. Some
of them are hysteretic, PI, Type-II, and Type-III control, in
combination with voltage-mode and current-mode sensing.
The peak and average inductor currents are
acquired and monitored as inputs for the controllers.
Feed-forward control can also be used to improve the
dynamic performance of the dc/dc converter. In our
previous design we used a PI control technique.
Conventional PWM techniques have already been
discussed in our previous work, but due to the slow response
time we re-designed our converter with a Type-III controller,
which is comparatively fast. Type-III
compensation is used to improve the transient response
and to boost the crossover frequency and phase margin. It
guarantees the targeted bandwidth and phase margin, as
well as an unconditionally stable control loop [12]. Fig. 4
shows the conventional Type-III compensation using a
voltage op-amp. Simulation results from the previously discussed
single inductor single output (SISO) work are shown in Fig. 5.
The transfer function with the feedback block is given in (2), where R_Y and R_Z are obtained using (7) and (8).
Fig. 5. Simulation results of SISO (output voltage Vo, error signal Err, carrier signal Vcarr, inductor current IL, and duty signal D1 versus time).
There are two poles (f_p1 and f_p2, besides the pole at zero, f_p0) and two zeros (f_z1 and f_z2) provided by this compensation. All passive components involved play a dual role in determining the poles and zeros, so the calculation can become fairly cumbersome and iterative. A valid simplifying assumption, however, is that C1 is much greater than C3 [13], which fixes the locations of the poles and zeros.
Notation: Po: output power; Pc, Ps: conduction and switching losses; trv: rise time; tfl: fall time; CGN: NMOS gate capacitance; CGP: PMOS gate capacitance; CX: equivalent parasitic capacitance at node X; RL, RC: equivalent series resistances of the inductor and capacitor, respectively.
Fig. 6. Block diagram of the proposed SIDO converter.
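As a rough illustration of where such poles and zeros land, the sketch below evaluates commonly used Type-III op-amp compensator formulas for example component values; both the component values and the R1/R2/R3/C1/C2/C3 labeling follow a generic Type-III topology and are assumptions, not the values or labeling of Fig. 4.

import math

# Example component values (assumptions for illustration only).
R1, R2, R3 = 10e3, 20e3, 1e3          # ohms
C1, C2, C3 = 10e-9, 100e-12, 2.2e-9   # farads

f = lambda r, c: 1.0 / (2 * math.pi * r * c)

f_z1 = f(R2, C1)                      # first zero
f_z2 = f(R1 + R3, C3)                 # second zero
f_p1 = f(R3, C3)                      # first pole
f_p2 = f(R2, C1 * C2 / (C1 + C2))     # second pole (about 1/(2*pi*R2*C2) when C1 >> C2)

for name, val in [("fz1", f_z1), ("fz2", f_z2), ("fp1", f_p1), ("fp2", f_p2)]:
    print(name, round(val, 1), "Hz")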
Fig. 7. Phase control signals and Type-III output.
5. Proposed Dual Output Architecture
In our proposed dual output converter, shown in Fig. 6, the inductor energy is shared between the two outputs. Dual output with phase control is presented in [1] with a lower switching frequency, current sensors, and a high inductor value. In contrast, we designed with a high switching frequency, a Type-III controller, a lower inductor value, low load capacitors, and without current sensors, because current sensors occupy extra area. Type-III controllers show a rapid response; to comply with a higher switching frequency, improved controllers are necessary. During phase 1, output Vo1 is maintained by the coordination of Sm and Sa; in the meantime the controller enables sp1, thus selecting Ref1. On the other hand, output Vo2 is maintained by the coordination of Sm and Sb. The charging time of Sm is decided by the controller depending on the present output voltage at Vo2; in this phase Ref2 is selected. The rate of charging the inductor is the same for both outputs, Vg/L, while the discharge slope is (Vg - Vo1)/L for phase 1 and (Vg - Vo2)/L for phase 2. Filtering capacitors Coa and Cob smooth the output glitches.
The overall controller output is shown in Fig. 7(c). A comparison with other works is summarized in Table 1. For the time duration when sp1 is on, the duty ratio of MOSFET Sm is calculated to generate the output required for Vo1; similarly, when sp2 is on, the Sm duty ratio is calculated to stabilize the output Vo2. At any one time, one sensed output is selected using sp1 and sp2. When sp1 is selected, reference voltage Ref1 becomes the input to the Type-III controller; in contrast, when sp2 is selected, Ref2 becomes the input to the controller. Each output is scaled and becomes an input to the Type-III controller, where the error signal is generated after subtraction from the reference signal.
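The time-sharing idea can be sketched as simple Python pseudocode (illustrative only; the actual regulation is performed by the analog Type-III loop described above): on alternating phases the controller selects one output and its reference, and nudges the duty ratio of Sm for that output.

def select_phase(cycle):
    """Alternate phases: even cycles regulate Vo1 (sp1/Ref1), odd cycles Vo2 (sp2/Ref2)."""
    return ("sp1", "Ref1") if cycle % 2 == 0 else ("sp2", "Ref2")

def update_duty(duty, v_out, v_ref, k=0.05, d_min=0.05, d_max=0.95):
    """Crude proportional adjustment of Sm's duty ratio toward the selected reference."""
    duty += k * (v_ref - v_out)
    return min(max(duty, d_min), d_max)

refs = {"Ref1": 2.6, "Ref2": 3.3}
duty = {"Ref1": 0.30, "Ref2": 0.45}
v_out = {"Ref1": 2.5, "Ref2": 3.2}           # momentary (sagging) output voltages

for cycle in range(4):
    phase, ref = select_phase(cycle)
    duty[ref] = update_duty(duty[ref], v_out[ref], refs[ref])
    print(cycle, phase, ref, round(duty[ref], 3))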
Fig. 8. Simulation Results of proposed Converter.
Table 1. Comparison Summary.
Parameters            Unit    [1]            [3]        This Work
Switching frequency   Hz      500 k          1 M        5 M
Input voltage         V       1.8            2.6-5      1.8
Inductor              H       1 µH           22 µH      600 nH
Output voltages       V       3.0, 3.45      1.2        2.6, 3.3
Filtering capacitors  F       44 µ, 47 µ     35 µ       1 µ, 1 µ
Compensation          -       -              -          Type-III
Settling time         s       -              -          58 µs
Current sensors       -       Yes            No         No
Output power          W       150 m, 170 m   240 m      147 m, 230 m
Fig. 9. Output Voltages and Load Currents.
This error signal becomes the input to the comparator, and non-overlapped signals are
generated by the dead-time control. A gate driver is used to generate high current and
to turn the power MOSFETs on and off. We used the existing
clock for the switching frequency; using a clock divider, 2.5 MHz
is achieved through feedback of a D flip-flop to drive sp1
and sp2. The simulation results are shown in Figs. 8(a)-(h).
The generated NMOS signal Sm is shown in Fig. 8(a); signals Sa and
Sb are fed to the high-side PMOS switches located near the
outputs Vo1 and Vo2, respectively. The boosted output voltages
are presented in Figs. 8(d) and (e), where the output voltage
levels settle within 58 µs. The comparator input is given in Fig.
8(f). The carrier signal and inductor current are illustrated in
Figs. 8(g) and (h). Zoomed output voltages and load
currents are shown in Fig. 9 for the same load resistors
connected to both outputs. The average load current drawn by
Vo1 and Vo2 was 56 mA and 70 mA, respectively.
6. Conclusion
This paper proposed a single inductor dual output dc-dc
converter design using Type-III compensation. In the
beginning, the existing single output and dual output
architectures were discussed. The design was then
formulated towards a dual output dc-dc converter with an
improved control scheme, which is essential for high speed
converters, ultimately reducing the inductor and capacitor
sizes. The settling time is 58 µs. The output voltage was scaled
and fed to the controller in phase control mode. In each
phase, the two phase control signals keep track of and adjust
the duty ratios to achieve the desired outputs.
Acknowledgement
This research was supported by the National Research
Foundation of Korea (NRF) grant funded by the Korean
government (MEST) (No. 2011-0009454).
References
[1] D. Ma, W.-H. Ki, C.-Y. Tsui, and P. K. Mok, "A
single-inductor dual-output integrated DC/DC boost
converter for variable voltage scheduling," in
Proceedings of the 2001 Asia and South Pacific
Design Automation Conference, 2001, pp. 19-20.
Article (CrossRef Link)
[2] L. Ming-Xiang, H. Bo-Han, C. Jiann-Jong, and H.
Yuh-Shyan, "A sub-1V voltage-mode DC-DC buck
converter using PWM control technique," in Electron
Devices and Solid-State Circuits (EDSSC), 2010
IEEE International Conference of, 2010, pp. 1-4.
Article (CrossRef Link)
[3] E. Bonizzoni, F. Borghetti, P. Malcovati, F.
Maloberti, and B. Niessen, "A 200mA 93% Peak
Efficiency Single-Inductor Dual-Output DC-DC
Buck Converter," in Solid-State Circuits Conference,
2007. ISSCC 2007. Digest of Technical Papers. IEEE
International, 2007, pp. 526-619. Article (CrossRef
Link)
[4] E. Sánchez-Sinencio and A. G. Andreou, Low-voltage/low-power integrated circuits and systems:
low-voltage mixed-signal circuits vol. 4: Wiley-IEEE
Press, 1999. Article (CrossRef Link)
[5] L. Balogh, "Design and application guide for high
speed MOSFET gate drive circuits," 2001. Article
(CrossRef Link)
[6] S. Mappus, "Predictive gate drive boosts synchronous
DC/DC power converter efficiency," App. Rep. Texas
Instruments SLUA281, 2003. Article (CrossRef Link)
[7] Y. H. Ko, Y. S. Jang, S. K. Han, and S. G. Lee,
"Non-load-balance-dependent high efficiency singleinductor
multiple-output
(SIMO)
DC-DC
converters," in Custom Integrated Circuits
Conference (CICC), 2012 IEEE, 2012, pp. 1-4.
Article (CrossRef Link)
[8] A. T. L. Lee, J. K. O. Sin, and P. C. H. Chan,
"Scalability of Quasi-Hysteretic FSM-Based Digitally
Controlled Single-Inductor Dual-String Buck LED
Driver to Multiple Strings," Power Electronics, IEEE
Transactions on, vol. 29, pp. 501-513, 2014. Article
(CrossRef Link)
[9] S. A. Hayder, K. Hongjin, K. Ji-Hun, and L. Kang-Yoon, "3.7V High Frequency DC-DC Synchronous
Boost Converter with Smooth Loop Handover," in
ISCIT 2014, Incheion, 2014. Article (CrossRef Link)
[10] S. A. Hayder, P. Hyung-Gu, L. Dong Soo, P. Young-Jun, and L. Kang-Yoon, "High Speed DC-DC
Synchronous Boost Converter Using Type-III
Compensation for Low Power Applications," in IEEE
International Conference on Consumer Electronics
(ICCE), Las Vegas, USA, 2015. Article (CrossRef
Link)
[11] L. Cao and A. P. Power, "Design Type II
Compensation In A Systematic Way." Article
(CrossRef Link)
[12] B. Hauke, "Basic calculation of a boost converter's
power stage," Texas Instruments, Application Report
November, pp. 1-9, 2009. Article (CrossRef Link)
Abbas Syed Hayder received his
Master’s degree from Hanyang
University in Electrical Electronics
Control and Instrumentation Engineering in 2010. He is now pursuing
his PhD degree from Sungkyunkwan
University, in Analog Integrated
Circuits. His research interests include
high speed SIMO (Single Inductor Multiple Output) dc-dc
Converters, PLL, VGA (Variable Gain Amplifier), Digital
circuit design, BGR (Band Gap References), and Control
techniques for dc-dc converters.
Hyung-Gu Park was born in Seoul,
Korea. He received his B.S. degree
from the Department of Electronic
Engineering at Konkuk University,
Seoul, Korea, in 2010, where he is
currently working toward a Ph.D.
degree in School of Information and
Communication Engineering, Sungkyunkwan University. His research interests include high-speed interface ICs and CMOS RF transceivers.
HongJin Kim was born in Seoul,
Korea, in 1985. He received his B.S.
degree from the Department of
Electronic Engineering at Chungju
University, Chungju, Korea, in 2010,
where he is currently working toward a
M.S. & Ph.D. degree in electronic
engineering at Sungkyunkwan University, Suwon, Korea. His research interests include
CMOS RF ICs and analog/mixed-mode IC design for low-power applications and power management ICs.
Dong-Soo Lee received his B.S. degree from the Department of
Electronic Engineering at Konkuk University, Seoul, Korea, in 2012, and
his M.S. degree in electronic engineering from Sungkyunkwan University, Suwon, Korea,
in 2014. He is currently
working toward a Ph.D. degree in
electronic engineering at Sungkyunkwan University. His
research interests include CMOS RF IC and Phase Locked
Loop, Silicon Oscillator and Sensor Interfaces design
Hamed Abbasizadeh was born in
Suq, Iran, in 1984. He received his
B.S. degree in Electronic Engineering
from Bushehr Azad University,
Bushehr, Iran in 2008 and M.S.
degree in Electronic Engineering from
Qazvin Azad University, Qazvin, Iran
in 2012, where he is currently
working toward the Ph.D. degree in School of Information
and Communication Engineering, Sungkyunkwan University.
His research interests include design of high-performance
data converters, bandgap reference voltage, low-dropout
regulators and ultra-low power CMOS RF transceiver.
Kang-Yoon Lee received the B.S.
M.S., and Ph.D. degrees in the School
of Electrical Engineering from Seoul
National University, Seoul, Korea, in
1996, 1998, and 2003, respectively.
From 2003 to 2005, he was with GCT
Semiconductor Inc., San Jose, CA,
where he was a Manager of the Analog
Division and worked on the design of CMOS frequency
synthesizer for CDMA/PCS/PDC and single-chip CMOS
RF chip sets for W-CDMA, WLAN, and PHS. From 2005
to 2011, he was with the Department of Electronics
Engineering, Konkuk University as an Associate Professor.
Since 2012, he has been with the School of Information
and Communication Engineering, Sungkyunkwan University,
where he is currently an Associate Professor. His research
interests include implementation of power integrated
circuits, CMOS RF transceiver, analog integrated circuits,
and analog/digital mixed-mode VLSI system design.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.051
51
IEIE Transactions on Smart Processing and Computing
Digital Error Correction for a 10-Bit Straightforward
SAR ADC
Behnam Samadpoor Rikan, Hamed Abbasizadeh, Sung-Han Do, Dong-Soo Lee, and Kang-Yoon Lee
College of Information and Communication Engineering, Sungkyunkwan University / Suwon, South Korea
{behnam, hamed, dsh0308, blacklds, klee}@skku.edu
* Corresponding Author: Kang-Yoon Lee
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
Abstract: This paper proposes a 10-b SAR ADC. To increase the conversion speed and reduce the
power consumption and area, redundant cycles were implemented digitally in a capacitor DAC.
The capacitor DAC algorithm was straightforward switching, which included digital error
correction steps. A prototype ADC was implemented in CMOS 0.18-μm technology. This structure
consumed 140μW and achieved 59.4-dB SNDR at 1.25MS/s under a 1.8-V supply. The figure of
merit (FOM) was 140fJ/conversion-step.
Keywords: Redundancy, Digital error correction, Straightforward, SAR ADC
1. Introduction
High-performance analog-to-digital conversion at the nanometer scale relies on high-speed switching rather than amplification. The nature of a charge
redistribution successive-approximation-register (SAR)
analog-to-digital converter (ADC) is based on high speed
switching. As the demand for low power consumption and
high precision is increasing, the prominent power
efficiency of this structure has made it popular for high
resolution analog-to-digital conversion applications.
Moreover, a survey of recent publications confirms that interest in SAR ADCs has grown rapidly.
While the low-power characteristics of the SAR ADC have broadened the applications of this structure, it suffers from low speed because a large number of decision cycles are required for each conversion. The need for high-speed structures with low power consumption has prompted many efforts to increase the speed of this block and overcome this drawback. In addition, many
techniques have been used to further decrease the power
consumption of these blocks [1].
Applying redundancy to the conversion steps is an
alternative approach for SAR ADC design. A redundant
structure has two benefits. First, redundant decisions provide built-in error resilience, in which a bounded decision error in the early steps can be absorbed into the
redundant conversion ranges of the later steps. The settling
accuracy in most MSB conversion steps of the DAC can
largely be relaxed and the conversion speed can be
expedited. Second, the redundancy allows overlapped
conversion curves in parts of the ADC transfer function,
which is instrumental to a digital domain treatment of
DAC mismatch errors. In this paper, to address the DAC settling error problem in binary-decision SAR ADCs, a digital error correction (DEC) technique is applied. The DEC technique leaves the capacitors intact and employs digital addition only [2].
The organization of this paper is as follows. Section 2
reviews the conventional structure and straightforward
algorithms for SAR ADCs. Section 3 proposes the digital
error correction technique used for the SAR ADC structure.
A 10-bit prototype SAR ADC using digital error correction
technique is implemented in section 4. Section 5 reports
the simulation results and section 6 concludes the paper.
2. Conventional vs Straightforward SAR
structures
Fig. 1 shows the block diagram of the SAR ADC.
Input is sampled and held in the S&H sub-block. The DAC is a capacitor array in which a voltage is created according to the control bits from the SAR logic. The output voltage of the
DAC is compared with the sampled input in a comparator.
The SAR logic decides the control bits according to the
comparator output in each cycle. The output of the ADC will be available after the decisions of all N bits are finished.

Fig. 1. Structure of SAR ADC.
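To make the decision loop concrete, the following is a minimal behavioral sketch in Python of an N-bit successive-approximation search. It is an illustration only, not the switching scheme of any particular DAC in this paper; the function name sar_convert and its parameters are illustrative.

def sar_convert(vin, vref, n_bits):
    # Behavioral N-bit successive approximation: one comparator decision per bit.
    code = 0
    vdac = 0.0
    for i in range(n_bits - 1, -1, -1):            # MSB first
        trial = vdac + vref / (2 ** (n_bits - i))  # tentatively add the weight of bit i
        if vin >= trial:                           # comparator decision
            code |= 1 << i                         # keep the bit
            vdac = trial                           # DAC keeps the added weight
    return code

print(sar_convert(0.6, 1.0, 10))  # ~614 for VIN = 0.6*VREF with a 10-bit search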
Fig. 2 summarizes the structure and switching sequence
of a 3 bit dual sampling SAR ADC as an example. Initially,
the input is sampled in C1, and CDAC is sampled to
VREFT-VIN. At the hold step, the MSB is decided and
according to the MSB bit, the SAR logic decides how to
redistribute the charge in the next step.
If the MSB is 1, (1/2)VREFT is added to the CDAC
voltage by connecting bottom plate of one half of the unit
capacitors to VREFT. In this step (3/2)VREFT-VIN is
compared with the VIN and the second bit is decided. In
the case that MSB is 0, the CDAC voltage again boosts to
1.5VREFT-VIN, and the bottom plate of the sample
capacitor (C1) is connected to the VREFT, which causes
an increase in the input voltage at the positive input of the
comparator to the VREFT+VIN.
For the third bit decision, the voltage in C1 will be
unchanged and if the output of the comparator is 1, one
more capacitor in the cap-bank will be connected to the
VREFT, and if the output of the comparator is 0, one of
the capacitors, in which bottom plate is connected to the
VREFT, will go back to the VREFB. This structure is
called dual sampling because the analog signals are
sampled and held asymmetrically at each input side. In this
structure, the MSB bit is decided with no switching so the
power consumption of this structure will be lower than the
conventional SAR ADCs. In addition, the number of unit
capacitors in CDAC has been decreased to half compared
to conventional structures. Here VREFT and VREFB are a
substitution of VDD and VSS, respectively, because for
some error correction reasons, the exact VDD and VSS
values cannot be used. Therefore, VREFT, which is
slightly lower than VDD, and VREFB, which is slightly
higher than VSS, were used.
The reason that this structure is reviewed is to give the
reader some insight into SAR ADC and switching
sequence in a capacitor bank. Moreover, in most SAR
logics the switching algorithm is to some extent similar to
this structure. For more detailed information, the reader can refer to [3].
For each cycle, we need to charge or discharge some of
the capacitors from VREFT to VREFB and vice versa,
which requires a long time and consumes large amounts of
energy. Moreover, according to these simulations, to
achieve a high performance structure we need to increase
the unit capacitor value. In addition, the amount of C1
needs to be several pF. When small values are used for this
capacitor, the performance of the SAR ADC degrades.
The structure used for the proposed ADC is called
straightforward DAC switching. This reduces the wasted
power and increases the speed of switching. Note that a
similar concept has been reported [4-6].
The positive input of the comparator is connected to VCM, which is created in the reference generator and does not need to be sampled. The other input is connected to the CDAC.
Fig. 2. 3 bit dual sampling switching sequence.

Fig. 3. 3 bit straightforward switching sequence.

Similar to the dual sampling structure, Fig. 3 summarizes the detailed switching sequence for the 3 bit straightforward SAR ADC. At the sampling mode, the output of the CDAC is charged to VCM-VIN and in hold
mode it goes to 2VCM-VIN and can be compared with the
VCM. According to the decision of the comparator, the
bottom plate of 2C will connect to VREFT or VREFB at
the next cycle. This will continue in the next steps as well.
In each step, the switching is performed from VCM to
VREFT or VREFB. Therefore, the amount of charge or
discharge voltage decreases to half. Consequently, the
amount of energy will be a quarter of the previous case.
Moreover, the speed of the switching can be increased.
Therefore, this structure is more energy efficient than the
conventional structures and has a higher speed.
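As a rough check of this factor-of-four claim (a sketch of the scaling argument only, assuming the energy drawn per switching event scales with the square of the voltage step ΔV across the switched capacitance C):

E_{sw} \propto C\,(\Delta V)^{2} \quad\Rightarrow\quad C\left(\frac{\Delta V}{2}\right)^{2} = \frac{1}{4}\,C\,(\Delta V)^{2}

Halving the step, by switching from VCM instead of between VREFT and VREFB, therefore reduces the per-step switching energy to roughly one quarter.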
The bottom-plate sampling-based straightforward
structure can be used for higher resolution as it provides
better linearity in principle [7].
3. Digital error correction technique using redundant cycles
In high resolution ADCs, the difference between the
coarse capacitor values and fine capacitor values normally
causes decision errors in MSB bits. In particular, when the
speed increases, the DAC is not settled properly in the
decision phases for MSBs. This will cause error at MSB
codes, which cannot be recovered in the following cycles.
To solve this problem, some redundant cycles are normally provided. Redundancy can be applied by adding some extra capacitors to the hardware and allocating some redundant decision cycles. In this work, digital error correction is used to solve this problem; it does not require any extra hardware, and the CDAC structure remains intact. The concept is explained for a 4 bit example, which has 4 output bits and 5 decision cycles.
Fig. 4. DAC settling error correction concept using redundant cycle (C3).
As implemented in Fig. 4, the decision procedure in the
SAR ADC was divided into two steps. The first step
decides the 2 coarse bits and the second step is a decision
of the 3 fine bits. One bit of redundancy was provided. Coarse decision thresholds were at 6/16 and 10/16. The decision levels were shifted by 0.5 LSB of the coarse
resolution here. Therefore, the MSB decision with the
DAC level is started at 5/8 VREF instead of 1/2 VREF. In
the example, the SAR ADC has a slow VDAC settling,
54
Rikan et al.: Digital Error Correction for a 10-Bit Straightforward SAR ADC
VCM
VCM
+
VX=VCM‐VIN
‐
4c
2c
c
11
1
C(VCM)
2
4c
VCM
+
VX=2VCM‐VIN+VCM/2+VCM/4
4c
C(VCM)
2c
VREFT VREFT
2
c
c
+
4c
2c
VCM VREFT
c
c
VREFT
c
c
VCM
VCM
10
+
VCM
VCM
C(VCM)
YES
VCM
VCM > VX ?
‐
2c
NO
2
VCM > VX ?
‐
YES
c
VIN
VX=2VCM‐VIN+VCM/4
10
+
VCM
VCM > VX ?
‐
4c
VREFT
2c
c
c
VCM
VCM
VCM > VX ?
‐
0
01
VCM
C(VCM)
C(VCM)
2
YES
VCM
4c
2c
VREFB VREFT
c
+
c
2c
2c
2c
c
c
VCM > VX ?
VCM
VREFB VREFT VCM
VCM > VX ?
‐
00
+
VCM
VCM
C(VCM)
2
NO
Sample mode (cycle 0)
Hold mode (cycle 1)
‐
2
NO
VX=2VCM‐VIN‐VCM/2+VCM/4
+
VCM
4c
VREFB
Redistribution mode (cycle 2)
2c
VCM
c
c
‐
VCM > VX ?
VCM
Redundancy cycle (cycle 3)
Fig. 5. Switching sequence with the error correction technique for 2 coarse and 1 redundancy bits in a 4 bit
structure.
which results in decision error. For the given VIN, 00 is
created for two MSB bits, which is not correct. The
redundancy will be implemented by one additional
decision step. To determine the three fine bits, the VDAC decision level is shifted back down by 0.5 LSB to its regular level, undoing the shift applied at the beginning to determine the coarse bits. A redundant
decision is then implemented. With two more decisions
after this step, 3 LSB bits are determined. By adding two
coarse bits and three fine bits with one bit overlap, the
coarse bit error can be eliminated. As long as the error
amount is less than 0.5LSB of the coarse resolution, the
error can be eliminated. This will relax the DAC settling
and the sampling decision speed can be increased, despite
there being more conversion steps. A similar concept was
implemented in [2, 7].
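The correction itself is a plain addition with a one-bit overlap. A minimal Python sketch of the 4 bit example (2 coarse bits plus 3 fine bits; the helper combine_with_overlap is illustrative, not the authors' logic):

def combine_with_overlap(coarse, fine, fine_bits=3):
    # Digital error correction: add coarse and fine codes with a one-bit overlap,
    # so the MSB of the fine code can absorb a bounded coarse decision error.
    return (coarse << (fine_bits - 1)) + fine

# Coarse bits wrongly decided as 00 instead of 01; fine bits 110 still give 0110.
print(format(combine_with_overlap(0b00, 0b110), "04b"))  # -> 0110
print(format(combine_with_overlap(0b01, 0b010), "04b"))  # -> 0110 (error-free path)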
Fig. 5 presents the switching sequence for 2 coarse and
1 redundancy decisions. The switching sequence for 2 fine
bits will be the same as the straightforward structure as
explained before. This structure is the same as the
straightforward structure and the redundancy has been
implemented by only changing the decision levels.
The capacitor DAC has been virtually segmented into two sub-DACs. One of these sub-DACs decides the 2
coarse bits and the other one decides the 3 fine bits. After
the input signal is sampled, except for the largest capacitor
of second sub-DAC, all other capacitors are connected to
VCM. 2C is connected to VREFT and implements the
threshold offset (V-SHIFT) for digital error correction
purposes. The MSB is decided in this step. According to
the MSB bit value, the 4C capacitor is connected to
VREFT or VREFB and the MSB-1 will be decided. As
shown in Fig. 5, the CDAC switching operation for the
redundant decision phase will be different for each case. In
this step, for a further bit decision, the VDAC should be located at the center of the determined input range for the digital error correction algorithm. Moreover, 2C should go
back to VCM for the next decision steps of the following
LSB bits. In the case that MSB-1 bit is 0, these two
conditions are satisfied by switching 2C back to the VCM.
As shown in Fig. 6(a), by switching 2C to VCM, the
determined input range will be double, and the VDAC will
be located at the center of this range. The redundancy bit
and 2 other fine bits will then be decided. The final bits
can be obtained by adding 2MSB and 3LSB bits with a one
bit overlap. As shown in Fig. 6(a), the error in MSB-1 bit
has been corrected in the redundancy decision step safely.
In the case that two MSB bits are 01 (Fig. 6(b)), if 2C
is switched to the VCM, the VDAC will not be in the
Fig. 6. Determined input ranges for different MSB and MSB-1 bits in a 4 bit digital error corrected SAR ADC: (a) MSB, MSB-1 = 00, (b) MSB, MSB-1 = 01, (c) MSB, MSB-1 = 11.

Fig. 7. 10-bit error corrected SAR ADC structure and timing diagram.

center of the determined input range (it will be at the
bottom of this range). On the other hand, for a further bit
decision, 2C should return to the VCM. To solve this
problem and satisfy the abovementioned conditions, as the
required amount of VDAC change is double that of the
returning 2C from VREFT to VCM, 4C was divided into
two 2C capacitors and one of them was switched from
VREFB to VREFT. Moreover, 2C in the lower sub-DAC
was switched back to VCM. This moved the VDAC up to
the desired level. The remainder of the decisions were
identical to the previous case.
Finally, the last case is where all the MSB bits are 1. In
this case, the MSB capacitor was connected to VREFT, so the
same procedure as the previous case cannot be followed.
On the other hand, the MSB-1 bit can simply be changed
to 0 and a similar operation can be performed, as done for
the case when this bit was 0. The details for this case can
be found in Fig. 6(c). For similar concepts, the reader can
refer to [8].
In [8], during the redundancy cycle, if all the MSB bits
of higher sub-DAC become 1, a problem occurs. In this
case, in the redundant cycle, the MSB capacitor of the
lower sub-DAC that is connected to VREFT should switch
back to the VCM and at the same time, a capacitor with the
same size in the higher sub-DAC that is connected to the
VREFB should be found and switched to the VREFT. On
the other hand, this is impossible because all the bits that
have been decided by higher sub-DAC are 1, so all the
capacitors of this sub-DAC are already connected to the
VREFT. The proposed switching sequence of this paper
solved this problem, as verified before. As shown in Fig. 5,
the switching sequence for each case has been illustrated
clearly and all the cases have been considered.
Fig. 8. Comparator structure.
4. 10-bit prototype SAR ADC using digital
error correction
The 10-bit SAR structure using the explained error
correction technique has been implemented in Fig. 7. The
capacitor weights are intact compared to the
straightforward structure. The CDAC has been segmented
into three sub-DACs virtually. Each sub-DAC creates 4 bits.
CDAC1, CDAC2 and CDAC3 create D11-D8, D7-D4 and
D3-D0, respectively, in which D7 and D3 are decided in
the redundant cycles. The largest capacitor in CDAC2 is
32C (28C+4C), so each capacitor of CDAC1 is broken into
two capacitors, of which one of them is 32C. This is for
the purpose of redundancy in the case that D8 is 1 and
D11D10D9D8 bits are not 1111 (as explained in the
previous section). Therefore, for this case, one of the 32C
capacitors that is connected to the VREFB can be found
and switched to the VREFT in the redundant cycle. For a
similar reason, the capacitors of CDAC2 were broken into
two parts, of which one of them was 4C as the largest
capacitor in CDAC3 is 4C.
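Assuming the three 4-bit raw segments are merged exactly as in the 4 bit example, that is, by two addition steps with one bit of overlap each (with D7 and D3 as the redundant decisions), the 10-bit output could be assembled as sketched below in Python. This is an illustrative reading of the structure, not a transcription of the actual SAR logic.

def assemble_10bit(d11_d8, d7_d4, d3_d0):
    # Merge the three 4-bit raw segments with one bit of overlap at each
    # boundary (D7 and D3 are the redundant decisions): 4 + 3 + 3 = 10 bits.
    partial = (d11_d8 << 3) + d7_d4
    return (partial << 3) + d3_d0

print(format(assemble_10bit(0b1010, 0b0110, 0b0011), "010b"))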
The reader might be concerned about the number of
switches. Note that the number of switches has been
increased but size of these switches has been decreased
proportionally. Moreover, the unit capacitor size used in
this structure is 29fF, which is the minimum capacitor size
limited by the process. As the capacitor size is reduced, the
speed can be increased and the power consumption and
area can be decreased. Fig. 8 presents the comparator
structure.
5. Simulation Results
Simulation results for the proposed ADC structure are presented in this section. Finesim was used for the
simulations. The process was 0.18-μm CMOS technology.
Fig. 9. Ramp Simulation Result.

Fig. 10. DC Analysis Results.

Fig. 11. Sine Simulation Result.
Fig. 9 shows the ramp simulation result for the proposed
ADC. The sampling rate was 1.25 MS/s. The
number of samples for this simulation was 1024.
Fig. 10 shows the DC analysis results for this ADC.
The results were extracted from the ramp simulation. DNL
max and DNL min were 0.998 and -1.0 LSB, respectively.
INL max and INL min were 0.562 and -0.719 LSB,
respectively, and Sigma DNL and Sigma INL for this
structure were 0.155 and 0.294 LSB, respectively.
Fig. 11 presents a sine wave simulation. The input
frequency for this simulation was 1.22KHz and the clock
sampling was 1.25MHz. Fig. 12 presents the
implementation of the FFT of the output. The input was a
62 KHz sine wave sampled at 1.25MHz. For this input
frequency, the effective number of bits (ENOB) was 9.4.
The SNDR and SNR were 58.3 and 60.2dB, respectively,
and the amount of SFDR was 63.4dB. This ADC
consumed approximately 80μA from 1.8V source.
Therefore, the power consumption of this structure was approximately 140μW. Fig. 13 illustrates the SNDR and SFDR for different input frequencies from DC to the Nyquist rate.

FOM = Power / (2^ENOB × min{2 × ERBW, fs})    (1)

According to (1), the figure of merit (FOM) was calculated to be 140 fJ/conversion-step. Finally, Table 1 summarizes and compares this work with recent studies on SAR ADCs.

Fig. 12. Spectrum of the output samples for the 62 KHz input at 1.25MHz (SFDR = 63.4dB, SNDR = 58.3dB, ENOB = 9.4).

Fig. 13. SNDR and SFDR for the different input ranges from DC to Nyquist rate.

Table 1. Summary and comparison results.
Parameters           | [9]  | [10]  | This work
Process (nm)         | 350  | 180   | 180
Resolution (bit)     | 10   | 10    | 10
Sampling Rate (MS/s) | 2    | 0.137 | 1.25
Supply voltage (V)   | 3.3  | 1.5   | 1.8
ENOB (peak)          | 8    | 8.65  | 9.57
SNDR peak (dB)       | -    | 53.8  | 59.4
SFDR peak (dB)       | -    | -     | 65.2
Sigma DNL (LSB)      | 0.5  | 0.56  | 0.155
Sigma INL (LSB)      | 0.5  | 0.38  | 0.294
Power (mW)           | 3    | 13.4  | 0.14
FOM (fJ/Conv-step)   | -    | 243   | 140

6. Conclusion

This paper proposed a 10 bit SAR ADC. A digital error correction technique was used with the straightforward structure. The ADC was implemented in CMOS 0.18-μm technology. The power consumption of this structure was 140μW under a 1.8-V supply. The structure achieved 59.4-dB SNDR, 61.3-dB SNR and 65.2-dB SFDR (peak values) at a 1.25MS/s conversion speed. The unit capacitor in the CDAC structure was 29fF. At this conversion speed, for all input frequencies from DC to the Nyquist rate, the ENOB was above 9 bits. The figure of merit (FOM) was 140fJ/conversion-step.

Acknowledgement

This study was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2013R1A1A2010114).
This work was supported by IDEC (IPC, EDA Tool, MPW).

References
[1] W. Liu, P. Huang and Y. Chiu, "A 12-bit, 45-MS/s, 3-mW Redundant Successive-Approximation-Register Analog-to-Digital Converter with Digital Calibration," IEEE Journal of Solid-State Circuits, VOL. 46, NO. 11, Nov. 2011. Article (CrossRef Link)
[2] S-H. Cho, C-K. Lee, J-K. Kwon and S-T. Ryu, “A
550-uW 10-b 40-MS/s SAR ADC with Multistep
Addition-Only Digital Error Correction,” IEEE
Journal of Solid-State Circuits, VOL. 46, NO. 8, Aug.
2011. Article (CrossRef Link)
[3] B. Kim, L. Yan, J. Yoo, N. Cho and H-J Yoo, “An
Energy-Efficient Dual Sampling SAR ADC with
Reduced Capacitive DAC,” Circuits and Systems,
ISCAS. IEEE International Symposium, 2009. Article
(CrossRef Link)
[4] V. Hariprasath, J. Guerber, S-H. Lee and U-K. Moon,
“Merged capacitor switching based SAR ADC with
highest switching energy-efficiency," Electronics Letters, 2010. Article (CrossRef Link)
[5] Y. Chen, S. Tsukamoto and T. Kuroda, “A 9b
100MS/s 1.46mW SAR ADC in 65nm CMOS,”
IEEE Asian Solid-State Circuits Conference pp. 1618, Nov. 2009. Article (CrossRef Link)
[6] Y. Zhu, C-H. Chan, U-F. Chio, S-W. Sin, S-Pan U, R.
P. Martins and F. Maloberti, “A 10-bit 100-MS/s
Reference-Free SAR ADC in 90 nm CMOS,” IEEE
Journal of Solid-State Circuits, VOL. 45, NO. 6, Jun.
2010. Article (CrossRef Link)
[7] P. W. LI, M. J. Chin, P. R. Gray and R. Castello, “A
Ratio-Independent Algorithmic Analog-to-Digital
Conversion Technique,” IEEE Journal of Solid-State
Circuits, VOL. SC-19, NO.6, Dec. 1984. Article
(CrossRef Link)
[8] S-H. Cho, C-K. Lee, J-K. Kwon, and S-T. Ryu, “A
550μW 10b 40MS/s SAR ADC with Multistep
Addition-only Digital Error Correction,” Custom
Integrated Circuits Conference (CICC), 2010. Article
(CrossRef Link)
[9] C. Jun, R. Feng and Xu M. H., “IC Design of 2Ms/s
10-bit SAR ADC with Low Power,” High Density
Packaging and Microsystem Integration, 2007.
Article (CrossRef Link)
[10] H-K. Kim, Y-J. Min, Y-H. Kim and S-W. Kim, “A
Low Power Consumption 10-bit Rail-to-Rail SAR
ADC Using a C-2C Capacitor Array,” Electron
Devices and Solid-State Circuits (EDSSC), IEEE
International Conference on, 2008. Article (CrossRef
Link)
Behnam Samadpoor Rikan was born
in Urmia, Iran, in 1983. He received
his B.S. degree in Electronic Engineering from Urmia University, Urmia,
Iran in 2008 and M.S. degree in
Electronic Engineering from Qazvin
Azad University, Qazvin, Iran in 2013,
where he is currently working toward a
Ph.D. degree in School of Information and Communication
Engineering, Sungkyunkwan University, Suwon, Korea.
His research interests include the design of high-performance data converters, bandgap reference voltage,
low-dropout regulators and ultra-low power CMOS RF
transceivers.
Hamed Abbasizadeh was born in Suq,
Iran, in 1984. He received his B.S.
degree in Electronic Engineering from
Bushehr Azad University, Bushehr,
Iran in 2008 and M.S. degree in
Electronic Engineering from Qazvin
Azad University, Qazvin, Iran in 2012,
where he is currently working toward a
Ph.D. degree in the School of Information and
Communication Engineering, Sungkyunkwan University.
His research interests include the design of high-performance data converters, bandgap reference voltage,
low-dropout regulators and ultra-low power CMOS RF
transceiver.
Sung-Han Do was born in Jinju,
Korea, in 1988. He received his B.S.
degree from the Department of
Semiconductor Systems Engineering at
Sungkyunkwan University, Suwon,
Korea, in 2014. He is currently
working toward a M.S. degree in
Semiconductor Display Engineering at
Sungkyunkwan University. His research interests include
Analog to Digital Converter and Sensor Interfaces design.
Dong-Soo Lee was born in Korea, in 1987. He received his B.S. degree from the Department of Electronic Engineering at Konkuk University, Seoul, Korea, in 2012, and his M.S. degree in electronic engineering from Sungkyunkwan University, Suwon, Korea, in 2014. He is currently
working toward a Ph.D. degree in electronic engineering at
Sungkyunkwan University. His research interests include
CMOS RF IC and Phase Locked Loop, Silicon Oscillator,
and Sensor Interfaces design.
Kang-Yoon Lee was born in Jeongup,
Korea, in 1972. He received his B.S.,
M.S. and Ph.D. degrees in the School
of Electrical Engineering from Seoul
National University, Seoul, Korea, in
1996, 1998, and 2003, respectively.
From 2003 to 2005, he was with GCT
Semiconductor Inc., San Jose, CA,
where he was a Manager of the Analog Division and
worked on the design of CMOS frequency synthesizer for
CDMA/PCS/PDC and single-chip CMOS RF chip sets for
W-CDMA, WLAN, and PHS. From 2005 to 2011, he was
with the Department of Electronics Engineering, Konkuk
University as an Associate Professor. Since 2012, he has
been with School of Information and Communication
Engineering, Sungkyunkwan University, where he is
currently an Associate Professor. His research interests
include implementation of power integrated circuits,
CMOS RF transceiver, analog integrated circuits, and
analog/digital mixed-mode VLSI system.
IEIE Transactions on Smart Processing and Computing, vol. 4, no. 1, February 2015
http://dx.doi.org/10.5573/IEIESPC.2015.4.1.059
59
IEIE Transactions on Smart Processing and Computing
Design of 10-bit 10MS/s Time-Interleaved Flash-SAR ADC
Using Sharable MDAC
Sung-Han Do, Seong-Jin Oh, Dong-Hyeon Seo, Juri Lee, and Kang-Yoon Lee
Sungkyunkwan University, Suwon, South Korea
{dsh0308, geniejazz, dhblackbox, juri, klee}@skku.edu
* Corresponding Author:
Received April 5, 2014; Revised June 13, 2014; Accepted November 19, 2014; Published February 28, 2015
* Regular Paper
Abstract: This paper presents a 10-bit 10 MS/s Time-Interleaved Flash-SAR ADC with a shared
Multiplying DAC. Using shared MDAC, the total capacitance in the SAR ADC decreased by
93.75%. The proposed ADC consumed 2.28mW under a 1.2V supply and achieved 9.679 bit
ENOB performance. The ADC was implemented in 0.13μm CMOS technology. The chip area was
760 × 280 μm2.
Keywords: Time-Interleaved, Flash-SAR ADC, Multiplying DAC
1. Introduction
Among the many types of ADCs (Analog to Digital
Converters), the SAR ADC is used widely because of its
advantages of low power consumption and simplicity.
However, an exponential increase in the capacitance with increasing resolution restricts its use in applications requiring high conversion speeds.
On the other hand, the flash ADC, which has the highest conversion speed among ADC types, suffers from large power consumption and chip area. Therefore, the flash ADC is generally limited to resolutions below 8 bits.
Pipeline ADC can be a good alternative for achieving a fast conversion speed in a small area. However, because the
op-amp based MDAC should be used at each stage, it
requires a relatively large current.
Therefore, there has been some research on Time-Interleaved ADCs, such as the Flash-SAR ADC [1, 2] and the pipelined-SAR ADC [3].
This paper presents a 10-bit 10MS/s Time-Interleaved
Flash-SAR ADC architecture with the fast speed of a flash
ADC and the low power of a SAR ADC. In addition, a
Sharable Multiplying-DAC was added to remove the
unnecessary increase in capacitance.
The remainder of this paper is organized as follows.
Section 2 presents the architecture of the proposed Time-Interleaved Flash-SAR ADC. Section 3 presents a detailed
description of the proposed building blocks. Section 4
shows the experimental results. The conclusions are
reported in Section 5.
2. Time-Interleaved Flash-SAR ADC
Architecture
Fig. 1(a) shows a conceptual block diagram of the Flash-SAR ADC. In the Flash-SAR ADC architecture, the front-end flash ADC receives an analog input signal (VIN) and
decides the MSBs. Using these bits, MDAC conducts a
residue process to make the input of the back-end SAR
ADC. Finally, the remaining LSBs are resolved in the SAR ADC.
Fig. 1(b) shows the operation of the ADC. Note that
because MDACOUT is amplified by a factor of 4 (in this
example), the resolution of back-end SAR ADC can be
decreased.
As shown in Fig. 1(c), the flash ADC uses only 1 clock
cycle for its MSBs conversion and the SAR ADC can start
its operation. Therefore, if only 1 channel SAR ADC is
used, the front-end flash ADC does not work during the
SAR ADC’s successive process period.
Using a multi-channel SAR ADC with the time-interleaving method, the idle time of the flash ADC can be
removed and at every clock cycle, a new analog input
(VIN) can be processed in the flash ADC.
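A toy Python schedule illustrating the idea (purely illustrative; the channel count and per-conversion cycle count follow the 6-channel, 6-bit SAR arrangement described later, and the exact pipeline timing of Fig. 3 may differ):

NUM_CHANNELS = 6   # back-end SAR channels
SAR_CYCLES = 6     # clock cycles each SAR channel needs per conversion

# Round-robin assignment: the flash ADC takes a new sample every cycle, and the
# successive-approximation search for that sample runs on the next channel.
for sample in range(12):
    channel = sample % NUM_CHANNELS
    start = sample + 1                      # SAR starts right after the flash cycle
    print(f"sample {sample:2d}: flash @ cycle {sample:2d}, "
          f"SAR ch{channel} busy cycles {start}-{start + SAR_CYCLES - 1}")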
Fig. 1. (a) Conceptual Block Diagram of the Flash-SAR ADC (5-bit example), (b) Operation of the Flash-SAR ADC, (c) Timing diagram of Flash-SAR ADC.

Fig. 2. (a) Block diagram of the Time-Interleaved Flash-SAR ADC architecture using a thermometer CDAC, (b) Block diagram of the proposed Time-Interleaved Flash-SAR ADC.
On the other hand, using a multi-channel SAR ADC makes the architecture too bulky. Fig. 2(a) shows the 10-bit Time-Interleaved Flash-SAR ADC that combines a 4-bit flash ADC with a 6-channel 10-bit SAR ADC using thermometer code.
Although this architecture has the advantage of a faster conversion speed than the conventional SAR ADC, the total capacitance, which dominates the area of the SAR ADC, increases 6 times compared to the conventional 10-bit SAR ADC.
On the other hand, the proposed 10-bit Time-Interleaved Flash-SAR ADC has a sharable MDAC between the 4-bit Flash ADC and the 6-bit SAR ADC. By implementing the sharable MDAC, the 2^4-1 level thermometer CDAC in the fine SAR ADC can be removed, so that the size of the ADC can be decreased dramatically.
Table 1 lists the comparison results of the total capacitance of each ADC type. Compared to the conventional 10-bit Flash-SAR ADC using the thermometer CDAC [1, 2], the proposed Flash-SAR ADC in this paper can reduce the total capacitance by 2^4 (16) times, which leads directly to the reduction of the total size of the circuit.

Table 1. Total Capacitance of each ADC Type (@ 10 bit).
Type of ADC | Total Capacitance
10 Bit SAR ADC | 2^10 C
10 Bit Flash SAR ADC (4b flash + 6*10b SAR, Thermometer CDAC) [1, 2] | 2^10 C × 6
10 Bit Flash SAR ADC (4b flash + 6*6b SAR, Proposed) | 2^6 C × 6
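A quick arithmetic check of the figures in Table 1 against the 93.75% reduction quoted in the abstract (unit capacitor C normalized to 1; a sanity check only):

total_conventional_ti = (2 ** 10) * 6   # 6 channels of full 10-bit CDACs
total_proposed        = (2 ** 6) * 6    # 6 channels of 6-bit CDACs (MDAC shared)

print(total_conventional_ti / total_proposed)        # 16.0  (= 2^4 reduction)
print(f"{1 - (2 ** 6) / (2 ** 10):.2%}")             # 93.75% per-channel saving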
Moreover, to minimize the area and current
consumption of the SAR ADC in each channel, a Dual-Sampling SAR ADC structure was adopted and Adaptive
Power Control (APC) mode was added to the comparator
of the SAR ADC.
Fig. 3 shows the timing diagram of the proposed 10-bit
10MS/s Time-Interleaved Flash-SAR ADC.
At a sample rate of 10MS/s, the duties of the Clock and
Start of Conversion (SOC) signal were 50% and 25%,
respectively, and the sampling frequency was 10MHz.
When the SOC signal was high, the Sample & Hold circuit
in the ADC samples the analog input signal VIN. On the
other hand, when the SOC signal is low, the Sample &
Hold circuit maintains the sampled signal VIN, and the
Flash ADC and MDAC process the sampled signal VIN.
Fig. 3. Timing diagram of the proposed 10 bit 10MS/s Time Interleaved Flash-SAR ADC.
The front-end flash ADC decides 4-bit MSBs then
these bits control the MDAC for the residue process. After
the residue process, the back-end 6-bit SAR ADC can use
the output voltage of the MDAC as the input voltage. Note
that when the SAR ADC starts its MSB decision process
using the previous one, the flash ADC samples the new
input voltage simultaneously.
In this architecture, a 7 clock cycle latency exists.
Therefore, the output code of the flash ADC should maintain its value for 6 clock cycles and then be merged with the back-end SAR ADC output code. Therefore, flip-flops are designed as an output
merge block. These flip-flops use the EOC (end of
conversion) signal of each channel for triggering.
Fig. 4. Block Diagram of 6-bit SAR ADC.
3. Building Blocks
To reduce the chip area, the dual-sampling SAR ADC
structure was implemented [4]. Fig. 4 shows a block
diagram of the implemented SAR ADC. In general, a 6-bit SAR ADC requires 2^6 C for its binary searching algorithm.
On the other hand, in this architecture, using a single
sampling capacitor on the opposite side of CDAC, the total
capacitance of SAR ADC is reduced to almost a half.
Therefore, the area of 6-channel SAR ADC can be
decreased.
In ADC, a comparator is one of the most important
blocks because it greatly affects the accuracy and power
consumption of ADC. Fig. 5 presents a schematic diagram
of the comparator [5]. This is a combined structure of a
preamplifier stage and dynamic latch stage. The use of a
pre-amplifier at the input stage can reduce the effect of
kickback noise. In addition, using the Adaptive Power
Control (APC) function, the unnecessary power
consumption can be reduced.
Fig. 5. Schematic of the Comparator in SAR ADC.
In this paper, a sharable MDAC was implemented to
reduce the total capacitance in SAR ADC.
Fig. 6. (a) Residue Process in the MDAC, (b) Schematic diagram of the MDAC.

Fig. 7. Layout of the proposed architecture.
Fig. 6(a) describes the residue process in MDAC. After
flash ADC decides the 4-bit MSB codes, these codes can
be used in MDAC. Using these MSBs, the coefficient k is
decided from 0 to 15/16, and calculated with an analog
input VIN. Finally, by multiplying the residue value by 2^4, the
output of the MDAC is generated as follows:
MDACOUT = 2^4 × (VIN − k·VREF)    (1)
After the MDACOUT is made, it is delivered to the input
of the 6 channel SAR ADC and used to determine the
remaining LSB 6-bits.
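A behavioral Python sketch of this residue step, assuming the residue is formed against k·VREF as in Eq. (1) (the function name, the vref parameter, and the numerical example are illustrative):

def mdac_residue(vin, flash_code, vref=1.0, flash_bits=4):
    # Amplified residue of the coarse flash decision: 2^4 * (VIN - k*VREF).
    k = flash_code / (2 ** flash_bits)   # k runs from 0 to 15/16
    return (2 ** flash_bits) * (vin - k * vref)

# VIN = 0.63*VREF, flash code 0b1010 (k = 10/16): residue 0.005 amplified to 0.08,
# which the back-end 6-bit SAR ADC then resolves.
print(mdac_residue(0.63, 0b1010))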
4. Experimental Results

The proposed ADC, designed in a 0.13 μm CMOS process, occupies 760 × 280 μm2 (Fig. 7). The sizes of the flash ADC, MDAC, and SAR ADC are 170 × 280 μm2, 180 × 280 μm2, and 370 × 280 μm2, respectively.
Fig. 8(a) shows the simulation results of the differential nonlinearity (DNL) and integral nonlinearity (INL). The peak DNL and INL were -1.000/0.998 LSB and -0.955/0.937 LSB, respectively. Fig. 8(b) shows the FFT spectrum at a 10-MS/s sampling rate. The resulting SNDR is 60.025 dB and the ENOB is 9.679 bits. At a 1.2 V supply voltage, the proposed ADC consumes 2.28 mW. The figure of merit of the ADC was calculated using the following equation:

FOM = Power / (2^ENOB × fs)    (2)

The FOM of the proposed Time-Interleaved Flash-SAR ADC is 278.14 fJ/conversion-step.

Fig. 8. (a) INL/DNL Results, (b) SNDR / ENOB Results.

Table 2. Performance Summary.
Process        | 0.13 um
Supply         | 1.2 V
Resolution     | 10 bits
Sampling rate  | 10 MS/s
Power          | 2.28 mW
DNL            | -1.000 ~ 0.998 LSB
INL            | -0.955 ~ 0.937 LSB
SNDR           | 60.025 dB
ENOB           | 9.679 bits
Core area      | 0.21 mm2

5. Conclusion

A Time-Interleaved Flash-SAR ADC using a sharable MDAC was presented. The proposed ADC architecture achieves a faster conversion speed than the conventional SAR ADC and a smaller total capacitance than other Time-Interleaved ADC architectures.
The proposed ADC achieves a 10 MS/s operation speed with a power consumption of 2.28 mW. In addition, it has a FOM of 278.14 fJ/conversion-step and occupies an active area of only 0.21 mm2.
Acknowledgement
This research was supported by Basic Science
Research Program through the National Research
Foundation of Korea (NRF) funded by the Ministry of
Education (NRF-2013R1A1A2010114).
This work was supported by IDEC (IPC, EDA Tool,
MPW).
References
[1] Ba Ro Saim Sung, Sang-Hyun Cho, Chang-Kyo Lee,
Jong-In Kim, and Seung-Tak Ryu “A Time-Interleaved
Flash-SAR Architecture for High Speed A/D
Conversion.” ISCAS 2009, Article (CrossRef Link)
[2] Ba Ro Saim Sung, Chang-Kyo Lee, Wan Kim, Jong-In
Kim, Hyeok-Ki Hong, Ghil-geun Oh, Choong-Hoon
Lee, Michael Choi, Ho-Jin Park, and Seung-Tak Ryu “A
6 bit 2 GS/s Flash-Assisted Time-Interleaved (FATI)
SAR ADC with Background Offset Calibration.”
ASSCC 2013, Article (CrossRef Link)
[3] Lu Sun, Yuxiao Lu and Tingting Mo “A 300MHz
10b Time-Interleaved Pipelined-SAR ADC.”,
CARFIC 2013, Article (CrossRef Link)
[4] Binhee Kim, Long Yan, Jerald Yoo, Namjun Cho,
and Hoi-Jun Yoo “An Energy-Efficient Dual
Sampling SAR ADC with Reduced Capacitive
DAC.” ISCAS 2009, Article (CrossRef Link)
[5] Song Lan, Chao Yuan, Yvonne Y.H.Lam and Liter
Siek “An Ultra Low-Power Rail-to-Rail Comparator
for ADC Designs.”, MWSCAS 2011, Article
(CrossRef Link)
Sung-Han Do was born in Jinju,
Korea, in 1988. He received his B.S.
degree from the Department of
Semiconductor Systems Engineering at
Sungkyunkwan University, Suwon,
Korea, in 2014. He is currently
working toward a M.S. degree in
Semiconductor Display Engineering at
Sungkyunkwan University. His research interests include
Analog to Digital Converters and Sensor Interfaces designs.
Seong-Jin Oh was born in Seoul,
Korea. He received his B.S. degree
from the Department of Electronic
Engineering at Sungkyunkwan University, Suwon, Korea, in 2014, where he is currently working toward a combined Ph.D. and M.S. degree in the School of Information and Communication Engineering. His research interests include CMOS
RF transceiver, Phase Locked Loop, Push-Push Voltage
Controlled Oscillator.
Dong-Hyeon Seo was born in Incheon,
Korea, in 1988. He received his B.S.
degree from the Department of
Electronics Engineering at Incheon
National University, Incheon, Korea,
in 2014. He is currently working
toward a M.S. degree in Electrical and
Computer Engineering at Sungkyunkwan University. His research interests include Analog to
Digital Converter and Wireless Power Transfer.
Juri Lee received her B.S. degree
from the Department of Electronic
Engineering at Konkuk University,
Seoul, Korea, in 2013, where she is
currently working toward a combined
Ph.D. & M.S degree in the School of
Information and Communication Engineering, Sungkyunkwan University.
Her research interests include VCSEL drivers and CMOS
RF transceivers.
Kang-Yoon Lee was born in Jeongup,
Korea, in 1972. He received his B.S.,
M.S. and Ph.D. degrees in the School
of Electrical Engineering from Seoul
National University, Seoul, Korea, in
1996, 1998 and 2003, respectively.
From 2003 to 2005, he was with GCT
Semiconductor Inc., San Jose, CA,
where he was a Manager of the Analog Division and
worked on the design of the CMOS frequency synthesizer
for CDMA/PCS/PDC and single-chip CMOS RF chip sets
for W-CDMA, WLAN, and PHS. From 2005 to 2011, he
was with the Department of Electronics Engineering,
Konkuk University as an Associate Professor. Since 2012,
he has been with the School of Information and
Communication Engineering, Sungkyunkwan University,
where he is currently an Associate Professor. His research
interests include implementation of power integrated
circuits, CMOS RF transceiver, analog integrated circuits,
and analog/digital mixed-mode VLSI systems.