Implementation of the H.264/AVC Video Decoder on Intel PXA270/2700G Platform (以 Intel PXA270/2700G 實現 H.264/AVC 視訊解碼器)
Transcription
Student: CHEN, CHAO-TING (陳昭廷)
Advisor: Prof. Chau-Yun Hsu (許超雲)

Thesis for Master of Science
Institute of Communication Engineering, Tatung University
July 31, 2008

ACKNOWLEDGMENTS

Completing this thesis has been a great pleasure. I must first thank my advisor, Prof. Chau-Yun Hsu, who guided me tirelessly throughout the research in every season and all weather, who instilled many concepts in me and offered research ideas, and who pushed me toward ever better goals to reach today's result. I also thank my parents for raising me, giving me a good education, and providing a comfortable environment in which to learn and grow. It is thanks to my advisor's constant urging that this thesis could be completed smoothly, and I offer him my deepest gratitude.

In addition, I thank my seniors 郭宗勝 and 陳錫銘 of the Tatung Central Research Institute, who so often found time in their busy schedules to give me valuable advice, and who offered timely help and encouragement whenever I ran into problems with the thesis or felt discouraged.

Finally, I thank my parents, my family, my friends, and the members of the Hsu Group for their support and devotion, which allowed me to finish my master's degree free of worries and to end my student days at Tatung University on a perfect note.

CHINESE ABSTRACT

Because H.264/AVC video compression offers low bit rates and high picture quality, H.264/AVC video decoding plays a very important role in recent mobile multimedia products. However, H.264/AVC decoding has high computational complexity and accordingly demands a large amount of computation. This thesis describes the process of porting an H.264/AVC video decoder to the Intel PXA270/2700G platform, and proposes methods of memory use and layout that improve the performance and quality of the decoder. Finally, by adopting these methods, we improve the performance and quality of the H.264/AVC video decoder JMPlayer.

ABSTRACT

In mobile multimedia products, H.264/AVC video compression plays an important role due to its low bit rate and high quality. However, an H.264/AVC video decoder consumes considerable power because of its high computational complexity. This thesis describes the process of porting an H.264/AVC video decoder to the Intel PXA270/2700G platform. We propose methods of memory allocation and configuration to improve the performance and quality of the H.264/AVC video decoder (JMPlayer). Finally, by adopting these methods, we improve the performance and quality of the JMPlayer decoder realized on the Intel PXA270/2700G platform.

CONTENTS

Acknowledgments
Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Motivation
  1.2 Objective
Chapter 2 H.264/AVC Overview
  2.1 Video Compression
  2.2 MPEG and H.26x History
  2.3 H.264/AVC Video Decoder Data Flow
Chapter 3 Development Platform Overview
  3.1 Microsoft Windows CE Overview
  3.2 Intel PXA270/2700G Hardware Architecture
Chapter 4 Development Description
  4.1 Win CE Platform Development Environment
  4.2 The Development Flow of the Video Decoder
  4.3 The Components of the Video Decoder - JMPlayer
  4.4 Program
  4.5 Improvement
  4.6 Windows CE Performance Monitor
  4.7 Results
Chapter 5 Conclusions
References
Appendix - Program Flow

LIST OF FIGURES

Figure 2.1 (a) R, G, B components of a color image
Figure 2.1 (b) Cr, Cg, Cb components of a color image
Figure 2.2 Video sampling modes
Figure 2.3 Spatial and temporal redundancy
Figure 2.4 Video CODEC concepts
Figure 2.5 Motion estimation
Figure 2.6 Sample blocks
Figure 2.7 Motion vectors
Figure 2.8.1 Block size effects on motion estimation - Part 1
Figure 2.8.2 Block size effects on motion estimation - Part 2
Figure 2.9 Sub-pixel interpolation
Figure 2.10 Discrete cosine transform
Figure 2.11 Example of a quantization matrix
Figure 2.12 Zig-zag scan for a 16x16 macroblock
Figure 2.13 Evolution of video coding standards
Figure 2.14 H.264/AVC decoder data flow
Figure 3.1 Windows CE operating system structure
Figure 3.2 Basic GWES structure
Figure 3.3 PXA270 hardware architecture
Figure 3.4 PXA270/2700G system architecture block diagram
Figure 4.1 Win CE platform development environments
Figure 4.2 PXA270/2700G development environment - Ethernet card
Figure 4.3 (a) PC side - TCP/IP setting
Figure 4.3 (b) Target platform side - TCP/IP setting
Figure 4.4 The development flow of the H.264/AVC video decoder
Figure 4.5 Win CE platform block diagram, the components of the JMPlayer
Figure 4.6 Video data stream flow
Figure 4.7 JMPlayer.exe on the PXA270/2700G platform
Figure 4.8 Memory allocations in the Windows CE address space
Figure 4.9 The memory layout of the Win CE platform
Figure 4.10 Win CE platform monitor environments
Figure 4.11 Adding "JMPlayer.exe % Processor time" to the chart
Figure 4.12 The monitor result of JMPlayer ver. 1.0b
Figure 4.13 The monitor result of JMPlayer ver. 1.0c

LIST OF TABLES

Table 2.1 Comparison of the video coding standards
Table 3.1 Controls, menus, dialog boxes, and resources supported by Windows CE
Table 3.2 GDI features supported
Table 4.1 The resource/memory usage of the JMPlayer
Table 4.2 The quality of the JMPlayer

CHAPTER 1 INTRODUCTION

1.1 Motivation

The challenge of digital video is that raw, uncompressed video requires a great deal of data to store or transmit. For example, high-definition TV video is typically digitized at 1920x1080 using 4:2:2 YCrCb at 30 frames per second, which requires a data rate of over 1.39 Gbps [4]. This is many times more than what can be sustained on broadband networks such as ADSL or wireless WiFi, which today offer between 1 and 10 Mbps of sustained throughput. Clearly, compression is needed to store or transmit digital video.

1.2 Objective

Several suitable single processors are currently on the market. In this thesis, we use the Intel PXA270/2700G embedded platform to implement a video decoder. The PXA270/2700G is inexpensive and occupies a small area, but its performance for video processing is limited. The purpose of this thesis is to implement an H.264/AVC decoder on the Intel PXA270/2700G platform. The first phase of the work surveys the decoder on an x86 platform to obtain the decoding flow of H.264/AVC; the hardware video accelerator, the Intel 2700G, is then introduced to relieve the computational bottleneck. Meanwhile, scheduling and memory allocation methods are also considered for the best utilization of the data memories. The final product is expected to support H.264/AVC decoding of 176x144 QCIF video; we then take measurements to evaluate the performance of the H.264/AVC video decoder (JMPlayer).

CHAPTER 2 H.264/AVC OVERVIEW

2.1 Video Compression

Digital video compression technology has been booming for many years.
Today, when people chat with their friends through video telephony, or enjoy movies broadcast over the Internet and digital music such as MP3, the convenience that the digital video industry brings cannot be overlooked. All of this can be attributed to advances in mass storage media and streaming video/audio services, which have deeply influenced our daily lives.

2.1.1 Color Spaces: RGB and YUV

RGB is a very common color representation in computer graphics. An image can be thought of as consisting of three grayscale components (sometimes referred to as channels) [5]. R, G and B represent the light colors red, green and blue respectively. It is common knowledge that combining red, green and blue with different weights can produce any visible color; a numerical value indicates the proportion of each color.

The drawback of the RGB representation of a color image is that the three colors are equally important and must be stored with the same number of data bits. There is, however, another color representation, known as YUV, that can represent a color image more efficiently. Instead of using the colors of the light, YUV chooses the luminance (Y) and chrominance (UV) of the light to represent a color image. YUV uses RGB information, but it creates a black-and-white image (luma) from the full color image and then subtracts the three primary colors, resulting in additional signals (chroma: Cb, Cr) that describe the color. Combining the signals back together results in a full color image [4].

The luminance information Y can be calculated from R, G and B according to the following equation:

  Y = kr R + (1 - kb - kr) G + kb B    (2.1)

where the k's are weighting factors with kb + kr + kg = 1. The color difference (chroma) information can be derived as:

  Cb = 0.5 (B - Y) / (1 - kb)    (2.2)
  Cr = 0.5 (R - Y) / (1 - kr)    (2.3)
  Cg = G - Y                     (2.4)

In practice, only three components (Y, Cb and Cr) need to be transmitted for video coding, because Cg can be derived from Y, Cb and Cr.

ITU-R recommendation BT.601 [1] sets kb = 0.114 and kr = 0.299. The equations above can then be rewritten as:

  Y  = 0.299 R + 0.587 G + 0.114 B    (2.5)
  Cb = 0.564 (B - Y)                  (2.6)
  Cr = 0.713 (R - Y)                  (2.7)
  R  = Y + 1.402 Cr                   (2.8)
  G  = Y - 0.344 Cb - 0.714 Cr        (2.9)
  B  = Y + 1.772 Cb                   (2.10)

In practice, images are treated as 2D arrays, so R, G and B in the equations above are matrices, as are Y, U and V. Figure 2.1(a) shows the red, green and blue components of a color image, compared with the chroma components Cr, Cg and Cb in Figure 2.1(b).

[Figure 2.1 (a) R, G, B components of a color image]
[Figure 2.1 (b) Cr, Cg, Cb components of a color image]
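To make the BT.601 arithmetic concrete, here is a minimal C sketch that applies equations (2.5)-(2.7) to one 8-bit RGB pixel. It is an illustration of the weights above, not code from the JMPlayer decoder; the clamping helper and the +128 chroma offset (used so the signed chroma values fit in unsigned bytes, as 8-bit video formats commonly store them) are our own choices.

    /* Convert one 8-bit RGB pixel to YCbCr with the BT.601 weights
     * of equations (2.5)-(2.7). Illustrative sketch only. */
    static unsigned char clamp255(double v)
    {
        if (v < 0.0)   return 0;
        if (v > 255.0) return 255;
        return (unsigned char)(v + 0.5);    /* round to nearest */
    }

    void rgb_to_ycbcr(unsigned char r, unsigned char g, unsigned char b,
                      unsigned char *y, unsigned char *cb, unsigned char *cr)
    {
        double Y = 0.299 * r + 0.587 * g + 0.114 * b;   /* eq. (2.5) */
        *y  = clamp255(Y);
        *cb = clamp255(128.0 + 0.564 * (b - Y));        /* eq. (2.6) */
        *cr = clamp255(128.0 + 0.713 * (r - Y));        /* eq. (2.7) */
    }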
2.1.2 Video Sampling

The video source is normally a bit stream consisting of a series of frames or fields in decoding order [1]. Three YCbCr sampling modes are supported by MPEG-4 and H.264, as shown in Figure 2.2.

[Figure 2.2 Video sampling modes: (a) 4:2:0, (b) 4:2:2, (c) 4:4:4]

4:2:0 is the most commonly used sampling pattern. The sampling interval of the luminance samples Y is the same as in the video source, which means every pixel position is sampled. Cb and Cr have twice the sampling interval of the luminance in both the vertical and horizontal directions, as shown in Figure 2.2(a); every 4 luma samples share one Cb and one Cr sample. Considering that human eyes are more sensitive to luminance than to the color itself (chrominance), the resolution of the chrominance can be reduced without visibly degrading image quality. That is why 4:2:0 is very popular in current video compression standards, as quantified in the sketch below. In 4:2:2 mode, Cb and Cr have the same number of samples as the luma vertically and half that number horizontally. In 4:4:4 mode, Y, Cb and Cr have the same resolution in both directions.
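The savings from chroma subsampling are easy to quantify. The following C fragment, a small illustration of our own (the function name is hypothetical), computes the bytes needed for one 8-bit 4:2:0 frame at QCIF resolution, the size targeted by this thesis:

    #include <stdio.h>

    /* One 8-bit 4:2:0 frame: a full-resolution Y plane plus two chroma
     * planes subsampled 2:1 both horizontally and vertically. */
    static unsigned long yuv420_frame_bytes(unsigned w, unsigned h)
    {
        unsigned long luma   = (unsigned long)w * h;
        unsigned long chroma = (unsigned long)(w / 2) * (h / 2);
        return luma + 2 * chroma;
    }

    int main(void)
    {
        /* QCIF 176x144: 25344 + 2*6336 = 38016 bytes per frame,
         * half of the 76032 bytes a 4:4:4 frame would require. */
        printf("%lu\n", yuv420_frame_bytes(176, 144));
        return 0;
    }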
2.1.3 Reducing Redundancy

The basic idea of video compression is to compact an original video sequence (raw video) into a smaller one with fewer bits, such that the video can be recovered by reverse operations without significant loss of visual information. Compression is achieved by removing redundant information from the raw video sequence. There are three types of redundancy: temporal, spatial and frequency-domain redundancy.

2.1.3.1 Spatial and temporal redundancy

Pixel values are not independent; they are correlated with their neighbors both within the same frame and across frames. For example, if a large area in a frame shows very little variation, there is spatial redundancy between the adjacent pixels, and to some extent the value of a pixel is predictable given the values of its neighbors. A similar situation exists in the time domain: in most movies there is very little difference between consecutive frames, except when the object or content of the video changes quickly. This is known as temporal redundancy (see Figure 2.3).

[Figure 2.3 Spatial and temporal redundancy]

2.1.3.2 Frequency-domain redundancy

The human eye and brain (the human visual system) are more sensitive to lower frequencies, which means that removing strongly contrasting parts of a picture (such as the edges of objects) will not prevent the human eye from recognizing the picture.

2.1.4 Video CODEC

These redundancies can be removed by different methods. Temporal and spatial redundancy is usually reduced by motion estimation (and compensation), while frequency redundancy is reduced by the discrete cosine transform and quantization. After these operations, entropy coding can be applied to the resulting data to achieve further compression. Figure 2.4 illustrates a common video coding flow.

[Figure 2.4 Video CODEC concepts]

In the following sections, each function block is addressed in the order in which it appears in the video coding process [2].

2.1.5 Motion Estimation

The input to the coding system is an uncompressed video sequence, and motion estimation tries to exploit the similarities between successive video frames. As shown in Figure 2.5, for a given area in the current frame, if there is a corresponding area in a neighboring frame that is very similar to it, only the difference between the two regions needs to be coded and transmitted, not the complete information of the given area. The difference, also called the residual, is produced by subtracting the matched region from the current region.

[Figure 2.5 Motion estimation]

The basic idea of prediction is that a given area can be recovered jointly from the residual and the matched region (the decoder adds the residual to the prediction). Since many values within the residual are zero, temporal redundancy is reduced in this way. Multiple frames preceding and/or following the current frame can be used as references for the current frame.

In practice, motion estimation and compensation are usually based on rectangular blocks (MxN or NxN). The most common block size is 16x16 for the luminance component and 8x8 for the chrominance components. A 16x16 pixel region called a macroblock is the basic data unit for motion compensation in current video coding standards (the MPEG and ITU-T series). It consists of one 16x16 luminance sample block, one 8x8 Cb sample block and one 8x8 Cr sample block (see Figure 2.6).

[Figure 2.6 Sample blocks]

Theoretically, the smaller the block size, the better the motion estimation performance, so in the most recent standard, H.264/AVC, the size of the data unit for motion estimation is more flexible, with the minimum data unit going down to 4x4 [3].

2.1.6 Motion Vectors

As Figure 2.7 shows, a motion vector is a pair of values (dx, dy) indicating the position offsets, in both the vertical and horizontal directions, of the current macroblock relative to its best matching region. The motion vector is encoded and transmitted together with the residual.

[Figure 2.7 Motion vectors]

During decoding, the residual is added to the matching region to recover the current frame. With the help of the motion vectors, the matching region can be located in the reference frame.

2.1.7 Block Size Effect

Figure 2.8.1 shows three block sizes applied to one frame. Pictures (a) and (b) are the original frames, the previous frame n-1 and the current frame n; pictures (c) and (d) are the gray versions of (a) and (b). Pictures (e), (f) and (g) are derived from picture (c) with block sizes of 4x4, 8x8 and 16x16 respectively. Picture (d) is subtracted from picture (g) with motion compensation to produce the residual picture (j). The energy in the residual is reduced by motion compensating each 16x16 macroblock (see Figure 2.8.2 (j)). Motion compensating each 8x8 block reduces the residual energy further (see Figure 2.8.2 (i)), and motion compensating each 4x4 block gives the smallest residual energy of all (see Figure 2.8.2 (h)).

These examples show that smaller motion compensation block sizes can produce better motion compensation results. However, a smaller block size leads to increased complexity (more search operations must be carried out) and an increase in the number of motion vectors that need to be transmitted. Sending each motion vector costs bits, and the extra overhead for vectors may outweigh the benefit of reduced residual energy. An effective compromise is to adapt the block size to the picture characteristics. Obviously, the more mid-grey area in the residual, the more redundant information has been removed. To achieve higher compression efficiency, H.264/AVC chooses smaller block sizes for motion estimation; but as the redundant information within the residual decreases, more motion vectors must be encoded and transmitted. H.264/AVC therefore supports changing the block size dynamically according to the content of the frame.

[Figure 2.8.1 Block size effects on motion estimation - Part 1: (a) previous frame n-1, (b) current frame n, (c)(d) gray frames from (a) and (b), (e) MC block size 4x4, (f) MC block size 8x8, (g) MC block size 16x16]

[Figure 2.8.2 Block size effects on motion estimation - Part 2: residuals with motion compensation at block sizes 4x4 (h), 8x8 (i) and 16x16 (j). MC: motion compensation. The frames are from the "Superman Returns" trailer.]
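To make the search described above concrete, the sketch below shows a full-search block matcher for one 16x16 macroblock: every candidate offset within +/-R pixels is scored with the sum of absolute differences (SAD), and the offset with the smallest SAD becomes the motion vector. This is a generic textbook formulation under our own naming, not the search used in the JM reference software.

    #include <limits.h>
    #include <stdlib.h>

    /* cur and ref are W x H luma planes; (mx,my) is the top-left corner of
     * the current macroblock. Returns the best SAD and writes the motion
     * vector to (*dx,*dy); assumes at least one candidate fits the frame. */
    long block_match_16x16(const unsigned char *cur, const unsigned char *ref,
                           int W, int H, int mx, int my, int R,
                           int *dx, int *dy)
    {
        long best = LONG_MAX;
        int vx, vy, i, j;

        for (vy = -R; vy <= R; vy++) {
            for (vx = -R; vx <= R; vx++) {
                int x0 = mx + vx, y0 = my + vy;
                long sad = 0;
                if (x0 < 0 || y0 < 0 || x0 + 16 > W || y0 + 16 > H)
                    continue;      /* keep the candidate inside the frame */
                for (j = 0; j < 16; j++)
                    for (i = 0; i < 16; i++)
                        sad += labs((long)cur[(my + j) * W + (mx + i)]
                                  - (long)ref[(y0 + j) * W + (x0 + i)]);
                if (sad < best) { best = sad; *dx = vx; *dy = vy; }
            }
        }
        return best;
    }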
2.1.8 Sub-pixel Interpolation

The accuracy of motion compensation is measured in units of the distance between pixels. If the motion vector points to an integer-sample position, the prediction signal consists of the corresponding samples of the reference picture; otherwise, the corresponding sample is obtained by interpolating the non-integer positions [6]. Non-integer-position interpolation gives the encoder more choices when searching for the best matching region than integer motion estimation does, so the redundancy in the residual can be reduced further.

[Figure 2.9 Sub-pixel interpolation]

2.1.9 Discrete Cosine Transform

After motion estimation, the residual data can be converted into another domain (the transform domain) in order to minimize the frequency redundancy. Most transforms are block based, such as the Karhunen-Loeve transform (KLT), singular value decomposition (SVD) and the discrete cosine transform (DCT) [9]. We will only touch on the DCT in this thesis.

[Figure 2.10 Discrete cosine transform]

The discrete cosine transform (DCT) helps separate the image into parts (or spectral sub-bands) of differing importance with respect to the image's visual quality. The DCT is similar to the discrete Fourier transform: it transforms a signal or image from the spatial domain to the frequency domain. The DCT operates on an NxN sample block X in the form:

  Y = A X A^T

and the inverse DCT has the form:

  X = A^T Y A

where A is the NxN transform matrix and Y is the resulting sample block in the frequency domain. The elements of A are

  A_ij = C_i cos[(2j + 1) i pi / 2N]

where C_i = sqrt(1/N) for i = 0 and C_i = sqrt(2/N) for i > 0.

The general equations for the 2D DCT of N data items are:

  Y_xy = C_x C_y sum_{i=0..N-1} sum_{j=0..N-1} X_ij cos[(2j + 1) y pi / 2N] cos[(2i + 1) x pi / 2N]
  X_ij = sum_{x=0..N-1} sum_{y=0..N-1} C_x C_y Y_xy cos[(2j + 1) y pi / 2N] cos[(2i + 1) x pi / 2N]

The basic operation of the DCT is as follows: the input image is N by N; X(i, j) is the intensity of the pixel in row i and column j; Y(x, y) is the DCT coefficient matrix of the image in the DCT domain. Y(0, 0), the coefficient in the upper left corner, is defined as the DC coefficient, and all the rest are defined as AC coefficients.
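The matrix form Y = A X A^T translates directly into code. The following C sketch builds the 8x8 transform matrix A from the definition of A_ij above and applies the two matrix products; it is written for readability (real codecs, including H.264/AVC with its 4x4 integer transform, use fast fixed-point variants instead).

    #include <math.h>

    #define N 8
    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Forward 8x8 DCT: Y = A * X * A^T,
     * A[i][j] = Ci * cos((2j+1) * i * pi / 2N). Y[0][0] is the DC term. */
    void dct8x8(const double X[N][N], double Y[N][N])
    {
        double A[N][N], T[N][N];
        int i, j, k;

        for (i = 0; i < N; i++) {
            double Ci = (i == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);
            for (j = 0; j < N; j++)
                A[i][j] = Ci * cos((2 * j + 1) * i * M_PI / (2.0 * N));
        }
        for (i = 0; i < N; i++)          /* T = A * X */
            for (j = 0; j < N; j++) {
                T[i][j] = 0.0;
                for (k = 0; k < N; k++)
                    T[i][j] += A[i][k] * X[k][j];
            }
        for (i = 0; i < N; i++)          /* Y = T * A^T */
            for (j = 0; j < N; j++) {
                Y[i][j] = 0.0;
                for (k = 0; k < N; k++)
                    Y[i][j] += T[i][k] * A[j][k];   /* A^T[k][j] = A[j][k] */
            }
    }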
2.1.10 Quantization

After the DCT, quantization is employed to truncate the magnitudes of the DCT coefficients in order to reduce the number of bits that represent them. Quantization can be performed on each individual coefficient, which is known as scalar quantization (SQ), or on a group of coefficients together, which is known as vector quantization (VQ) [6]. A general example of quantization:

  FQ = round(X / QP)    (2.11)
  Y  = FQ * QP          (2.12)

The input value X is scaled by QP and rounded to the nearest integer. This operation is normally not reversible: some information is lost during rounding, and it is impossible to recover X to its original value. In video coding, quantization is often performed as vector quantization, which means X, QP and Y are matrices; for matrix quantization, equations 2.11 and 2.12 are still applicable. A typical 8x8 quantization matrix is shown in Figure 2.11. The coefficients close to the lower right corner of the quantization matrix are larger than those close to the upper left corner, because the intent is to quantize the high-frequency components of X more coarsely than the low-frequency components, discarding part of the high-frequency information.

[Figure 2.11 Example of a quantization matrix]

2.1.11 Zig-Zag Scan

After quantization, most of the non-zero DCT coefficients are located close to the upper left corner of the matrix. Through the zig-zag scan, the order of the coefficients is rearranged so that most of the zeros are grouped together in the output data stream. In the following stage, run-length coding, these strings of zeros can be encoded with very few bits.

[Figure 2.12 Zig-zag scan for a 16x16 macroblock]

2.1.12 Run-length Encoding

Run-length coding uses a series of (run, level) pairs to represent a string of data. For example, for the input data array {1, 3, 0, 0, 0, 8, 2, 0, 3, 0, 0, 2, ...} the output (run, level) pairs are (0,1), (0,3), (3,8), (0,2), (1,3), (2,2), .... Here "run" means how many zeros come before the next non-zero value, and "level" is the value of that non-zero datum.
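A minimal C version of this (run, level) scheme is shown below; fed the example array above, it prints (0,1) (0,3) (3,8) (0,2) (1,3) (2,2). The function is our own illustrative helper, not the entropy coder used in JM.

    #include <stdio.h>

    /* Emit (run, level) pairs: run = number of zeros before each
     * non-zero level. A real coder would end with an end-of-block code. */
    static void run_level_encode(const int *in, int n)
    {
        int run = 0, i;
        for (i = 0; i < n; i++) {
            if (in[i] == 0) {
                run++;
            } else {
                printf("(%d,%d) ", run, in[i]);
                run = 0;
            }
        }
        printf("\n");
    }

    int main(void)
    {
        int coeffs[] = {1, 3, 0, 0, 0, 8, 2, 0, 3, 0, 0, 2};
        run_level_encode(coeffs, sizeof coeffs / sizeof coeffs[0]);
        return 0;
    }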
2.1.13 Entropy Coding

The last stage in Figure 2.4 is entropy coding. The entropy encoder compresses the quantized data into a smaller number of bits for transmission. This is achieved by giving each value a unique code word based on the probability that the value occurs in the data stream: the more often a value appears in the stream, the fewer bits are assigned to its code word. The most commonly used entropy encoders are the Huffman encoder and the arithmetic encoder, although for applications requiring fast execution, simple run-length encoding (RLE) has proven very effective [6]. Two advanced entropy coding methods, CAVLC (context-based adaptive variable length coding) and CABAC (context-based adaptive binary arithmetic coding), are adopted by H.264/AVC. These two methods improve coding efficiency compared with the methods applied in previous standards.

2.2 MPEG and H.26x History

2.2.1 ISO/IEC, ITU-T and JVT

ISO/IEC and ITU-T are the two main international standards organizations for the coding of video, audio and their combination. The H.26x family of standards is designed by ITU-T. As the ITU Telecommunication Standardization Sector, ITU-T is a permanent organ of the ITU, responsible for studying technical, operating and tariff questions and issuing Recommendations on them with a view to standardizing telecommunications on a worldwide basis [1]. H.261 was the first version of the H.26x series, started in 1984; in the following years H.262, H.263, H.263+, H.263++ and H.264 were released by ITU-T.

The MPEG family of standards includes MPEG-1, MPEG-2 and MPEG-4, formally known as ISO/IEC 11172, ISO/IEC 13818 and ISO/IEC 14496. MPEG was originally the name given to the group of experts that developed these standards. The MPEG working group (formally ISO/IEC JTC1/SC29/WG11) is part of JTC1, the joint ISO/IEC technical committee on information technology. The Joint Video Team (JVT) consists of members from ISO/IEC JTC1/SC29/WG11 (MPEG) and ITU-T SG16 Q.6 (VCEG); they published the H.264 Recommendation / MPEG-4 Part 10 standard.

2.2.2 MPEG-4

MPEG-4 (ISO/IEC 14496) became an international standard in 1999. Its basic coding theory remains the same as in previous MPEG standards, but it is more network oriented and better suited to broadcast, interactive and conversational environments. MPEG-4 introduced the concept of 'objects': a video object in a scene is an entity that a user is allowed to access (seek, browse) and manipulate (cut and paste). It serves bit rates from 2 Kbit/s for speech and 5 Kbit/s for video up to 5 Mbit/s for transparent-quality video and 64 Kbit/s per channel for CD-quality audio. It also defined profiles and levels as subsets of the entire bit stream syntax.

2.2.3 MPEG-4 Part 10 & H.264/AVC

The newest standard, H.264/AVC (also known as MPEG-4 Part 10), was jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The final approval submission as H.264/AVC [1] was released in March 2003. The motivation for this standard comes from the growing demand for multimedia services and the popularity of HDTV, which need more efficient coding methods. At the same time, various transmission media, especially low-speed media (cable modem, xDSL or UMTS), also call for a significant enhancement of coding efficiency. By introducing several unique techniques, H.264/AVC aims to increase the compression rate significantly (saving up to 50% of the bit rate compared with MPEG-2 at the same picture quality), to transmit high-quality images at both high and low bit rates, and to achieve better error robustness and network friendliness. A fully specified decoding process ensures that there is no mismatch during decoding, and the Network Abstraction Layer allows H.264 to be transported over different networks [2].

[Figure 2.13 Evolution of video coding standards]

2.2.4 MPEG and H.264/AVC Comparison

The ITU-T Video Coding Experts Group developed the H.261 standard for video conferencing applications; it offered reasonable compression performance with relatively low complexity. It was superseded by the popular H.263 standard, which offered better performance through features such as half-pixel motion compensation and improved variable-length coding. Two further versions of H.263 were released, each offering additional optional coding modes to support better compression efficiency and greater flexibility. The latest version (version 3) includes 19 optional modes, but is constrained by the requirement to support the original 'baseline' H.263 CODEC. The H.26L standard, under development at the time [8] was written, incorporates a number of new coding tools, such as a 4x4 block transform and flexible motion vector options, and promises to outperform earlier standards. Comparing the performance of the various coding standards is difficult, because a direct rate-distortion comparison does not take into account other factors such as features, flexibility and market penetration.

Table 2.1 Comparison of the video coding standards (coding performance: 1 = lowest, 5 = highest)

  Standard  | Target application      | Performance | Features
  MJPEG     | Image coding            | 1 | Scalable and lossless coding modes
  H.261     | Video conferencing      | 2 | Integer-pixel motion compensation; motion estimation on 16x16 blocks
  MPEG-1    | Video-CD                | 3 | I, P, B-pictures; half-pixel compensation
  MPEG-2    | Digital TV              | 3 | As above; field coding, scalable coding
  H.263     | Video conferencing      | 4 | Optimized for low bit rates; many optional modes; motion estimation on 16x16 or 8x8 blocks
  MPEG-4    | Multimedia conferencing | 4 | Many options, including content-based tools
  H.264/AVC | Video coding            | 5 | 4x4 DCT, 2x2 DHT; motion estimation from 16x16 down to 4x4; CAVLC, CABAC

It seems clear that the H.263, MPEG-2 and MPEG-4 standards each have their advantages for designers of video communication systems. Each of these standards makes use of common coding technologies: motion estimation and compensation, block transformation and entropy coding [8].
2.3 H.264/AVC Video Decoder Data Flow

[Figure 2.14 H.264/AVC decoder data flow]

1. A compressed bit stream (NAL) is entropy decoded to extract the coefficients, motion vector and header for each macroblock.
2. Run-level coding and reordering are reversed to produce a quantized, transformed macroblock X.
3. X is rescaled and inverse transformed to produce a decoded residual D'.
4. The decoded motion vector is used to locate a 16x16 region in the decoder's copy of the previous (reference) frame F'(n-1). This region becomes the motion-compensated prediction P.
5. P is added to D' to produce a reconstructed macroblock. The reconstructed macroblocks are saved to produce the decoded frame F'(n).

After a complete frame is decoded, F'(n) is ready to be displayed and may also be stored as a reference frame for the next decoded frame F'(n+1). It is clear from the figures and from the above explanation that the encoder includes a decoding path (rescale, IDCT, reconstruct). This is necessary to ensure that the encoder and decoder use identical reference frames F'(n-1) for motion-compensated prediction [9].
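The five steps can be condensed into the per-macroblock skeleton below. All type and function names are illustrative placeholders of our own; the JM decoder spreads this work across many source files, but the order of operations is the one just described.

    /* Condensed H.264 decode path for one macroblock (steps 1-5 above).
     * Every type and helper here is an illustrative placeholder. */
    typedef struct Bitstream Bitstream;   /* opaque NAL/bit stream state */
    typedef struct Frame Frame;           /* opaque decoded picture      */
    typedef struct { int mvx, mvy; short coeff[16 * 16]; } MBData;

    void entropy_decode(Bitstream *nal, MBData *mb);           /* step 1 */
    void reorder_and_run_level_decode(MBData *mb);             /* step 2 */
    void rescale_and_idct(MBData *mb);                         /* step 3 */
    void motion_compensate(const Frame *ref, const MBData *mb,
                           int mbx, int mby, Frame *out);      /* step 4 */
    void add_residual(Frame *out, const MBData *mb,
                      int mbx, int mby);                       /* step 5 */

    void decode_macroblock(Bitstream *nal, const Frame *ref,
                           Frame *out, int mbx, int mby)
    {
        MBData mb;
        entropy_decode(nal, &mb);
        reorder_and_run_level_decode(&mb);
        rescale_and_idct(&mb);
        motion_compensate(ref, &mb, mbx, mby, out);
        add_residual(out, &mb, mbx, mby);
    }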
CHAPTER 3 DEVELOPMENT PLATFORM OVERVIEW

3.1 Microsoft Windows CE Overview

Microsoft Windows CE is a compact, highly efficient, multiplatform operating system. It is not a reduced version of Microsoft Windows 95/XP; it was designed from the ground up as a multithreaded, fully preemptive, multitasking operating system for platforms with limited resources. Its modular design allows it to be customized for products ranging from consumer electronic devices to specialized industrial controllers.

General features of Windows CE:
- Provides a modular operating system that you can customize for specific products. The basic core of the operating system requires less than 200 KB of ROM.
- Provides interrupt delivery, prioritizing, and servicing.
- Runs on a wide variety of platforms.
- Supports more than 1,000 of the most frequently used Microsoft Win32 functions, along with familiar development models and tools.
- Supports a variety of user-interface hardware, including touch screens and color displays with up to 32-bits-per-pixel color depth.
- Supports a variety of serial and network communication technologies.
- Supports Mobile Channels to provide Web services for Windows CE users.
- Supports COM/OLE, Automation, and other advanced methods of inter-process communication.

Windows CE has four primary modules or groups of modules:
- The kernel supports basic services, such as process and thread handling and memory management.
- The file system supports persistent storage of information.
- The graphics, windowing and events subsystem (GWES) controls graphics and window-related features.
- The communications interface supports the exchange of information with other devices [14].

[Figure 3.1 Windows CE operating system structure]

3.1.1 Kernel

The kernel, the core of the operating system, provides system services for managing threads, memory, and resources. It includes:
- Preemptive, priority-based thread scheduling based on the Win32 process and thread model. Priority inversion is prevented with a system of priority inheritance that dynamically adjusts thread priorities.
- Predictable thread synchronization mechanisms, including wait objects. Examples of these mechanisms are named mutexes, critical sections, and named and unnamed event objects.
- Efficient memory management based on dynamic-link libraries (DLLs), which link to user applications at run time.
- A flat, virtual address space, with 32 MB of memory reserved for each process. Process memory is protected by altering page protections.
- On-demand paging for both read-only memory (ROM) and random access memory (RAM).
- Heap sizes that are limited only by available memory.
- Control of interrupt handling. You can map interrupt requests (IRQs) to hardware interrupts and implement your own interrupt service routines and interrupt service threads.
- Extensive debugging support, including just-in-time debugging.

3.1.2 Persistent Storage

The file system supports persistent storage of information. It includes:
- Support for FAT file systems with up to nine FAT volumes.
- Transactioned file handling to protect against data loss.
- Demand paging for devices that support paging.
- FAT file system mirroring to allow preservation of the file system if power is lost or a cold reset is needed.
- Installable block device drivers.

3.1.3 Communications Interface

The communications interface supports a wide range of technologies. It includes:
- Support for serial communications, including infrared links.
- Support for Internet client applications, including the Hypertext Transfer Protocol (HTTP) and File Transfer Protocol (FTP).
- A Common Internet File System (CIFS) redirector for access to remote file systems by means of the Internet.
- A subset of Windows Sockets (Winsock) version 1.1, plus support for Secure Sockets.
- A Transmission Control Protocol/Internet Protocol (TCP/IP) transport layer configurable for wireless networking.
- An Infrared Data Association (IrDA) transport layer for robust infrared communication.
- Both the Point-to-Point Protocol (PPP) and the Serial Line Internet Protocol (SLIP) for serial-link networking.
- Support for local area networking through the Network Driver Interface Specification (NDIS).
- Support for managing phone connections with the Telephony API (TAPI).
- A Remote Access Service (RAS) client for connections to remote file systems by modem.

3.1.4 Graphics, Windowing, and Events Subsystem (GWES)

The GWES module supports the graphics and windowing functionality needed to display text and images and to receive user input. It includes:
- Support for a broad range of window styles, including overlapping windows.
- A large selection of customizable controls.
- Support for keyboard and stylus input.
- A command bar combining the functionality of a toolbar and a menu bar.
- An Out of Memory dialog box that requests user action when the system is low on memory.
- Full Unicode support.
- A multiplatform graphics device interface (GDI) that supports both color and grayscale displays with color depths of up to 32 bits per pixel, palette management, TrueType and raster fonts, and printer, memory, and display device contexts (DCs).

3.1.5 Handling, Multitasking, and Multithreading

The Windows CE kernel contains the core operating system functionality that must be present on all Windows CE-based platforms. It includes support for memory management, process management, exception handling, multitasking, and multithreading. The Windows CE kernel borrows much of what is best from Windows-based desktop platforms. For example, all Windows CE-based applications run in a fully preemptive, multitasking environment, in protected memory spaces. Windows CE supports native Unicode strings, allowing you to internationalize applications.

Unlike the kernels found on Windows-based desktop platforms, the Windows CE kernel uses DLLs to maximize available memory. The DLLs are written as reentrant code, which allows applications to simultaneously share common routines. This approach minimizes the amount of memory-resident code required to execute applications.
3.1.6 Processes and Threads

As a multitasking operating system, Windows CE can support up to 32 simultaneous processes, each process being a single instance of an application. In addition, multithreading support allows each process to create multiple threads of execution. A thread is a part of a process that runs concurrently with other parts. Threads operate independently, but each belongs to a particular process and shares the same memory space. The total number of threads is limited only by available physical memory.

Processes rely on Win32 messages to initiate processing, control system resources, and communicate with the operating system and the user. Each process has its own message queue. For multithreaded applications, each thread also has its own separate message queue. When there are no messages in the queue and the thread is not engaged in any other activity, the system suspends the thread, saving CPU resources.

Although a thread can operate independently, it often needs to be managed by the process; for example, one thread may depend on another for information. Thread synchronization suspends a thread's execution until the thread receives notification to proceed. Windows CE supports thread synchronization by providing a set of wait objects, which stop a thread until a change in the wait object signals the thread to proceed. Supported wait objects include critical sections, named and unnamed events, and named mutex objects. Windows CE implements thread synchronization with minimal processor resources, an important feature for many battery-powered devices. And, unlike many operating systems, Windows CE uses the kernel to handle thread-related tasks such as scheduling, synchronization, and resource management; consequently, an application need not poll for process or thread completion or perform other thread-management functions.

Because Windows CE is preemptive, it allows the execution of a process or thread to be preempted by one with higher priority. It uses a priority-based, time-slice algorithm, with eight levels of thread priority, for thread scheduling.
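As a small illustration of this thread and wait-object model, the Win32 C fragment below starts a worker thread and blocks on an event until the worker signals completion; the waiting thread is suspended by the kernel and consumes no CPU in the meantime. These calls (CreateThread, CreateEvent, WaitForSingleObject) belong to the Win32 subset that Windows CE supports; error handling is omitted for brevity, and the event name is arbitrary.

    #include <windows.h>

    static HANDLE g_done;                 /* event used as a wait object */

    static DWORD WINAPI Worker(LPVOID param)
    {
        (void)param;
        /* ... work (e.g., decoding) would go here ... */
        SetEvent(g_done);                 /* wake the waiting thread */
        return 0;
    }

    void StartAndWait(void)
    {
        HANDLE thread;

        g_done = CreateEvent(NULL, FALSE, FALSE, TEXT("WorkerDone"));
        thread = CreateThread(NULL, 0, Worker, NULL, 0, NULL);

        /* Suspended here, at no CPU cost, until Worker signals. */
        WaitForSingleObject(g_done, INFINITE);

        CloseHandle(thread);
        CloseHandle(g_done);
    }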
3.1.7 Memory Architecture

The Windows CE kernel supports a single flat, or unsegmented, virtual address space that all processes share. Instead of assigning each process a different address space, Windows CE protects process memory by altering page protections. Because the kernel maps virtual addresses onto physical memory, you do not need to be concerned with the physical layout of the target system's memory.

Approximately 1 GB of virtual memory is available to processes. It is divided into 33 slots, each 32 MB in size. The kernel protects each process by assigning it to a unique slot, with one slot reserved for the currently running process. Thus the number of processes is limited to 32, but there is no limit, aside from physical memory, on the total number of threads. The kernel prevents an application from accessing memory outside of its allocated slot by generating an exception. Applications can check for, and handle, such exceptions by using the try-except statement.

Windows CE allows memory mapping, which permits multiple processes to share the same physical memory. Memory mapping results in very fast data transfer between cooperating processes, or between a driver and an application. Approximately 1 GB of virtual address space, distinct from that used for the slots, is allocated for memory mapping.

Windows CE always allocates memory to applications one page at a time. The system designer specifies the page size when the operating system is built for the target hardware platform. On a Handheld PC, for example, the page size is typically either 1 KB or 4 KB.

3.1.8 Physical Memory Usage

Windows CE-based platforms usually have no disk drive. Therefore, physical memory, typically consisting of a combination of ROM and RAM, plays a substantially different role on a Windows CE-based platform than it does on a desktop computer. Because ROM cannot be modified by the user, it is used for permanent storage. The contents of ROM, determined by the original equipment manufacturer (OEM), include the operating system and any built-in applications that the manufacturer provides, for example Microsoft Pocket Word and Microsoft Pocket Excel on a Windows CE-based platform. Depending on your product requirements, you can also place application code in ROM.

Because RAM on most Windows CE systems is maintained continuously, it is effectively nonvolatile. This feature allows applications to use RAM for persistent storage as well as program execution, compensating for the lack of a disk drive. To serve these two purposes, RAM is divided into storage, also known as the object store, and program memory. Program memory is used for program execution, while the object store is used for persistent storage of data and of any executable code not stored in ROM.

To minimize RAM requirements on Windows CE-based devices, executable code stored in ROM usually executes in place, not in RAM. Because of this, the operating system needs only a small amount of RAM for purposes such as stack and heap storage. Applications are commonly stored and executed in RAM; this approach is used primarily by third-party applications that are added by the user. Because RAM-based applications are stored in compressed form, they must be uncompressed and loaded into program memory for execution. To increase the performance of application software and reduce RAM use, Windows CE supports on-demand paging: the operating system needs to uncompress and load only the memory page containing the portion of the application that is currently executing. When execution is finished, the page can be swapped out and the next page can be loaded.

Like RAM-based applications, ROM-based executable code, including DLLs, can be compressed. When compressed, the code does not execute in place, but is handled much like its RAM-based counterpart: it is uncompressed and loaded a page at a time into RAM program memory, and then swapped out when no longer needed.

3.1.9 Graphics, Windowing, and Event Subsystem

The Graphics, Windowing and Event Subsystem (GWES) is the graphical user interface between the user, your application, and the operating system. GWES handles user input by translating keystrokes, stylus movements, and control selections into messages that convey information to applications and the operating system. GWES handles output to the user by creating and managing the windows, graphics, and text that are displayed on display devices and printers.

GWES supports all the windows, dialog boxes, controls, menus, and resources that make up the Windows CE user interface. This interface allows users to control applications by choosing menu commands, pushing buttons, checking and unchecking boxes, and manipulating a variety of other controls. GWES provides information to the user in the form of bitmaps, carets, cursors, text, and icons.
Even Windows CE-based platforms that lack a graphical user interface use the basic windowing and messaging capabilities of GWES. These provide the means for communication between the user, the application, and the operating system. As part of GWES, Windows CE provides support for active power management to extend the limited lifetime of battery-operated devices: the operating system automatically determines a power consumption level to match the state of operation of the device. Figure 3.2 describes the GWES structure.

[Figure 3.2 Basic GWES structure]

3.1.9.1 Window Management

The most central feature of GWES is the window. On Windows CE-based platforms with traditional graphical displays, the window is the rectangular area of the screen where an application displays output and receives input from the user. However, all applications need windows in order to receive messages from the operating system, even applications created for devices that lack graphical displays.

When you create a window, Windows CE creates a message queue for the window. The operating system translates the information it receives from the user into messages, which it places into the message queue of the active window. The application processes most of these messages and passes the rest back to Windows CE for processing.

Windows CE does not send applications any messages dealing with the nonclient area of the window. A window's nonclient area is the area of the window where an application is not allowed to draw, such as the title bar and scroll bars; the window manager controls the nonclient area. Windows CE does not support the Maximize and Minimize buttons. A user can send a window to the back of the Z order by tapping the window's button on the taskbar, and restore the window by tapping its taskbar button again. The taskbar is always visible on Windows CE: you cannot hide the taskbar or use the full screen to display a window.

3.1.10 Controls, Menus, Dialog Boxes, and Resources

GWES provides controls, menus, dialog boxes, and resources to give the user a standard way to make selections, carry out commands, and perform input and output tasks. Controls and dialog boxes are child windows that allow users to view and organize information and to set or change attributes; a dialog box is a window that contains controls. All menus in Windows CE are implemented as top-level pop-up windows. Windows CE supports scrolling menus that automatically add scroll arrows when a menu does not fit on the screen.

Table 3.1 Controls, menus, dialog boxes, and resources supported by Windows CE

  Application-defined dialog boxes, bitmaps, carets, check boxes, combo boxes, command bands, command bars, common dialog boxes, cursors, the custom draw service, date and time picker controls, edit controls, group boxes, header controls, icons, image lists, images, keyboard accelerators, list boxes, list views, menus, message boxes, scroll bars, static controls, status bars, strings, tree views, and up-down controls.

Windows CE does not support menu bars, but it does support command bars, which combine the functionality of a menu bar and a toolbar in one control. Command bars make efficient use of the limited space available on many Windows CE-based devices. In addition to the controls listed in Table 3.1, Windows CE supports the HTML viewer control, which makes it easier to add HTML support to your applications.
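In code, the window-and-message-queue model of section 3.1.9.1 reduces to the familiar Win32 skeleton below: register a window class, create a window, and pump messages from the queue that GWES fills. The sketch compiles against the Win32 subset of Windows CE; the class and window names are arbitrary.

    #include <windows.h>

    static LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp)
    {
        switch (msg) {
        case WM_DESTROY:
            PostQuitMessage(0);           /* ends the message loop below */
            return 0;
        }
        return DefWindowProc(hwnd, msg, wp, lp);  /* default handling */
    }

    int WINAPI WinMain(HINSTANCE hInst, HINSTANCE hPrev, LPWSTR cmd, int show)
    {
        WNDCLASS wc = {0};
        HWND hwnd;
        MSG msg;

        wc.lpfnWndProc   = WndProc;
        wc.hInstance     = hInst;
        wc.lpszClassName = TEXT("DemoWnd");
        RegisterClass(&wc);

        hwnd = CreateWindow(TEXT("DemoWnd"), TEXT("Demo"), WS_VISIBLE,
                            CW_USEDEFAULT, CW_USEDEFAULT,
                            CW_USEDEFAULT, CW_USEDEFAULT,
                            NULL, NULL, hInst, NULL);

        /* GWES puts translated user input into this window's message
         * queue; the loop retrieves it and dispatches it to WndProc. */
        while (GetMessage(&msg, NULL, 0, 0) > 0) {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
        return (int)msg.wParam;
    }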
3.1.11 Graphics Device Interface

The graphics device interface (GDI) is the GWES subsystem that controls the display of text and graphics. You use GDI to draw lines, curves, closed figures, text, and bitmapped images. GDI uses a device context (DC) to store the information it needs to display text and graphics on a specified device. The graphic objects stored in a DC include a pen for line drawing, a brush for painting and filling, a font for text output, a bitmap for copying or scrolling, a palette for defining the available colors, and a region for clipping. Windows CE supports printer DCs for drawing on printers, display DCs for drawing on video displays, and memory DCs for drawing in memory.

Table 3.2 GDI features supported

  Raster and TrueType fonts: only one of the two can be used on a specified system. TrueType fonts generate superior text output because they are scalable and rotatable.
  Custom color palettes, and both palettized and nonpalettized color display devices: supports color formats of 1, 2, 4, 8, 16, 24, and 32 bits per pixel (bpp); the first two are unique to Windows CE.
  Bit block transfer functions and raster operation codes: allow you to transform and combine bitmaps in a wide variety of ways.
  Pens and brushes: supports dashed, wide, and solid pens, and patterned brushes.
  Printing: supports full graphical printing.
  Shape drawing functions: supports the ellipse, polygon, rectangle, and round rectangle shapes.

3.2 Intel PXA270/2700G Hardware Architecture

The PXA27x processor offers an integrated system-on-a-chip design based on the Intel XScale microarchitecture. It integrates the Intel XScale core with many on-chip peripherals, allowing the design of many different products for the handheld and cellular handset markets [15].

[Figure 3.3 PXA270 hardware architecture]

3.2.1 CPU

The PXA27x processor is an implementation of the Intel XScale microarchitecture, which is described in the Intel XScale Core Developer's Manual. The characteristics of this particular implementation include the following:
- Several coprocessor registers
- Semaphores and interrupts for processor control
- Multiple reset mechanisms
- Sophisticated power management
- Highly multiplexed pin usage

3.2.2 Internal Memory

The PXA27x processor provides 256 KB of internal memory-mapped SRAM, divided into four banks of 64 KB each. Features:
- 256 KB of on-chip SRAM arranged as four banks of 64 KB
- Bank-by-bank power management for reduced power consumption
- Byte write support

3.2.3 Memory Controller

The external memory-bus interface of the PXA27x processor supports SDRAM; synchronous and asynchronous burst-mode and page-mode flash memory; page-mode ROM; SRAM; PC Card; and CompactFlash expansion memory. Memory types are programmable through the memory interface configuration registers. Memory requests are placed in a four-deep processing queue and processed in the order they are received.

3.2.4 LCD Controller

The LCD/flat-panel controller is backward compatible with the Intel PXA25x and PXA26x processor LCD controllers, and several additional features are supported.
One addition is support for pixel formats of 18, 19, 24, and 25 bits per pixel (bpp). The following list describes features supported by the PXA27x processor LCD controller:
- Display modes:
  - Support for single- or dual-scan display modules
  - Passive monochrome mode supports up to 256 gray-scale levels (8 bits)
  - Active color mode supports up to 16,777,216 colors (24 bits)
  - Passive color mode supports a total of 16,777,216 colors (24 bits)
  - Support for LCD panels with an internal frame buffer
  - Support for 8-bit (each) passive dual-scan color displays
  - Support for up to 18-bit-per-pixel single-scan color displays without an internal frame buffer
  - Support for up to 24-bit-per-pixel single-scan color displays with an internal frame buffer

3.2.5 DMA Controller

The PXA27x processor contains a direct memory access (DMA) controller that transfers data to and from memory in response to requests generated by peripheral devices or companion chips. The peripheral devices and companion chips do not directly supply addresses and commands to the memory controller; instead, the state required to manage a data stream is maintained in 32 DMA channels [16].

3.2.6 2700G Multimedia Accelerator

As the multimedia capabilities of handheld devices grow to include higher resolutions, high-quality displays, complete graphical user interfaces (GUIs), multimedia, 3D capabilities, and video playback, it becomes necessary to include a graphics accelerator in many handheld devices. The 2700G Multimedia Accelerator is a low-power, full-featured graphics accelerator optimized for Intel Personal Client Architecture solutions, and it supports the PXA27x processor family. The 2700G provides high-performance 2D, 3D, MPEG-2, MPEG-4, and Windows Media Video (WMV) acceleration as well as dual-display capabilities, and supports resolutions up to XGA (1024 x 768 x 24 bpp) color.

The 2700G's 3D acceleration provides a complete hardware 3D rendering pipeline, with a processing capability of 831K triangles per second and a fill rate of 84 million pixels per second. Advanced 3D hardware acceleration includes features such as texture and light mapping; point, bilinear and anisotropic filtering; alpha blending; dual texturing support; deferred texturing; screen tiling; texture compression; and full-screen anti-aliasing. The 2700G's 2D acceleration includes support for clipping and anti-aliasing; all 256 raster operations (source, pattern, and destination) defined by Microsoft are supported.

For optimal performance and lowest power, the following video-decode capabilities are accelerated in 2700G hardware: inverse zig-zag (IZZ), inverse discrete cosine transform (IDCT), motion compensation (MC), color space conversion (CSC), and scaling. The 2700G hardware performs inverse zig-zag, inverse DCT, and motion compensation for MPEG-1/2/4 video streams, and motion compensation for WMV video streams [17]. The 2700G also adds flexibility to a system's display capabilities.

[Figure 3.4 PXA270/2700G system architecture block diagram]

With this device, a system can use a wide variety of displays, including integrated LCDs of various resolutions, color depths, and refresh rates, as well as external displays (e.g., analog CRTs, TVs, or digital flat panels).
When paired with the applications processor, the 2700G Multimedia Accelerator is capable of simultaneously driving separate display streams to two separate displays [18].

CHAPTER 4 DEVELOPMENT DESCRIPTION

4.1 Win CE Platform Development Environment

One thing you can do with Microsoft Windows CE 5.0 Platform Builder is use Ethernet for connectivity. Ethernet, the network hardware used extensively throughout the Internet, is a very fast way (at 10 Mbps) for two or more computers to communicate. You should be able to access your target platform (PXA270/2700G) from Windows XP via Ethernet, much like Microsoft's peer-to-peer networking. On corporate networks that use DHCP to assign settings automatically, connecting the target platform (PXA270/2700G) via Ethernet is even easier.

[Figure 4.1 Win CE platform development environments: a development PC running Platform Builder and Windows Embedded VC++, connected through a hub to the target (OS: Win CE 5.0; hardware: Intel PXA270 & 2700G; H.264 player: JMPlayer)]

4.1.1 Required Hardware and Software

You will need the following before getting started:
- A PC with Windows XP.
- Windows CE Platform Builder 5.0.
- A desktop or laptop Ethernet card installed, including the Microsoft networking client and TCP/IP (you can check this from the Network Control Panel).
- An NE2000 Ethernet PC Card from Socket Communications (http://www.socketcom.com/) for your target platform (PXA270/2700G). Note: other NE2000-compatible cards may work with the Handheld PC but are not yet certified at this writing; 3Com and Xircom are working on drivers for their Ethernet PC Cards as well.
- An Ethernet hub, or a crossover cable, to plug both units into.

[Figure 4.2 PXA270/2700G development environment - Ethernet card]

4.1.2 PC Configuration

You must configure your PC for Ethernet and install the Microsoft Network Client and the TCP/IP protocol; don't forget to fill in the identification section. This is done from Start, Settings, Control Panel, Network. I recommend using the address 192.168.0.1 with subnet mask 255.255.255.0.

[Figure 4.3 (a) PC side - TCP/IP setting]

4.1.3 Configure the TCP/IP Properties on Your Desktop PC

Before you begin using Ethernet, you must have Windows CE Platform Builder 5.0 installed. Also, you must use your serial cable to establish a connection to your PC for Ethernet to work; this is the only way that your target platform (PXA270/2700G) learns the name of your desktop.

4.1.4 Target Platform (PXA270/2700G) Configuration

Establish a serial connection to the PC with which you are going to use ActiveSync. This is required to put the PC's computer name into your target platform (PXA270/2700G) for use with Ethernet. Install the Ethernet drivers from Windows CE Platform Builder 5.0; they are located with the optional software. You need to copy the drivers to your target platform's /Windows directory, which will add a Network Control Panel as well as other relevant files. They require less than 200 KB of free RAM allocated for storage. You then need to reset the device to load the new components.

Next, configure the Network Control Panel for Ethernet on your target platform (under Start, Settings, Control Panel). You must fill in the IP address for the target platform (I suggest 192.168.0.2 with subnet mask 255.255.255.0) and the WINS server address, which is the PC's IP address (192.168.0.1 using my recommended setting). Leave the other fields blank.
[Figure 4.3 (b) Target platform side - TCP/IP setting: properly configured Network Control Panel settings on the target platform]

4.1.5 Connecting

Plug both the target platform and the PC into the hub. (If you are using a crossover cable, you can connect the target platform and the PC directly, without a hub.) Turn on the PC and the Handheld PC and plug the Ethernet PC Card into the Handheld PC. Check for a link light on the PC Card; if it is not lit, you have a cable or hub problem. To start the connection, select ActiveSync on your target platform (under Start, Programs, Communications). Make sure the connection is set to networking and the host matches the PC computer name, then click Connect to start communications. ActiveSync must be running at all times when you are using Windows CE Platform Builder with your target platform.

4.1.6 Download the Application (JMPlayer) via the Network Connection

If you enable continuous synchronization, your target platform will stay up to date, downloading the new application (JMPlayer) and other files and data while it is connected to your desktop PC.

4.2 The Development Flow of the Video Decoder

[Figure 4.4 The development flow of the H.264/AVC video decoder]

4.2.1 Task I - Survey stage of the H.264/AVC decoder [12]
  Hardware: x86. OS: Windows XP. Decoder: JM V10.1.

4.2.2 Task II_a - Simulation stage of the H.264/AVC decoder
  Hardware: x86. OS: embedded Windows CE 5.0 (CEPC). Decoder: JM V10.1 -> JMPlayer V1.0.1.

4.2.3 Task II_b - Implementation stage of the H.264/AVC decoder
  Hardware: Intel PXA270/2700G embedded platform. OS: embedded Windows CE 5.0. Decoder: JMPlayer V1.0.1b.

4.2.4 Task II_c - Fine-tuning stage of the H.264/AVC decoder
  Hardware: Intel PXA270/2700G embedded platform. OS: embedded Windows CE 5.0. Decoder: JMPlayer V1.0.1c.

For the detailed program flow of JMPlayer V1.0.1 (top flow, decode one frame, read new slice, decode one slice, read macroblock, decode macroblock), see the Appendix - Program Flow [11].

4.3 The Components of the Video Decoder - JMPlayer

[Figure 4.5 Win CE platform block diagram, the components of the JMPlayer]

As shown in Figure 4.5, the components of the JMPlayer are the "Application" (the H.264/AVC video decoder), the "Embedded Shell", the "File Manager", the "Device Manager", the "Device Driver", the "OAL Bootloader", and the "OEM Hardware" (Intel PXA270/2700G).

[Figure 4.6 Video data stream flow]

4.4 Program

[Figure 4.7 JMPlayer.exe on the PXA270/2700G platform]

4.5 Improvement

4.5.1 The Memory Architecture in Windows CE 5.0

In any Microsoft Windows CE-based device, ROM stores the entire operating system (OS), as well as the applications that come with the OS design. If a module is not compressed, the OS executes the ROM-based module in place. If the OS compresses a ROM-based module, it decompresses the module and pages it into RAM. The OS loads all read/write data into RAM. The OEM controls the option to enable compression in ROM.

When the OS executes programs directly from ROM, it saves program RAM and reduces the time needed to start an application, because the OS does not have to copy the program into RAM before launching it. Programs contained in the object store or on a flash memory storage card are not executed in place; instead, the OS pages these programs into RAM and executes them there. Depending on the OEM and the driver options on a specific Windows CE-based device, a program module can be paged on demand.
The OS can bring in one page at a time or load the entire module into RAM at once.

The RAM on a Windows CE-based device is divided into two areas: the object store and the program memory.

- The object store resembles a permanent, virtual RAM disk. Data in the object store is retained when you suspend or perform a soft reset of the system. Devices typically have a backup power supply for the RAM to preserve data if the main power supply is interrupted temporarily. When operation resumes, the system looks for a previously created object store in RAM and uses it if one is found. Devices that do not have battery-backed RAM can use the hive-based registry to preserve data across boots.
- The program memory consists of the remaining RAM. Program memory works like the RAM in a personal computer: it stores the heaps and stacks of the applications that are running.

The maximum size of the RAM file system is 256 MB, with a maximum size of 32 MB for a single file; a database-volume file, however, has a 16-MB limit. The maximum number of objects in the object store is 4,000,000.

The boundary between the object store and the program RAM is movable. In the Control Panel, a user can modify the system settings to move the boundary, if the OS design provides this option. Under low-memory conditions on some Windows CE-based devices, the OS might prompt the user for permission to take some object store RAM for use as program RAM to meet an application's RAM requirements.

When the Windows CE OS starts, it creates a single 4-gigabyte (GB) virtual address space. The address space is divided into 33 slots, and each slot is 32 megabytes (MB). All processes share the address space. When a process starts, Windows CE selects an open slot for the process in the system's address space. Slot zero is reserved for the currently running process.

Additionally, Windows CE creates a stack for the thread and a heap for the process. Each stack has an initial size of at least 1 kilobyte (KB) or 4 KB, which is committed on demand. The amount of stack space reserved by the system, on a per-process basis, is specified in the /STACK option of the linker. Because the stack size is CPU-dependent, the system on some devices allocates 4 KB for each stack. The maximum number of threads depends on the amount of available memory. You can allocate additional memory, outside of the 32 MB assigned to each slot, by using memory-mapped files or by calling the VirtualAlloc function.

The following illustration shows how memory is allocated in the Windows CE address space.

Figure 4.8 Memory allocation in the Windows CE address space

When a process initializes, the OS maps the following DLLs and memory components:
- Some execute-in-place (XIP) dynamic-link libraries (DLLs)
- Some read/write sections of other XIP DLLs
- All non-XIP DLLs
- The stack
- The heap
- The data section for each process, in the slot assigned to the process

DLLs and ROM DLL read/write sections are loaded at the top of the slot. DLLs are controlled by the loader, which loads each DLL at the same address for every process. The stack, the heap, and the executable (.exe) file are created and mapped from the bottom up. The bottom 64 KB of memory always remains free [13].

4.5.1.1 Virtual Memory

Windows CE implements a paged virtual memory management system similar to other Microsoft Windows-based desktop platforms. A page is always made up of 4,096 bytes (4 KB).
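As a quick run-time sanity check, an application can confirm the page size with the standard Win32 GetSystemInfo call, which Windows CE also provides. This is only a minimal sketch:

    #include <windows.h>

    /* Print the system page size; on the platform discussed here this is
       expected to report 4096 bytes. */
    void ShowPageSize(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        RETAILMSG(1, (L"Page size: %u bytes\r\n", si.dwPageSize));
    }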
Each process in Windows CE has only 32 MB of virtual address space. Your process, its dynamic-link libraries (DLLs), heaps, stacks, and virtual memory allocations all use this address space. By default, memory allocated by VirtualAlloc falls into the virtual address space of the process; however, you can request that the memory be allocated outside the process slot.

4.5.1.2 VirtualAlloc

This function reserves or commits a region of pages in the virtual address space of the calling process. Memory allocated by VirtualAlloc is initialized to zero.

    LPVOID VirtualAlloc(
        LPVOID lpAddress,
        DWORD  dwSize,
        DWORD  flAllocationType,
        DWORD  flProtect
    );

VirtualAlloc can commit a region of pages reserved by a previous call to the VirtualAlloc function. You can use VirtualAlloc to reserve a block of pages and then make additional calls to VirtualAlloc to commit individual pages from the reserved block. This enables a process to reserve a range of its virtual address space without consuming physical storage until it is needed.

Each page in the virtual address space of the process is in one of three states:

- Free, in which the page is neither committed nor reserved and is not accessible to the process. VirtualAlloc can reserve, or simultaneously reserve and commit, a free page.
- Reserved, in which the range of addresses cannot be used by other allocation functions, but the page is not accessible and has no physical storage associated with it. VirtualAlloc can commit a reserved page, but it cannot reserve it a second time. The VirtualFree function can release a reserved page, making it a free page.
- Committed, in which physical storage is allocated for the page and access is controlled by a protection code. The system initializes and loads each committed page into physical memory only at the first attempt to read or write that page. When the process terminates, the system releases the storage for committed pages. VirtualAlloc can commit an already committed page; this means you can commit a range of pages, regardless of whether they have already been committed, and the function will not fail. VirtualFree can decommit a committed page, releasing the page's storage, or it can simultaneously decommit and release a committed page.

If the lpAddress parameter is not NULL, the function uses the lpAddress and dwSize parameters to compute the region of pages to be allocated. The current state of the entire range of pages must be compatible with the type of allocation specified by the flAllocationType parameter; otherwise, the function fails and no pages are allocated. This compatibility requirement does not preclude committing an already committed page.

If you call VirtualAlloc with dwSize >= 2 MB, flAllocationType set to MEM_RESERVE, and flProtect set to PAGE_NOACCESS, it automatically reserves memory at the shared memory region. This preserves per-process virtual memory.

4.5.2 The Method of the Embedded Memory Configuration

In image.cfg (%WINCEROOT%\PLATFORM\Sandgateii_g\Src\Inc):

    #define IMAGE_DISPLAY_RAM_OFFSET 0x07CB8000

Figure 4.9 The memory layout of the Win CE platform:

    0x00000000  Boot loader stack          64 KB
    0x00010000  Boot loader RAM            64 KB
    0x00020000  Boot loader code           256 KB
    0x00060000  OEM logo                   256 KB
    0x000A0000  Reserved for future use    380 KB
    0x000FF000  ARG                        4 KB
    0x00100000  NK (OS image)              48 MB
    0x03100000  RAM                        79 MB
    0x07CB8000  IMAGE_DISPLAY_RAM_OFFSET (display memory)
    0x08000000  End of RAM
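Returning to the VirtualAlloc discussion in section 4.5.1.2: to make the reserve-then-commit pattern concrete, here is a minimal sketch. The region size and the use made of the committed page are arbitrary choices for illustration:

    #include <windows.h>

    /* Reserve 1 MB of address space, then commit only the first 4-KB page.
       The reserved-but-uncommitted pages consume no physical storage. */
    void ReserveThenCommit(void)
    {
        DWORD  dwRegion = 1024 * 1024;   /* 1 MB reservation */
        LPVOID pBase = VirtualAlloc(NULL, dwRegion,
                                    MEM_RESERVE, PAGE_NOACCESS);
        if (pBase == NULL)
            return;

        /* Commit the first page and make it readable/writable. */
        LPVOID pPage = VirtualAlloc(pBase, 4096,
                                    MEM_COMMIT, PAGE_READWRITE);
        if (pPage != NULL)
            *(DWORD *)pPage = 0x1234;    /* first touch loads the page */

        /* Decommit and release the whole region when finished. */
        VirtualFree(pBase, 0, MEM_RELEASE);
    }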
4.5.3 Quality Fine-Tuning

In Ldecod.cpp and JMPlayerDlg.cpp:

    input->R_decoder = 500000;   // Decoder rate
    input->B_decoder = 104000;   // Decoder buffer size
    input->F_decoder = 73000;    // Decoder initial delay

In the video decoder module, the timer setting:

    void CPJMPlayerDlg::OnButton1()
    {
        SetTimer(123, 20, NULL);   // set the decode timer to 20 ms
    }

During fine-tuning, the decode timer period was reduced step by step: 100 ms → 50 ms → 20 ms → 10 ms.

4.6 Windows CE Performance Monitor

4.6.1 Performance Profiling

With the remote tools for performance profiling, you can use a development workstation to remotely monitor performance criteria on a Microsoft® Windows® CE-based target device. Remote Performance Monitor is the remote tool that provides a variety of monitoring charts, logs, and viewers that allow you to measure performance on a target device.

Remote Performance Monitor is a graphical tool for measuring the performance of a Microsoft® Windows® CE-based OS design. With the tool, you can observe the behavior of performance objects such as CPUs, threads, processes, and system memory. Each performance object has an associated set of performance counters that provide information about device usage, queue lengths, and delays, as well as information used to measure throughput and internal congestion.

Remote Performance Monitor can track current activity on a target device and can also display data from a log file. To display information from a target device in Remote Performance Monitor, you must first connect to the target device with Platform Manager. A performance counter tracks a single metric on your target device. To get information on a performance object, call the function corresponding to the performance object; the data structure the function receives provides the statistics for that performance object [14].

4.6.2 Chart View Window

Remote Performance Monitor supports a Chart view that allows you to monitor performance in real time. The Chart View window lets you select the categories of statistics you want to monitor and the type of chart you want displayed. When you choose a category, the Remote Performance Monitor tool automatically updates the available options in that category.

4.6.3 Observe JMPlayer Performance

You can use the Remote Performance Monitor tool to display statistics that describe the performance of a target device in real time and to obtain information about resource use. You can also configure the tool to alert you when the target device meets a specified condition.

Step 1. Open the CEPB (Platform Builder) IDE.
Step 2. If you have not built your run-time image yet, build the run-time image with KITL mode.
Step 3. Establish a hardware connection between your CEPB and the target platform; then configure Platform Builder to download the run-time image (with KITL mode) to the target device over the established connection.
- If your target device is a CEPC, you must boot a run-time image on the CEPC.
- If your target platform is custom hardware, use the steps for booting a run-time image on a CEPC as a model for establishing a connection to your custom hardware.
Figure 4.10 Win CE platform monitor environment (host PC running WinCE Platform Builder, Windows eMbedded VC++, and the Windows CE Remote Performance Monitor, connected through a hub to the target running WinCE V5.0 in KITL mode on Intel PXA270 & 2700G with the JMPlayer H.264 player)

For more information, see section 4.1, Win CE Platform Development Environment.

Step 4. Download the run-time image to the target platform.
Step 5. Open the Remote Performance Monitor tool and configure the connection from the tool to the target platform. When you configure the connection, perform the following steps:
- In the Transport box, choose KITL Transport for Windows CE and then choose Configure.
- In the Named connection box, choose the named connection you used in the CEPB IDE to connect to the target device.
- In the Startup Server box, choose CESH Server for Windows CE and then choose Configure.
- Choose Directory containing last image booted by Platform Builder.
Step 6. Connect the Remote Performance Monitor tool to the target device.
Step 7. Open the Chart view window, which Remote Performance Monitor uses to display data from the target device in real time.
Step 8. Configure the Chart view window to display one or more statistics from the target device. For example, to display a process's processor use, do the following:
- In the Object box, choose CE Process Statistics.
- From the Counter list, choose % Processor Time.
Note: % Processor Time is the percentage of elapsed time that all the threads of this process used the processor to execute instructions.

Figure 4.11 Add "JMPlayer.exe % Processor Time" to the chart

Step 9. After you add a statistic to the chart, if you are not satisfied with the appearance of the line that plots the statistic, change the appearance of the line.
Step 10. If you are not satisfied with the appearance or behavior of the chart, modify the appearance or behavior.
Step 11. Open the Alert view window, which Remote Performance Monitor uses to notify you when a specific condition occurs on the target device.
Step 12. Configure the Alert view window to notify you when the specified conditions for JMPlayer.exe occur on the target platform.

4.7 Results

After the configuration in section 4.6, measuring the JMPlayer.exe processor time via the Windows CEPB remote tool (zoomed-in performance monitor) gave the results shown in Figures 4.12 and 4.13.

Figure 4.12 The monitor result of JMPlayer ver. 1.0b (72 sec)

Figure 4.13 The monitor result of JMPlayer ver. 1.0c (65 sec)

4.7.1 The Resource/Memory Use of the JMPlayer

Table 4.1 The resource/memory use of the JMPlayer

    JMPlayer version    Memory/resource use of the embedded platform
    V1.0b               96%
    V1.0c               98%

As Table 4.1 shows, JMPlayer ver. 1.0b used 96% of the memory/resources of the embedded platform, while JMPlayer ver. 1.0c used 98%.

4.7.2 The Quality of the JMPlayer

We played the same H.264/AVC video sequence, "foreman", on both versions.

Table 4.2 The quality of the JMPlayer

    JMPlayer version    Quality of the JMPlayer
    V1.0b               Lag
    V1.0c               Smooth

CHAPTER 5
CONCLUSIONS

Video conferencing via the Internet is becoming more widely used and may gain further acceptance with increases in processor and connection performance. Application trends fall into two areas:
1. Very low power, very low bandwidth video for embedded systems.
2. High bandwidth, high quality video coding.
As Table 4.2 shows, JMPlayer ver. 1.0b gave laggy playback and took 72 seconds, while JMPlayer ver. 1.0c gave smooth playback and took 65 seconds.

The same "foreman" video sequence, in H.264 format, was only 330 KB in size. In this thesis, the H.264/AVC video decoder application (JMPlayer) uses the Baseline profile, which is appropriate for video conferencing. On the PXA270/2700G platform, JMPlayer.exe V1.0c played the same H.264 "foreman" sequence with smooth quality, but consumed more of the embedded platform's memory/resources. On other embedded platforms, such as VIA C7, TI, or MIPS, the same methods could also be used to obtain good performance (high playback quality) from the H.264/AVC video decoder.

REFERENCES

[1] Joint Video Team of ITU-T and ISO/IEC JTC 1, "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)," Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, Mar. 2003.
[2] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560-576, July 2003.
[3] G. J. Sullivan and T. Wiegand, "Video compression - from concepts to the H.264/AVC standard," Proceedings of the IEEE, vol. 93, no. 1, pp. 18-31, Jan. 2005.
[4] J. Golston and A. Rao, "Video codecs tutorial: trade-offs with H.264, VC-1 and other advanced codecs," Texas Instruments White Paper, Mar. 2006.
[5] J. Maller, "FXScript Reference: RGB and YUV Color," http://www.joemaller.com
[6] S. Saha, "Image compression - from DCT to wavelets: a review," ACM Crossroads, http://www.acm.org/crossroads/xrds6-3/sahaimgcoding.html
[7] T. Hu and D. Wu, "Design of single scalar DSP based H.264/AVC decoder," Mar. 2005.
[8] I. E. G. Richardson, Video Codec Design: Developing Image and Video Compression Systems, John Wiley & Sons, Ltd., June 2002.
[9] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, John Wiley & Sons, Ltd., Dec. 2003.
[10] HHI Image Communication, "Scalable extension of H.264/AVC," http://ip.hhi.de/imagecom_G1/savce/index.htm
[11] ITU-T, Joint Video Team (JVT), http://www.itu.int/ITU-T/studygroups/com16/jvt
[12] TML, H.264/AVC reference software JM 10.1, http://iphome.hhi.de/suehring/tml/
[13] D. Boling, Programming Microsoft Windows CE .NET, 3rd ed., Microsoft Press, 2003.
[14] Help file of Platform Builder for Microsoft Windows CE 5.0, Microsoft Corp.
[15] Intel® XScale PXA27x Developer's Manual, Document No. 280000-002, Intel Corp., Apr. 2004.
[16] Intel® XScale PXA27x Design Guide, Document No. 280001-001, Intel Corp., May 2005.
[17] Intel® 2700G7 Multimedia Accelerator Datasheet, Document No. 304430-001, Intel Corp., Nov. 2004.
[18] Intel® 2700G Multimedia Accelerator Video Acceleration API Software Developer's Guide, Document No. 300950-001, Intel Corp., Jan. 2005.

APPENDIX - Program Flow [12]

Appendix 2 - Decode one frame (flow chart: setup → read_new_slice → frame/field decision → decode_frame_slice or decode_field_slice; slices are read and decoded until the next header is a start of picture (SOP); then deblock_frame → exit_frame outputs one YCbCr frame).
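In C terms, the Appendix 2 flow corresponds to a loop of roughly the following shape. This is only a sketch modeled on the structure of the JM reference decoder; the helper names follow the flow chart and are declared as stand-ins, not the exact JM API:

    /* Sketch of the "decode one frame" flow from Appendix 2. */
    typedef enum { SOS, SOP, EOS } HeaderType;  /* slice/picture/stream markers */

    extern HeaderType read_new_slice(void);     /* stand-in declarations */
    extern int  slice_is_frame_coded(void);
    extern void setup_frame(void);
    extern void decode_frame_slice(void);
    extern void decode_field_slice(void);
    extern void deblock_frame(void);
    extern void exit_frame(void);

    void decode_one_frame_sketch(void)
    {
        HeaderType header;
        setup_frame();
        do {
            header = read_new_slice();     /* parse the next slice header */
            if (slice_is_frame_coded())    /* frame/field decision        */
                decode_frame_slice();
            else
                decode_field_slice();
        } while (header != SOP);           /* stop at the next picture    */
        deblock_frame();                   /* in-loop deblocking filter   */
        exit_frame();                      /* output the YCbCr frame      */
    }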
Appendix 3 - Read new slice (flow chart: setup → GetAnnexbNALU if the stream is in Annex B format, otherwise GetRTPNALU → NALUtoRBSP → dispatch on the NALU type (IDR, DPA, DPB, DPC, SEI, PPS, SPS, PD, EOSEQ, EOSTREAM, FILL) to the corresponding read function → freeNALU → exit).

Appendix 4 - Decode one slice (flow chart: setup → set_ref_pic_num → start_mb → read_one_mb → dec_one_mb → exit_mb → exit).

Appendix 5 - Read macroblock (flow chart: setup → read_mb_mode → interpret the macroblock mode according to the slice type (P, I, B, SP, or SI); if the mode is P8x8, read the 8x8 sub-partition modes; select the read function according to the slice and entropy coding (I/SI slice, CABAC, or VLC non-intra); init_mb; handle DIRECT, COPY, and IPCM macroblocks; read the intra prediction modes or the motion information from the NAL; then read the IPCM samples or the CBP and coefficients from the NAL (readIPCMcoeffsFromNAL, readCBPandCoeffsFromNAL) → exit).
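As an illustration of the Appendix 3 dispatch, the following sketch follows the NAL unit type codes used by the JM reference software (which match the H.264 standard's nal_unit_type values); the handler actions are shown only as comments, since the real read functions take the decoder's state:

    /* Sketch of the "read new slice" NALU dispatch from Appendix 3. */
    typedef enum {
        NALU_TYPE_SLICE    = 1,   /* non-IDR coded slice        */
        NALU_TYPE_DPA      = 2,   /* data partition A           */
        NALU_TYPE_DPB      = 3,   /* data partition B           */
        NALU_TYPE_DPC      = 4,   /* data partition C           */
        NALU_TYPE_IDR      = 5,   /* IDR coded slice            */
        NALU_TYPE_SEI      = 6,   /* supplemental enhancement   */
        NALU_TYPE_SPS      = 7,   /* sequence parameter set     */
        NALU_TYPE_PPS      = 8,   /* picture parameter set      */
        NALU_TYPE_AUD      = 9,   /* access unit delimiter (PD) */
        NALU_TYPE_EOSEQ    = 10,  /* end of sequence            */
        NALU_TYPE_EOSTREAM = 11,  /* end of stream              */
        NALU_TYPE_FILL     = 12   /* filler data                */
    } NaluType;

    void dispatch_nalu(NaluType t)
    {
        switch (t)
        {
        case NALU_TYPE_IDR:      /* read the IDR slice            */ break;
        case NALU_TYPE_SLICE:    /* read a non-IDR slice          */ break;
        case NALU_TYPE_DPA:
        case NALU_TYPE_DPB:
        case NALU_TYPE_DPC:      /* read a data partition         */ break;
        case NALU_TYPE_SEI:      /* read the SEI message          */ break;
        case NALU_TYPE_SPS:      /* read the sequence param set   */ break;
        case NALU_TYPE_PPS:      /* read the picture param set    */ break;
        default:                 /* AUD, EOSEQ, EOSTREAM, filler  */ break;
        }
    }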