Implementation of the H.264/AVC Video Decoder on Intel PXA270/2700G Platform (以 Intel PXA270/2700G 實現 H.264/AVC 視訊解碼器)
Transcription
Student: CHEN, CHAO-TING (陳昭廷)
Advisor: Prof. Chau-Yun Hsu (許超雲)

Thesis for Master of Science
Institute of Communication Engineering, Tatung University
July 31, 2008

ACKNOWLEDGMENTS

Completing this thesis has been a great pleasure. I must first thank my advisor, Prof. Chau-Yun Hsu, who guided me tirelessly throughout the research in every season and all weather, who instilled many concepts in me and offered research ideas, and who pushed me toward ever better goals to reach today's result. I also thank my parents for raising me, giving me a good education, and providing a comfortable environment in which to learn and grow. It is thanks to my advisor's constant urging that this thesis could be completed smoothly, and I offer him my deepest gratitude.

In addition, I thank my seniors 郭宗勝 and 陳錫銘 of the Tatung Central Research Institute, who so often found time in their busy schedules to give me valuable advice, and who offered timely help and encouragement whenever I ran into problems with the thesis or felt discouraged.

Finally, I thank my parents, my family, my friends, and the members of the Hsu Group for their support and devotion, which allowed me to finish my master's degree free of worries and to end my student days at Tatung University on a perfect note.

CHINESE ABSTRACT

Because H.264/AVC video compression offers low bit rates and high picture quality, H.264/AVC video decoding plays a very important role in recent mobile multimedia products. However, H.264/AVC decoding has high computational complexity and accordingly demands a large amount of computation. This thesis describes the process of porting an H.264/AVC video decoder to the Intel PXA270/2700G platform, and proposes methods of memory use and layout that improve the performance and quality of the decoder. Finally, by adopting these methods, we improve the performance and quality of the H.264/AVC video decoder JMPlayer.

ABSTRACT

In mobile multimedia products, H.264/AVC video compression plays an important role due to its low bit rate and high quality. However, an H.264/AVC video decoder consumes considerable power because of its high computational complexity. This thesis describes the process of porting an H.264/AVC video decoder to the Intel PXA270/2700G platform. We propose methods of memory allocation and configuration to improve the performance and quality of the H.264/AVC video decoder (JMPlayer). Finally, by adopting these methods, we improve the performance and quality of the JMPlayer decoder realized on the Intel PXA270/2700G platform.

CONTENTS

Acknowledgments
Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Motivation
  1.2 Objective
Chapter 2 H.264/AVC Overview
  2.1 Video Compression
  2.2 MPEG and H.26x History
  2.3 H.264/AVC Video Decoder Data Flow
Chapter 3 Development Platform Overview
  3.1 Microsoft Windows CE Overview
  3.2 Intel PXA270/2700G Hardware Architecture
Chapter 4 Development Description
  4.1 Win CE Platform Development Environment
  4.2 The Development Flow of the Video Decoder
  4.3 The Components of the Video Decoder - JMPlayer
  4.4 Program
  4.5 Improvement
  4.6 Windows CE Performance Monitor
  4.7 Results
Chapter 5 Conclusions
References
Appendix - Program Flow

LIST OF FIGURES

Figure 2.1 (a) R, G, B components of a color image
Figure 2.1 (b) Cr, Cg, Cb components of a color image
Figure 2.2 Video sampling modes
Figure 2.3 Spatial and temporal redundancy
Figure 2.4 Video CODEC concepts
Figure 2.5 Motion estimation
Figure 2.6 Sample blocks
Figure 2.7 Motion vectors
Figure 2.8.1 Block size effects on motion estimation - Part 1
Figure 2.8.2 Block size effects on motion estimation - Part 2
Figure 2.9 Sub-pixel interpolation
Figure 2.10 Discrete cosine transform
Figure 2.11 Example of a quantization matrix
Figure 2.12 Zig-zag scan for a 16x16 macroblock
Figure 2.13 Evolution of video coding standards
Figure 2.14 H.264/AVC decoder data flow
Figure 3.1 Windows CE operating system structure
Figure 3.2 Basic GWES structure
Figure 3.3 PXA270 hardware architecture
Figure 3.4 PXA270/2700G system architecture block diagram
Figure 4.1 Win CE platform development environments
Figure 4.2 PXA270/2700G development environment - Ethernet card
Figure 4.3 (a) PC side - TCP/IP setting
Figure 4.3 (b) Target platform side - TCP/IP setting
Figure 4.4 The development flow of the H.264/AVC video decoder
Figure 4.5 Win CE platform block diagram, the components of the JMPlayer
Figure 4.6 Video data stream flow
Figure 4.7 JMPlayer.exe on the PXA270/2700G platform
Figure 4.8 Memory allocations in the Windows CE address space
Figure 4.9 The memory layout of the Win CE platform
Figure 4.10 Win CE platform monitor environments
Figure 4.11 Adding "JMPlayer.exe % Processor time" to the chart
Figure 4.12 The monitor result of JMPlayer ver. 1.0b
Figure 4.13 The monitor result of JMPlayer ver. 1.0c

LIST OF TABLES

Table 2.1 Comparison of the video coding standards
Table 3.1 Controls, menus, dialog boxes, and resources supported by Windows CE
Table 3.2 GDI features supported
Table 4.1 The resource/memory usage of the JMPlayer
Table 4.2 The quality of the JMPlayer

CHAPTER 1 INTRODUCTION

1.1 Motivation

The challenge of digital video is that raw, uncompressed video requires a great deal of data to store or transmit. For example, high-definition TV video is typically digitized at 1920x1080 using 4:2:2 YCrCb at 30 frames per second, which requires a data rate of over 1.39 Gbps [4]. This is many times more than what can be sustained on broadband networks such as ADSL or wireless WiFi, which today offer between 1 and 10 Mbps of sustained throughput. Clearly, compression is needed to store or transmit digital video.

1.2 Objective

Several suitable single processors are currently on the market. In this thesis, we use the Intel PXA270/2700G embedded platform to implement a video decoder. The PXA270/2700G is inexpensive and occupies a small area, but its performance for video processing is limited. The purpose of this thesis is to implement an H.264/AVC decoder on the Intel PXA270/2700G platform. The first phase of the work surveys the decoder on an x86 platform to obtain the decoding flow of H.264/AVC; the hardware video accelerator, the Intel 2700G, is then introduced to relieve the computational bottleneck. Meanwhile, scheduling and memory allocation methods are also considered for the best utilization of the data memories. The final product is expected to support H.264/AVC decoding of 176x144 QCIF video; we then take measurements to evaluate the performance of the H.264/AVC video decoder (JMPlayer).

CHAPTER 2 H.264/AVC OVERVIEW

2.1 Video Compression

Digital video compression technology has been booming for many years.
Today, when people chat with their friends through video telephony, or enjoy movies broadcast over the Internet and digital music such as MP3, the convenience that the digital video industry brings cannot be overlooked. All of this can be attributed to advances in mass storage media and streaming video/audio services, which have deeply influenced our daily lives.

2.1.1 Color Spaces: RGB and YUV

RGB is a very common color representation in computer graphics. An image can be thought of as consisting of three grayscale components (sometimes referred to as channels) [5]. R, G and B represent the light colors red, green and blue respectively. It is common knowledge that combining red, green and blue with different weights can produce any visible color; a numerical value indicates the proportion of each color.

The drawback of the RGB representation of a color image is that the three colors are equally important and must be stored with the same number of data bits. There is, however, another color representation, known as YUV, that can represent a color image more efficiently. Instead of using the colors of the light, YUV chooses the luminance (Y) and chrominance (UV) of the light to represent a color image. YUV uses RGB information, but it creates a black-and-white image (luma) from the full color image and then subtracts the three primary colors, resulting in additional signals (chroma: Cb, Cr) that describe the color. Combining the signals back together results in a full color image [4].

The luminance information Y can be calculated from R, G and B according to the following equation:

  Y = kr R + (1 - kb - kr) G + kb B    (2.1)

where the k's are weighting factors with kb + kr + kg = 1. The color difference (chroma) information can be derived as:

  Cb = 0.5 (B - Y) / (1 - kb)    (2.2)
  Cr = 0.5 (R - Y) / (1 - kr)    (2.3)
  Cg = G - Y                     (2.4)

In practice, only three components (Y, Cb and Cr) need to be transmitted for video coding, because Cg can be derived from Y, Cb and Cr.

ITU-R recommendation BT.601 [1] sets kb = 0.114 and kr = 0.299. The equations above can then be rewritten as:

  Y  = 0.299 R + 0.587 G + 0.114 B    (2.5)
  Cb = 0.564 (B - Y)                  (2.6)
  Cr = 0.713 (R - Y)                  (2.7)
  R  = Y + 1.402 Cr                   (2.8)
  G  = Y - 0.344 Cb - 0.714 Cr        (2.9)
  B  = Y + 1.772 Cb                   (2.10)

In practice, images are treated as 2D arrays, so R, G and B in the equations above are matrices, as are Y, U and V. Figure 2.1(a) shows the red, green and blue components of a color image, compared with the chroma components Cr, Cg and Cb in Figure 2.1(b).

[Figure 2.1 (a) R, G, B components of a color image]
[Figure 2.1 (b) Cr, Cg, Cb components of a color image]
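To make the BT.601 arithmetic concrete, here is a minimal C sketch that applies equations (2.5)-(2.7) to one 8-bit RGB pixel. It is an illustration of the weights above, not code from the JMPlayer decoder; the clamping helper and the +128 chroma offset (used so the signed chroma values fit in unsigned bytes, as 8-bit video formats commonly store them) are our own choices.

    /* Convert one 8-bit RGB pixel to YCbCr with the BT.601 weights
     * of equations (2.5)-(2.7). Illustrative sketch only. */
    static unsigned char clamp255(double v)
    {
        if (v < 0.0)   return 0;
        if (v > 255.0) return 255;
        return (unsigned char)(v + 0.5);    /* round to nearest */
    }

    void rgb_to_ycbcr(unsigned char r, unsigned char g, unsigned char b,
                      unsigned char *y, unsigned char *cb, unsigned char *cr)
    {
        double Y = 0.299 * r + 0.587 * g + 0.114 * b;   /* eq. (2.5) */
        *y  = clamp255(Y);
        *cb = clamp255(128.0 + 0.564 * (b - Y));        /* eq. (2.6) */
        *cr = clamp255(128.0 + 0.713 * (r - Y));        /* eq. (2.7) */
    }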
2.1.2 Video Sampling

The video source is normally a bit stream consisting of a series of frames or fields in decoding order [1]. Three YCbCr sampling modes are supported by MPEG-4 and H.264, as shown in Figure 2.2.

[Figure 2.2 Video sampling modes: (a) 4:2:0, (b) 4:2:2, (c) 4:4:4]

4:2:0 is the most commonly used sampling pattern. The sampling interval of the luminance samples Y is the same as in the video source, which means every pixel position is sampled. Cb and Cr have twice the sampling interval of the luminance in both the vertical and horizontal directions, as shown in Figure 2.2(a); every 4 luma samples share one Cb and one Cr sample. Considering that human eyes are more sensitive to luminance than to the color itself (chrominance), the resolution of the chrominance can be reduced without visibly degrading image quality. That is why 4:2:0 is very popular in current video compression standards, as quantified in the sketch below. In 4:2:2 mode, Cb and Cr have the same number of samples as the luma vertically and half that number horizontally. In 4:4:4 mode, Y, Cb and Cr have the same resolution in both directions.
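The savings from chroma subsampling are easy to quantify. The following C fragment, a small illustration of our own (the function name is hypothetical), computes the bytes needed for one 8-bit 4:2:0 frame at QCIF resolution, the size targeted by this thesis:

    #include <stdio.h>

    /* One 8-bit 4:2:0 frame: a full-resolution Y plane plus two chroma
     * planes subsampled 2:1 both horizontally and vertically. */
    static unsigned long yuv420_frame_bytes(unsigned w, unsigned h)
    {
        unsigned long luma   = (unsigned long)w * h;
        unsigned long chroma = (unsigned long)(w / 2) * (h / 2);
        return luma + 2 * chroma;
    }

    int main(void)
    {
        /* QCIF 176x144: 25344 + 2*6336 = 38016 bytes per frame,
         * half of the 76032 bytes a 4:4:4 frame would require. */
        printf("%lu\n", yuv420_frame_bytes(176, 144));
        return 0;
    }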
2.1.3 Reducing Redundancy

The basic idea of video compression is to compact an original video sequence (raw video) into a smaller one with fewer bits, such that the video can be recovered by reverse operations without significant loss of visual information. Compression is achieved by removing redundant information from the raw video sequence. There are three types of redundancy: temporal, spatial and frequency-domain redundancy.

2.1.3.1 Spatial and temporal redundancy

Pixel values are not independent; they are correlated with their neighbors both within the same frame and across frames. For example, if a large area in a frame shows very little variation, there is spatial redundancy between the adjacent pixels, and to some extent the value of a pixel is predictable given the values of its neighbors. A similar situation exists in the time domain: in most movies there is very little difference between consecutive frames, except when the object or content of the video changes quickly. This is known as temporal redundancy (see Figure 2.3).

[Figure 2.3 Spatial and temporal redundancy]

2.1.3.2 Frequency-domain redundancy

The human eye and brain (the human visual system) are more sensitive to lower frequencies, which means that removing strongly contrasting parts of a picture (such as the edges of objects) will not prevent the human eye from recognizing the picture.

2.1.4 Video CODEC

These redundancies can be removed by different methods. Temporal and spatial redundancy is usually reduced by motion estimation (and compensation), while frequency redundancy is reduced by the discrete cosine transform and quantization. After these operations, entropy coding can be applied to the resulting data to achieve further compression. Figure 2.4 illustrates a common video coding flow.

[Figure 2.4 Video CODEC concepts]

In the following sections, each function block is addressed in the order in which it appears in the video coding process [2].

2.1.5 Motion Estimation

The input to the coding system is an uncompressed video sequence, and motion estimation tries to exploit the similarities between successive video frames. As shown in Figure 2.5, for a given area in the current frame, if there is a corresponding area in a neighboring frame that is very similar to it, only the difference between the two regions needs to be coded and transmitted, not the complete information of the given area. The difference, also called the residual, is produced by subtracting the matched region from the current region.

[Figure 2.5 Motion estimation]

The basic idea of prediction is that a given area can be recovered jointly from the residual and the matched region (the decoder adds the residual to the prediction). Since many values within the residual are zero, temporal redundancy is reduced in this way. Multiple frames preceding and/or following the current frame can be used as references for the current frame.

In practice, motion estimation and compensation are usually based on rectangular blocks (MxN or NxN). The most common block size is 16x16 for the luminance component and 8x8 for the chrominance components. A 16x16 pixel region called a macroblock is the basic data unit for motion compensation in current video coding standards (the MPEG and ITU-T series). It consists of one 16x16 luminance sample block, one 8x8 Cb sample block and one 8x8 Cr sample block (see Figure 2.6).

[Figure 2.6 Sample blocks]

Theoretically, the smaller the block size, the better the motion estimation performance, so in the most recent standard, H.264/AVC, the size of the data unit for motion estimation is more flexible, with the minimum data unit going down to 4x4 [3].

2.1.6 Motion Vectors

As Figure 2.7 shows, a motion vector is a pair of values (dx, dy) indicating the position offsets, in both the vertical and horizontal directions, of the current macroblock relative to its best matching region. The motion vector is encoded and transmitted together with the residual.

[Figure 2.7 Motion vectors]

During decoding, the residual is added to the matching region to recover the current frame. With the help of the motion vectors, the matching region can be located in the reference frame.

2.1.7 Block Size Effect

Figure 2.8.1 shows three block sizes applied to one frame. Pictures (a) and (b) are the original frames, the previous frame n-1 and the current frame n; pictures (c) and (d) are the gray versions of (a) and (b). Pictures (e), (f) and (g) are derived from picture (c) with block sizes of 4x4, 8x8 and 16x16 respectively. Picture (d) is subtracted from picture (g) with motion compensation to produce the residual picture (j). The energy in the residual is reduced by motion compensating each 16x16 macroblock (see Figure 2.8.2 (j)). Motion compensating each 8x8 block reduces the residual energy further (see Figure 2.8.2 (i)), and motion compensating each 4x4 block gives the smallest residual energy of all (see Figure 2.8.2 (h)).

These examples show that smaller motion compensation block sizes can produce better motion compensation results. However, a smaller block size leads to increased complexity (more search operations must be carried out) and an increase in the number of motion vectors that need to be transmitted. Sending each motion vector costs bits, and the extra overhead for vectors may outweigh the benefit of reduced residual energy. An effective compromise is to adapt the block size to the picture characteristics. Obviously, the more mid-grey area in the residual, the more redundant information has been removed. To achieve higher compression efficiency, H.264/AVC chooses smaller block sizes for motion estimation; but as the redundant information within the residual decreases, more motion vectors must be encoded and transmitted. H.264/AVC therefore supports changing the block size dynamically according to the content of the frame.

[Figure 2.8.1 Block size effects on motion estimation - Part 1: (a) previous frame n-1, (b) current frame n, (c)(d) gray frames from (a) and (b), (e) MC block size 4x4, (f) MC block size 8x8, (g) MC block size 16x16]

[Figure 2.8.2 Block size effects on motion estimation - Part 2: residuals with motion compensation at block sizes 4x4 (h), 8x8 (i) and 16x16 (j). MC: motion compensation. The frames are from the "Superman Returns" trailer.]
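To make the search described above concrete, the sketch below shows a full-search block matcher for one 16x16 macroblock: every candidate offset within +/-R pixels is scored with the sum of absolute differences (SAD), and the offset with the smallest SAD becomes the motion vector. This is a generic textbook formulation under our own naming, not the search used in the JM reference software.

    #include <limits.h>
    #include <stdlib.h>

    /* cur and ref are W x H luma planes; (mx,my) is the top-left corner of
     * the current macroblock. Returns the best SAD and writes the motion
     * vector to (*dx,*dy); assumes at least one candidate fits the frame. */
    long block_match_16x16(const unsigned char *cur, const unsigned char *ref,
                           int W, int H, int mx, int my, int R,
                           int *dx, int *dy)
    {
        long best = LONG_MAX;
        int vx, vy, i, j;

        for (vy = -R; vy <= R; vy++) {
            for (vx = -R; vx <= R; vx++) {
                int x0 = mx + vx, y0 = my + vy;
                long sad = 0;
                if (x0 < 0 || y0 < 0 || x0 + 16 > W || y0 + 16 > H)
                    continue;      /* keep the candidate inside the frame */
                for (j = 0; j < 16; j++)
                    for (i = 0; i < 16; i++)
                        sad += labs((long)cur[(my + j) * W + (mx + i)]
                                  - (long)ref[(y0 + j) * W + (x0 + i)]);
                if (sad < best) { best = sad; *dx = vx; *dy = vy; }
            }
        }
        return best;
    }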
2.1.8 Sub-pixel Interpolation

The accuracy of motion compensation is measured in units of the distance between pixels. If the motion vector points to an integer-sample position, the prediction signal consists of the corresponding samples of the reference picture; otherwise, the corresponding sample is obtained by interpolating the non-integer positions [6]. Non-integer-position interpolation gives the encoder more choices when searching for the best matching region than integer motion estimation does, so the redundancy in the residual can be reduced further.

[Figure 2.9 Sub-pixel interpolation]

2.1.9 Discrete Cosine Transform

After motion estimation, the residual data can be converted into another domain (the transform domain) in order to minimize the frequency redundancy. Most transforms are block based, such as the Karhunen-Loeve transform (KLT), singular value decomposition (SVD) and the discrete cosine transform (DCT) [9]. We will only touch on the DCT in this thesis.

[Figure 2.10 Discrete cosine transform]

The discrete cosine transform (DCT) helps separate the image into parts (or spectral sub-bands) of differing importance with respect to the image's visual quality. The DCT is similar to the discrete Fourier transform: it transforms a signal or image from the spatial domain to the frequency domain. The DCT operates on an NxN sample block X in the form:

  Y = A X A^T

and the inverse DCT has the form:

  X = A^T Y A

where A is the NxN transform matrix and Y is the resulting sample block in the frequency domain. The elements of A are

  A_ij = C_i cos[(2j + 1) i pi / 2N]

where C_i = sqrt(1/N) for i = 0 and C_i = sqrt(2/N) for i > 0.

The general equations for the 2D DCT of N data items are:

  Y_xy = C_x C_y sum_{i=0..N-1} sum_{j=0..N-1} X_ij cos[(2j + 1) y pi / 2N] cos[(2i + 1) x pi / 2N]
  X_ij = sum_{x=0..N-1} sum_{y=0..N-1} C_x C_y Y_xy cos[(2j + 1) y pi / 2N] cos[(2i + 1) x pi / 2N]

The basic operation of the DCT is as follows: the input image is N by N; X(i, j) is the intensity of the pixel in row i and column j; Y(x, y) is the DCT coefficient matrix of the image in the DCT domain. Y(0, 0), the coefficient in the upper left corner, is defined as the DC coefficient, and all the rest are defined as AC coefficients.
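The matrix form Y = A X A^T translates directly into code. The following C sketch builds the 8x8 transform matrix A from the definition of A_ij above and applies the two matrix products; it is written for readability (real codecs, including H.264/AVC with its 4x4 integer transform, use fast fixed-point variants instead).

    #include <math.h>

    #define N 8
    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Forward 8x8 DCT: Y = A * X * A^T,
     * A[i][j] = Ci * cos((2j+1) * i * pi / 2N). Y[0][0] is the DC term. */
    void dct8x8(const double X[N][N], double Y[N][N])
    {
        double A[N][N], T[N][N];
        int i, j, k;

        for (i = 0; i < N; i++) {
            double Ci = (i == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);
            for (j = 0; j < N; j++)
                A[i][j] = Ci * cos((2 * j + 1) * i * M_PI / (2.0 * N));
        }
        for (i = 0; i < N; i++)          /* T = A * X */
            for (j = 0; j < N; j++) {
                T[i][j] = 0.0;
                for (k = 0; k < N; k++)
                    T[i][j] += A[i][k] * X[k][j];
            }
        for (i = 0; i < N; i++)          /* Y = T * A^T */
            for (j = 0; j < N; j++) {
                Y[i][j] = 0.0;
                for (k = 0; k < N; k++)
                    Y[i][j] += T[i][k] * A[j][k];   /* A^T[k][j] = A[j][k] */
            }
    }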
2.1.10 Quantization

After the DCT, quantization is employed to truncate the magnitudes of the DCT coefficients in order to reduce the number of bits that represent them. Quantization can be performed on each individual coefficient, which is known as scalar quantization (SQ), or on a group of coefficients together, which is known as vector quantization (VQ) [6]. A general example of quantization:

  FQ = round(X / QP)    (2.11)
  Y  = FQ * QP          (2.12)

The input value X is scaled by QP and rounded to the nearest integer. This operation is normally not reversible: some information is lost during rounding, and it is impossible to recover X to its original value. In video coding, quantization is often performed as vector quantization, which means X, QP and Y are matrices; for matrix quantization, equations 2.11 and 2.12 are still applicable. A typical 8x8 quantization matrix is shown in Figure 2.11. The coefficients close to the lower right corner of the quantization matrix are larger than those close to the upper left corner, because the intent is to quantize the high-frequency components of X more coarsely than the low-frequency components, discarding part of the high-frequency information.

[Figure 2.11 Example of a quantization matrix]

2.1.11 Zig-Zag Scan

After quantization, most of the non-zero DCT coefficients are located close to the upper left corner of the matrix. Through the zig-zag scan, the order of the coefficients is rearranged so that most of the zeros are grouped together in the output data stream. In the following stage, run-length coding, these strings of zeros can be encoded with very few bits.

[Figure 2.12 Zig-zag scan for a 16x16 macroblock]

2.1.12 Run-length Encoding

Run-length coding uses a series of (run, level) pairs to represent a string of data. For example, for the input data array {1, 3, 0, 0, 0, 8, 2, 0, 3, 0, 0, 2, ...} the output (run, level) pairs are (0,1), (0,3), (3,8), (0,2), (1,3), (2,2), .... Here "run" means how many zeros come before the next non-zero value, and "level" is the value of that non-zero datum.
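A minimal C version of this (run, level) scheme is shown below; fed the example array above, it prints (0,1) (0,3) (3,8) (0,2) (1,3) (2,2). The function is our own illustrative helper, not the entropy coder used in JM.

    #include <stdio.h>

    /* Emit (run, level) pairs: run = number of zeros before each
     * non-zero level. A real coder would end with an end-of-block code. */
    static void run_level_encode(const int *in, int n)
    {
        int run = 0, i;
        for (i = 0; i < n; i++) {
            if (in[i] == 0) {
                run++;
            } else {
                printf("(%d,%d) ", run, in[i]);
                run = 0;
            }
        }
        printf("\n");
    }

    int main(void)
    {
        int coeffs[] = {1, 3, 0, 0, 0, 8, 2, 0, 3, 0, 0, 2};
        run_level_encode(coeffs, sizeof coeffs / sizeof coeffs[0]);
        return 0;
    }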
2.1.13 Entropy Coding

The last stage in Figure 2.4 is entropy coding. The entropy encoder compresses the quantized data into a smaller number of bits for transmission. This is achieved by giving each value a unique code word based on the probability that the value occurs in the data stream: the more often a value appears in the stream, the fewer bits are assigned to its code word. The most commonly used entropy encoders are the Huffman encoder and the arithmetic encoder, although for applications requiring fast execution, simple run-length encoding (RLE) has proven very effective [6]. Two advanced entropy coding methods, CAVLC (context-based adaptive variable length coding) and CABAC (context-based adaptive binary arithmetic coding), are adopted by H.264/AVC. These two methods improve coding efficiency compared with the methods applied in previous standards.

2.2 MPEG and H.26x History

2.2.1 ISO/IEC, ITU-T and JVT

ISO/IEC and ITU-T are the two main international standards organizations for the coding of video, audio and their combination. The H.26x family of standards is designed by ITU-T. As the ITU Telecommunication Standardization Sector, ITU-T is a permanent organ of the ITU, responsible for studying technical, operating and tariff questions and issuing Recommendations on them with a view to standardizing telecommunications on a worldwide basis [1]. H.261 was the first version of the H.26x series, started in 1984; in the following years H.262, H.263, H.263+, H.263++ and H.264 were released by ITU-T.

The MPEG family of standards includes MPEG-1, MPEG-2 and MPEG-4, formally known as ISO/IEC 11172, ISO/IEC 13818 and ISO/IEC 14496. MPEG was originally the name given to the group of experts that developed these standards. The MPEG working group (formally ISO/IEC JTC1/SC29/WG11) is part of JTC1, the joint ISO/IEC technical committee on information technology. The Joint Video Team (JVT) consists of members from ISO/IEC JTC1/SC29/WG11 (MPEG) and ITU-T SG16 Q.6 (VCEG); they published the H.264 Recommendation / MPEG-4 Part 10 standard.

2.2.2 MPEG-4

MPEG-4 (ISO/IEC 14496) became an international standard in 1999. Its basic coding theory remains the same as in previous MPEG standards, but it is more network oriented and better suited to broadcast, interactive and conversational environments. MPEG-4 introduced the concept of 'objects': a video object in a scene is an entity that a user is allowed to access (seek, browse) and manipulate (cut and paste). It serves bit rates from 2 Kbit/s for speech and 5 Kbit/s for video up to 5 Mbit/s for transparent-quality video and 64 Kbit/s per channel for CD-quality audio. It also defined profiles and levels as subsets of the entire bit stream syntax.

2.2.3 MPEG-4 Part 10 & H.264/AVC

The newest standard, H.264/AVC (also known as MPEG-4 Part 10), was jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The final approval submission as H.264/AVC [1] was released in March 2003. The motivation for this standard comes from the growing demand for multimedia services and the popularity of HDTV, which need more efficient coding methods. At the same time, various transmission media, especially low-speed media (cable modem, xDSL or UMTS), also call for a significant enhancement of coding efficiency. By introducing several unique techniques, H.264/AVC aims to increase the compression rate significantly (saving up to 50% of the bit rate compared with MPEG-2 at the same picture quality), to transmit high-quality images at both high and low bit rates, and to achieve better error robustness and network friendliness. A fully specified decoding process ensures that there is no mismatch during decoding, and the Network Abstraction Layer allows H.264 to be transported over different networks [2].

[Figure 2.13 Evolution of video coding standards]

2.2.4 MPEG and H.264/AVC Comparison

The ITU-T Video Coding Experts Group developed the H.261 standard for video conferencing applications; it offered reasonable compression performance with relatively low complexity. It was superseded by the popular H.263 standard, which offered better performance through features such as half-pixel motion compensation and improved variable-length coding. Two further versions of H.263 were released, each offering additional optional coding modes to support better compression efficiency and greater flexibility. The latest version (version 3) includes 19 optional modes, but is constrained by the requirement to support the original 'baseline' H.263 CODEC. The H.26L standard, under development at the time [8] was written, incorporates a number of new coding tools, such as a 4x4 block transform and flexible motion vector options, and promises to outperform earlier standards. Comparing the performance of the various coding standards is difficult, because a direct rate-distortion comparison does not take into account other factors such as features, flexibility and market penetration.

Table 2.1 Comparison of the video coding standards (coding performance: 1 = lowest, 5 = highest)

  Standard  | Target application      | Performance | Features
  MJPEG     | Image coding            | 1 | Scalable and lossless coding modes
  H.261     | Video conferencing      | 2 | Integer-pixel motion compensation; motion estimation on 16x16 blocks
  MPEG-1    | Video-CD                | 3 | I, P, B-pictures; half-pixel compensation
  MPEG-2    | Digital TV              | 3 | As above; field coding, scalable coding
  H.263     | Video conferencing      | 4 | Optimized for low bit rates; many optional modes; motion estimation on 16x16 or 8x8 blocks
  MPEG-4    | Multimedia conferencing | 4 | Many options, including content-based tools
  H.264/AVC | Video coding            | 5 | 4x4 DCT, 2x2 DHT; motion estimation from 16x16 down to 4x4; CAVLC, CABAC

It seems clear that the H.263, MPEG-2 and MPEG-4 standards each have their advantages for designers of video communication systems. Each of these standards makes use of common coding technologies: motion estimation and compensation, block transformation and entropy coding [8].
2.3 H.264/AVC Video Decoder Data Flow

[Figure 2.14 H.264/AVC decoder data flow]

1. A compressed bit stream (NAL) is entropy decoded to extract the coefficients, motion vector and header for each macroblock.
2. Run-level coding and reordering are reversed to produce a quantized, transformed macroblock X.
3. X is rescaled and inverse transformed to produce a decoded residual D'.
4. The decoded motion vector is used to locate a 16x16 region in the decoder's copy of the previous (reference) frame F'(n-1). This region becomes the motion-compensated prediction P.
5. P is added to D' to produce a reconstructed macroblock. The reconstructed macroblocks are saved to produce the decoded frame F'(n).

After a complete frame is decoded, F'(n) is ready to be displayed and may also be stored as a reference frame for the next decoded frame F'(n+1). It is clear from the figures and from the above explanation that the encoder includes a decoding path (rescale, IDCT, reconstruct). This is necessary to ensure that the encoder and decoder use identical reference frames F'(n-1) for motion-compensated prediction [9].
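The five steps can be condensed into the per-macroblock skeleton below. All type and function names are illustrative placeholders of our own; the JM decoder spreads this work across many source files, but the order of operations is the one just described.

    /* Condensed H.264 decode path for one macroblock (steps 1-5 above).
     * Every type and helper here is an illustrative placeholder. */
    typedef struct Bitstream Bitstream;   /* opaque NAL/bit stream state */
    typedef struct Frame Frame;           /* opaque decoded picture      */
    typedef struct { int mvx, mvy; short coeff[16 * 16]; } MBData;

    void entropy_decode(Bitstream *nal, MBData *mb);           /* step 1 */
    void reorder_and_run_level_decode(MBData *mb);             /* step 2 */
    void rescale_and_idct(MBData *mb);                         /* step 3 */
    void motion_compensate(const Frame *ref, const MBData *mb,
                           int mbx, int mby, Frame *out);      /* step 4 */
    void add_residual(Frame *out, const MBData *mb,
                      int mbx, int mby);                       /* step 5 */

    void decode_macroblock(Bitstream *nal, const Frame *ref,
                           Frame *out, int mbx, int mby)
    {
        MBData mb;
        entropy_decode(nal, &mb);
        reorder_and_run_level_decode(&mb);
        rescale_and_idct(&mb);
        motion_compensate(ref, &mb, mbx, mby, out);
        add_residual(out, &mb, mbx, mby);
    }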
CHAPTER 3 DEVELOPMENT PLATFORM OVERVIEW

3.1 Microsoft Windows CE Overview

Microsoft Windows CE is a compact, highly efficient, multiplatform operating system. It is not a reduced version of Microsoft Windows 95/XP; it was designed from the ground up as a multithreaded, fully preemptive, multitasking operating system for platforms with limited resources. Its modular design allows it to be customized for products ranging from consumer electronic devices to specialized industrial controllers.

General features of Windows CE:
- Provides a modular operating system that you can customize for specific products. The basic core of the operating system requires less than 200 KB of ROM.
- Provides interrupt delivery, prioritizing, and servicing.
- Runs on a wide variety of platforms.
- Supports more than 1,000 of the most frequently used Microsoft Win32 functions, along with familiar development models and tools.
- Supports a variety of user-interface hardware, including touch screens and color displays with up to 32-bits-per-pixel color depth.
- Supports a variety of serial and network communication technologies.
- Supports Mobile Channels to provide Web services for Windows CE users.
- Supports COM/OLE, Automation, and other advanced methods of inter-process communication.

Windows CE has four primary modules or groups of modules:
- The kernel supports basic services, such as process and thread handling and memory management.
- The file system supports persistent storage of information.
- The graphics, windowing and events subsystem (GWES) controls graphics and window-related features.
- The communications interface supports the exchange of information with other devices [14].

[Figure 3.1 Windows CE operating system structure]

3.1.1 Kernel

The kernel, the core of the operating system, provides system services for managing threads, memory, and resources. It includes:
- Preemptive, priority-based thread scheduling based on the Win32 process and thread model. Priority inversion is prevented with a system of priority inheritance that dynamically adjusts thread priorities.
- Predictable thread synchronization mechanisms, including wait objects. Examples of these mechanisms are named mutexes, critical sections, and named and unnamed event objects.
- Efficient memory management based on dynamic-link libraries (DLLs), which link to user applications at run time.
- A flat, virtual address space, with 32 MB of memory reserved for each process. Process memory is protected by altering page protections.
- On-demand paging for both read-only memory (ROM) and random access memory (RAM).
- Heap sizes that are limited only by available memory.
- Control of interrupt handling. You can map interrupt requests (IRQs) to hardware interrupts and implement your own interrupt service routines and interrupt service threads.
- Extensive debugging support, including just-in-time debugging.

3.1.2 Persistent Storage

The file system supports persistent storage of information. It includes:
- Support for FAT file systems with up to nine FAT volumes.
- Transactioned file handling to protect against data loss.
- Demand paging for devices that support paging.
- FAT file system mirroring to allow preservation of the file system if power is lost or a cold reset is needed.
- Installable block device drivers.

3.1.3 Communications Interface

The communications interface supports a wide range of technologies. It includes:
- Support for serial communications, including infrared links.
- Support for Internet client applications, including the Hypertext Transfer Protocol (HTTP) and File Transfer Protocol (FTP).
- A Common Internet File System (CIFS) redirector for access to remote file systems by means of the Internet.
- A subset of Windows Sockets (Winsock) version 1.1, plus support for Secure Sockets.
- A Transmission Control Protocol/Internet Protocol (TCP/IP) transport layer configurable for wireless networking.
- An Infrared Data Association (IrDA) transport layer for robust infrared communication.
- Both the Point-to-Point Protocol (PPP) and the Serial Line Internet Protocol (SLIP) for serial-link networking.
- Support for local area networking through the Network Driver Interface Specification (NDIS).
- Support for managing phone connections with the Telephony API (TAPI).
- A Remote Access Service (RAS) client for connections to remote file systems by modem.

3.1.4 Graphics, Windowing, and Events Subsystem (GWES)

The GWES module supports the graphics and windowing functionality needed to display text and images and to receive user input. It includes:
- Support for a broad range of window styles, including overlapping windows.
- A large selection of customizable controls.
- Support for keyboard and stylus input.
- A command bar combining the functionality of a toolbar and a menu bar.
- An Out of Memory dialog box that requests user action when the system is low on memory.
- Full Unicode support.
- A multiplatform graphics device interface (GDI) that supports both color and grayscale displays with color depths of up to 32 bits per pixel, palette management, TrueType and raster fonts, and printer, memory, and display device contexts (DCs).

3.1.5 Handling, Multitasking, and Multithreading

The Windows CE kernel contains the core operating system functionality that must be present on all Windows CE-based platforms. It includes support for memory management, process management, exception handling, multitasking, and multithreading. The Windows CE kernel borrows much of what is best from Windows-based desktop platforms. For example, all Windows CE-based applications run in a fully preemptive, multitasking environment, in protected memory spaces. Windows CE supports native Unicode strings, allowing you to internationalize applications.

Unlike the kernels found on Windows-based desktop platforms, the Windows CE kernel uses DLLs to maximize available memory. The DLLs are written as reentrant code, which allows applications to simultaneously share common routines. This approach minimizes the amount of memory-resident code required to execute applications.
3.1.6 Processes and Threads

As a multitasking operating system, Windows CE can support up to 32 simultaneous processes, each process being a single instance of an application. In addition, multithreading support allows each process to create multiple threads of execution. A thread is a part of a process that runs concurrently with other parts. Threads operate independently, but each belongs to a particular process and shares the same memory space. The total number of threads is limited only by available physical memory.

Processes rely on Win32 messages to initiate processing, control system resources, and communicate with the operating system and the user. Each process has its own message queue. For multithreaded applications, each thread also has its own separate message queue. When there are no messages in the queue and the thread is not engaged in any other activity, the system suspends the thread, saving CPU resources.

Although a thread can operate independently, it often needs to be managed by the process; for example, one thread may depend on another for information. Thread synchronization suspends a thread's execution until the thread receives notification to proceed. Windows CE supports thread synchronization by providing a set of wait objects, which stop a thread until a change in the wait object signals the thread to proceed. Supported wait objects include critical sections, named and unnamed events, and named mutex objects. Windows CE implements thread synchronization with minimal processor resources, an important feature for many battery-powered devices. And, unlike many operating systems, Windows CE uses the kernel to handle thread-related tasks such as scheduling, synchronization, and resource management; consequently, an application need not poll for process or thread completion or perform other thread-management functions.

Because Windows CE is preemptive, it allows the execution of a process or thread to be preempted by one with higher priority. It uses a priority-based, time-slice algorithm, with eight levels of thread priority, for thread scheduling.
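As a small illustration of this thread and wait-object model, the Win32 C fragment below starts a worker thread and blocks on an event until the worker signals completion; the waiting thread is suspended by the kernel and consumes no CPU in the meantime. These calls (CreateThread, CreateEvent, WaitForSingleObject) belong to the Win32 subset that Windows CE supports; error handling is omitted for brevity, and the event name is arbitrary.

    #include <windows.h>

    static HANDLE g_done;                 /* event used as a wait object */

    static DWORD WINAPI Worker(LPVOID param)
    {
        (void)param;
        /* ... work (e.g., decoding) would go here ... */
        SetEvent(g_done);                 /* wake the waiting thread */
        return 0;
    }

    void StartAndWait(void)
    {
        HANDLE thread;

        g_done = CreateEvent(NULL, FALSE, FALSE, TEXT("WorkerDone"));
        thread = CreateThread(NULL, 0, Worker, NULL, 0, NULL);

        /* Suspended here, at no CPU cost, until Worker signals. */
        WaitForSingleObject(g_done, INFINITE);

        CloseHandle(thread);
        CloseHandle(g_done);
    }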
3.1.7 Memory Architecture

The Windows CE kernel supports a single flat, or unsegmented, virtual address space that all processes share. Instead of assigning each process a different address space, Windows CE protects process memory by altering page protections. Because the kernel maps virtual addresses onto physical memory, you do not need to be concerned with the physical layout of the target system's memory.

Approximately 1 GB of virtual memory is available to processes. It is divided into 33 slots, each 32 MB in size. The kernel protects each process by assigning it to a unique slot, with one slot reserved for the currently running process. Thus the number of processes is limited to 32, but there is no limit, aside from physical memory, on the total number of threads. The kernel prevents an application from accessing memory outside of its allocated slot by generating an exception. Applications can check for, and handle, such exceptions by using the try-except statement.

Windows CE allows memory mapping, which permits multiple processes to share the same physical memory. Memory mapping results in very fast data transfer between cooperating processes, or between a driver and an application. Approximately 1 GB of virtual address space, distinct from that used for the slots, is allocated for memory mapping.

Windows CE always allocates memory to applications one page at a time. The system designer specifies the page size when the operating system is built for the target hardware platform. On a Handheld PC, for example, the page size is typically either 1 KB or 4 KB.

3.1.8 Physical Memory Usage

Windows CE-based platforms usually have no disk drive. Therefore, physical memory, typically consisting of a combination of ROM and RAM, plays a substantially different role on a Windows CE-based platform than it does on a desktop computer. Because ROM cannot be modified by the user, it is used for permanent storage. The contents of ROM, determined by the original equipment manufacturer (OEM), include the operating system and any built-in applications that the manufacturer provides, for example Microsoft Pocket Word and Microsoft Pocket Excel on a Windows CE-based platform. Depending on your product requirements, you can also place application code in ROM.

Because RAM on most Windows CE systems is maintained continuously, it is effectively nonvolatile. This feature allows applications to use RAM for persistent storage as well as program execution, compensating for the lack of a disk drive. To serve these two purposes, RAM is divided into storage, also known as the object store, and program memory. Program memory is used for program execution, while the object store is used for persistent storage of data and of any executable code not stored in ROM.

To minimize RAM requirements on Windows CE-based devices, executable code stored in ROM usually executes in place, not in RAM. Because of this, the operating system needs only a small amount of RAM for purposes such as stack and heap storage. Applications are commonly stored and executed in RAM; this approach is used primarily by third-party applications that are added by the user. Because RAM-based applications are stored in compressed form, they must be uncompressed and loaded into program memory for execution. To increase the performance of application software and reduce RAM use, Windows CE supports on-demand paging: the operating system needs to uncompress and load only the memory page containing the portion of the application that is currently executing. When execution is finished, the page can be swapped out and the next page can be loaded.

Like RAM-based applications, ROM-based executable code, including DLLs, can be compressed. When compressed, the code does not execute in place, but is handled much like its RAM-based counterpart: it is uncompressed and loaded a page at a time into RAM program memory, and then swapped out when no longer needed.

3.1.9 Graphics, Windowing, and Event Subsystem

The Graphics, Windowing and Event Subsystem (GWES) is the graphical user interface between the user, your application, and the operating system. GWES handles user input by translating keystrokes, stylus movements, and control selections into messages that convey information to applications and the operating system. GWES handles output to the user by creating and managing the windows, graphics, and text that are displayed on display devices and printers.

GWES supports all the windows, dialog boxes, controls, menus, and resources that make up the Windows CE user interface. This interface allows users to control applications by choosing menu commands, pushing buttons, checking and unchecking boxes, and manipulating a variety of other controls. GWES provides information to the user in the form of bitmaps, carets, cursors, text, and icons.
Even Windows CE-based platforms that lack a graphical user interface use the basic windowing and messaging capabilities of GWES. These provide the means for communication between the user, the application, and the operating system. As part of GWES, Windows CE provides support for active power management to extend the limited lifetime of battery-operated devices: the operating system automatically determines a power consumption level to match the state of operation of the device. Figure 3.2 describes the GWES structure.

[Figure 3.2 Basic GWES structure]

3.1.9.1 Window Management

The most central feature of GWES is the window. On Windows CE-based platforms with traditional graphical displays, the window is the rectangular area of the screen where an application displays output and receives input from the user. However, all applications need windows in order to receive messages from the operating system, even applications created for devices that lack graphical displays.

When you create a window, Windows CE creates a message queue for the window. The operating system translates the information it receives from the user into messages, which it places into the message queue of the active window. The application processes most of these messages and passes the rest back to Windows CE for processing.

Windows CE does not send applications any messages dealing with the nonclient area of the window. A window's nonclient area is the area of the window where an application is not allowed to draw, such as the title bar and scroll bars; the window manager controls the nonclient area. Windows CE does not support the Maximize and Minimize buttons. A user can send a window to the back of the Z order by tapping the window's button on the taskbar, and restore the window by tapping its taskbar button again. The taskbar is always visible on Windows CE: you cannot hide the taskbar or use the full screen to display a window.

3.1.10 Controls, Menus, Dialog Boxes, and Resources

GWES provides controls, menus, dialog boxes, and resources to give the user a standard way to make selections, carry out commands, and perform input and output tasks. Controls and dialog boxes are child windows that allow users to view and organize information and to set or change attributes; a dialog box is a window that contains controls. All menus in Windows CE are implemented as top-level pop-up windows. Windows CE supports scrolling menus that automatically add scroll arrows when a menu does not fit on the screen.

Table 3.1 Controls, menus, dialog boxes, and resources supported by Windows CE

  Application-defined dialog boxes, bitmaps, carets, check boxes, combo boxes, command bands, command bars, common dialog boxes, cursors, the custom draw service, date and time picker controls, edit controls, group boxes, header controls, icons, image lists, images, keyboard accelerators, list boxes, list views, menus, message boxes, scroll bars, static controls, status bars, strings, tree views, and up-down controls.

Windows CE does not support menu bars, but it does support command bars, which combine the functionality of a menu bar and a toolbar in one control. Command bars make efficient use of the limited space available on many Windows CE-based devices. In addition to the controls listed in Table 3.1, Windows CE supports the HTML viewer control, which makes it easier to add HTML support to your applications.
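In code, the window-and-message-queue model of section 3.1.9.1 reduces to the familiar Win32 skeleton below: register a window class, create a window, and pump messages from the queue that GWES fills. The sketch compiles against the Win32 subset of Windows CE; the class and window names are arbitrary.

    #include <windows.h>

    static LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp)
    {
        switch (msg) {
        case WM_DESTROY:
            PostQuitMessage(0);           /* ends the message loop below */
            return 0;
        }
        return DefWindowProc(hwnd, msg, wp, lp);  /* default handling */
    }

    int WINAPI WinMain(HINSTANCE hInst, HINSTANCE hPrev, LPWSTR cmd, int show)
    {
        WNDCLASS wc = {0};
        HWND hwnd;
        MSG msg;

        wc.lpfnWndProc   = WndProc;
        wc.hInstance     = hInst;
        wc.lpszClassName = TEXT("DemoWnd");
        RegisterClass(&wc);

        hwnd = CreateWindow(TEXT("DemoWnd"), TEXT("Demo"), WS_VISIBLE,
                            CW_USEDEFAULT, CW_USEDEFAULT,
                            CW_USEDEFAULT, CW_USEDEFAULT,
                            NULL, NULL, hInst, NULL);

        /* GWES puts translated user input into this window's message
         * queue; the loop retrieves it and dispatches it to WndProc. */
        while (GetMessage(&msg, NULL, 0, 0) > 0) {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
        return (int)msg.wParam;
    }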
3.1.11 Graphics Device Interface

The graphics device interface (GDI) is the GWES subsystem that controls the display of text and graphics. You use GDI to draw lines, curves, closed figures, text, and bitmapped images. GDI uses a device context (DC) to store the information it needs to display text and graphics on a specified device. The graphic objects stored in a DC include a pen for line drawing, a brush for painting and filling, a font for text output, a bitmap for copying or scrolling, a palette for defining the available colors, and a region for clipping. Windows CE supports printer DCs for drawing on printers, display DCs for drawing on video displays, and memory DCs for drawing in memory.

Table 3.2 GDI features supported

  Raster and TrueType fonts: only one of the two can be used on a specified system. TrueType fonts generate superior text output because they are scalable and rotatable.
  Custom color palettes, and both palettized and nonpalettized color display devices: supports color formats of 1, 2, 4, 8, 16, 24, and 32 bits per pixel (bpp); the first two are unique to Windows CE.
  Bit block transfer functions and raster operation codes: allow you to transform and combine bitmaps in a wide variety of ways.
  Pens and brushes: supports dashed, wide, and solid pens, and patterned brushes.
  Printing: supports full graphical printing.
  Shape drawing functions: supports the ellipse, polygon, rectangle, and round rectangle shapes.

3.2 Intel PXA270/2700G Hardware Architecture

The PXA27x processor offers an integrated system-on-a-chip design based on the Intel XScale microarchitecture. It integrates the Intel XScale core with many on-chip peripherals, allowing the design of many different products for the handheld and cellular handset markets [15].

[Figure 3.3 PXA270 hardware architecture]

3.2.1 CPU

The PXA27x processor is an implementation of the Intel XScale microarchitecture, which is described in the Intel XScale Core Developer's Manual. The characteristics of this particular implementation include the following:
- Several coprocessor registers
- Semaphores and interrupts for processor control
- Multiple reset mechanisms
- Sophisticated power management
- Highly multiplexed pin usage

3.2.2 Internal Memory

The PXA27x processor provides 256 KB of internal memory-mapped SRAM, divided into four banks of 64 KB each. Features:
- 256 KB of on-chip SRAM arranged as four banks of 64 KB
- Bank-by-bank power management for reduced power consumption
- Byte write support

3.2.3 Memory Controller

The external memory-bus interface of the PXA27x processor supports SDRAM; synchronous and asynchronous burst-mode and page-mode flash memory; page-mode ROM; SRAM; PC Card; and CompactFlash expansion memory. Memory types are programmable through the memory interface configuration registers. Memory requests are placed in a four-deep processing queue and processed in the order they are received.

3.2.4 LCD Controller

The LCD/flat-panel controller is backward compatible with the Intel PXA25x and PXA26x processor LCD controllers, and several additional features are supported.
One addition is support for pixel formats of 18, 19, 24, and 25 bits per pixel (bpp). The following list describes features supported by the PXA27x processor LCD controller:
- Display modes:
  - Support for single- or dual-scan display modules
  - Passive monochrome mode supports up to 256 gray-scale levels (8 bits)
  - Active color mode supports up to 16,777,216 colors (24 bits)
  - Passive color mode supports a total of 16,777,216 colors (24 bits)
  - Support for LCD panels with an internal frame buffer
  - Support for 8-bit (each) passive dual-scan color displays
  - Support for up to 18-bit-per-pixel single-scan color displays without an internal frame buffer
  - Support for up to 24-bit-per-pixel single-scan color displays with an internal frame buffer

3.2.5 DMA Controller

The PXA27x processor contains a direct memory access (DMA) controller that transfers data to and from memory in response to requests generated by peripheral devices or companion chips. The peripheral devices and companion chips do not directly supply addresses and commands to the memory controller; instead, the state required to manage a data stream is maintained in 32 DMA channels [16].

3.2.6 2700G Multimedia Accelerator

As the multimedia capabilities of handheld devices grow to include higher resolutions, high-quality displays, complete graphical user interfaces (GUIs), multimedia, 3D capabilities, and video playback, it becomes necessary to include a graphics accelerator in many handheld devices. The 2700G Multimedia Accelerator is a low-power, full-featured graphics accelerator optimized for Intel Personal Client Architecture solutions, and it supports the PXA27x processor family. The 2700G provides high-performance 2D, 3D, MPEG-2, MPEG-4, and Windows Media Video (WMV) acceleration as well as dual-display capabilities, and supports resolutions up to XGA (1024 x 768 x 24 bpp) color.

The 2700G's 3D acceleration provides a complete hardware 3D rendering pipeline, with a processing capability of 831K triangles per second and a fill rate of 84 million pixels per second. Advanced 3D hardware acceleration includes features such as texture and light mapping; point, bilinear and anisotropic filtering; alpha blending; dual texturing support; deferred texturing; screen tiling; texture compression; and full-screen anti-aliasing. The 2700G's 2D acceleration includes support for clipping and anti-aliasing; all 256 raster operations (source, pattern, and destination) defined by Microsoft are supported.

For optimal performance and lowest power, the following video-decode capabilities are accelerated in 2700G hardware: inverse zig-zag (IZZ), inverse discrete cosine transform (IDCT), motion compensation (MC), color space conversion (CSC), and scaling. The 2700G hardware performs inverse zig-zag, inverse DCT, and motion compensation for MPEG-1/2/4 video streams, and motion compensation for WMV video streams [17]. The 2700G also adds flexibility to a system's display capabilities.

[Figure 3.4 PXA270/2700G system architecture block diagram]

With this device, a system can use a wide variety of displays, including integrated LCDs of various resolutions, color depths, and refresh rates, as well as external displays (e.g., analog CRTs, TVs, or digital flat panels).
When paired with the applications processor, the 2700G Multimedia Accelerator is capable of simultaneously driving separate display streams to two separate displays [18].

CHAPTER 4 DEVELOPMENT DESCRIPTION

4.1 Win CE Platform Development Environment

One thing you can do with Microsoft Windows CE 5.0 Platform Builder is use Ethernet for connectivity. Ethernet, the network hardware used extensively throughout the Internet, is a very fast way (at 10 Mbps) for two or more computers to communicate. You should be able to access your target platform (PXA270/2700G) from Windows XP via Ethernet, much like Microsoft's peer-to-peer networking. On corporate networks that use DHCP to assign settings automatically, connecting the target platform (PXA270/2700G) via Ethernet is even easier.

[Figure 4.1 Win CE platform development environments: a development PC running Platform Builder and Windows Embedded VC++, connected through a hub to the target (OS: Win CE 5.0; hardware: Intel PXA270 & 2700G; H.264 player: JMPlayer)]

4.1.1 Required Hardware and Software

You will need the following before getting started:
- A PC with Windows XP.
- Windows CE Platform Builder 5.0.
- A desktop or laptop Ethernet card installed, including the Microsoft networking client and TCP/IP (you can check this from the Network Control Panel).
- An NE2000 Ethernet PC Card from Socket Communications (http://www.socketcom.com/) for your target platform (PXA270/2700G). Note: other NE2000-compatible cards may work with the Handheld PC but are not yet certified at this writing; 3Com and Xircom are working on drivers for their Ethernet PC Cards as well.
- An Ethernet hub, or a crossover cable, to plug both units into.

[Figure 4.2 PXA270/2700G development environment - Ethernet card]

4.1.2 PC Configuration

You must configure your PC for Ethernet and install the Microsoft Network Client and the TCP/IP protocol; don't forget to fill in the identification section. This is done from Start, Settings, Control Panel, Network. I recommend using the address 192.168.0.1 with subnet mask 255.255.255.0.

[Figure 4.3 (a) PC side - TCP/IP setting]

4.1.3 Configure the TCP/IP Properties on Your Desktop PC

Before you begin using Ethernet, you must have Windows CE Platform Builder 5.0 installed. Also, you must use your serial cable to establish a connection to your PC for Ethernet to work; this is the only way that your target platform (PXA270/2700G) learns the name of your desktop.

4.1.4 Target Platform (PXA270/2700G) Configuration

Establish a serial connection to the PC with which you are going to use ActiveSync. This is required to put the PC's computer name into your target platform (PXA270/2700G) for use with Ethernet. Install the Ethernet drivers from Windows CE Platform Builder 5.0; they are located with the optional software. You need to copy the drivers to your target platform's /Windows directory, which will add a Network Control Panel as well as other relevant files. They require less than 200 KB of free RAM allocated for storage. You then need to reset the device to load the new components.

Next, configure the Network Control Panel for Ethernet on your target platform (under Start, Settings, Control Panel). You must fill in the IP address for the target platform (I suggest 192.168.0.2 with subnet mask 255.255.255.0) and the WINS server address, which is the PC's IP address (192.168.0.1 using my recommended setting). Leave the other fields blank.
[Figure 4.3 (b) Target platform side - TCP/IP setting: properly configured Network Control Panel settings on the target platform]

4.1.5 Connecting

Plug both the target platform and the PC into the hub. (If you are using a crossover cable, you can connect the target platform and the PC directly, without a hub.) Turn on the PC and the Handheld PC and plug the Ethernet PC Card into the Handheld PC. Check for a link light on the PC Card; if it is not lit, you have a cable or hub problem. To start the connection, select ActiveSync on your target platform (under Start, Programs, Communications). Make sure the connection is set to networking and the host matches the PC computer name, then click Connect to start communications. ActiveSync must be running at all times when you are using Windows CE Platform Builder with your target platform.

4.1.6 Download the Application (JMPlayer) via the Network Connection

If you enable continuous synchronization, your target platform will stay up to date, downloading the new application (JMPlayer) and other files and data while it is connected to your desktop PC.

4.2 The Development Flow of the Video Decoder

[Figure 4.4 The development flow of the H.264/AVC video decoder]

4.2.1 Task I - Survey stage of the H.264/AVC decoder [12]
  Hardware: x86. OS: Windows XP. Decoder: JM V10.1.

4.2.2 Task II_a - Simulation stage of the H.264/AVC decoder
  Hardware: x86. OS: embedded Windows CE 5.0 (CEPC). Decoder: JM V10.1 -> JMPlayer V1.0.1.

4.2.3 Task II_b - Implementation stage of the H.264/AVC decoder
  Hardware: Intel PXA270/2700G embedded platform. OS: embedded Windows CE 5.0. Decoder: JMPlayer V1.0.1b.

4.2.4 Task II_c - Fine-tuning stage of the H.264/AVC decoder
  Hardware: Intel PXA270/2700G embedded platform. OS: embedded Windows CE 5.0. Decoder: JMPlayer V1.0.1c.

For the detailed program flow of JMPlayer V1.0.1 (top flow, decode one frame, read new slice, decode one slice, read macroblock, decode macroblock), see the Appendix - Program Flow [11].

4.3 The Components of the Video Decoder - JMPlayer

[Figure 4.5 Win CE platform block diagram, the components of the JMPlayer]

As shown in Figure 4.5, the components of the JMPlayer are the "Application" (the H.264/AVC video decoder), the "Embedded Shell", the "File Manager", the "Device Manager", the "Device Driver", the "OAL Bootloader", and the "OEM Hardware" (Intel PXA270/2700G).

[Figure 4.6 Video data stream flow]

4.4 Program

[Figure 4.7 JMPlayer.exe on the PXA270/2700G platform]

4.5 Improvement

4.5.1 The Memory Architecture in Windows CE 5.0

In any Microsoft Windows CE-based device, ROM stores the entire operating system (OS), as well as the applications that come with the OS design. If a module is not compressed, the OS executes the ROM-based module in place. If the OS compresses a ROM-based module, it decompresses the module and pages it into RAM. The OS loads all read/write data into RAM. The OEM controls the option to enable compression in ROM.

When the OS executes programs directly from ROM, it saves program RAM and reduces the time needed to start an application, because the OS does not have to copy the program into RAM before launching it. Programs contained in the object store or on a flash memory storage card are not executed in place; instead, the OS pages these programs into RAM and executes them there. Depending on the OEM and the driver options on a specific Windows CE-based device, a program module can be paged on demand.
The OS can bring in one page at a time or load the entire module into RAM at once.

The RAM on a Windows CE-based device is divided into two areas: the object store and the program memory.

- The object store resembles a permanent, virtual RAM disk. Data in the object store is retained when you suspend or perform a soft reset of the system. Devices typically have a backup power supply for the RAM to preserve data if the main power supply is interrupted temporarily. When operation resumes, the system looks for a previously created object store in RAM and uses it if one is found. Devices that do not have battery-backed RAM can use the hive-based registry to preserve data across boots.
- The program memory consists of the remaining RAM. Program memory works like the RAM in a personal computer: it stores the heaps and stacks of the applications that are running.

The maximum size of the RAM file system is 256 MB, with a maximum size of 32 MB for a single file; a database-volume file, however, has a 16-MB limit. The maximum number of objects in the object store is 4,000,000.

The boundary between the object store and the program RAM is movable. In the Control Panel, a user can modify the system settings to move the boundary, if the OS design provides this option. Under low-memory conditions on some Windows CE-based devices, the OS might prompt the user for permission to take some object store RAM for use as program RAM to meet an application's RAM requirements.

When the Windows CE OS starts, it creates a single 4-gigabyte (GB) virtual address space. The address space is divided into 33 slots, and each slot is 32 megabytes (MB). All processes share the address space. When a process starts, Windows CE selects an open slot for the process in the system's address space. Slot zero is reserved for the currently running process.

Additionally, Windows CE creates a stack for the thread and a heap for the process. Each stack has an initial size of at least 1 kilobyte (KB) or 4 KB, which is committed on demand. The amount of stack space reserved by the system, on a per-process basis, is specified in the /STACK option of the linker. Because the stack size is CPU-dependent, the system on some devices allocates 4 KB for each stack. The maximum number of threads depends on the amount of available memory. You can allocate additional memory, outside of the 32 MB assigned to each slot, by using memory-mapped files or by calling the VirtualAlloc function.

The following illustration shows how memory is allocated in the Windows CE address space.

Figure 4.8 Memory allocation in the Windows CE address space

When a process initializes, the OS maps the following DLLs and memory components:
- Some execute-in-place (XIP) dynamic-link libraries (DLLs)
- Some read/write sections of other XIP DLLs
- All non-XIP DLLs
- The stack
- The heap
- The data section for each process, in the slot assigned to the process

DLLs and ROM DLL read/write sections are loaded at the top of the slot. DLLs are controlled by the loader, which loads each DLL at the same address for every process. The stack, the heap, and the executable (.exe) file are created and mapped from the bottom up. The bottom 64 KB of memory always remains free [13].

4.5.1.1 Virtual Memory

Windows CE implements a paged virtual memory management system similar to other Microsoft Windows-based desktop platforms. A page is always made up of 4,096 bytes (4 KB).
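As a quick run-time sanity check, an application can confirm the page size with the standard Win32 GetSystemInfo call, which Windows CE also provides. This is only a minimal sketch:

    #include <windows.h>

    /* Print the system page size; on the platform discussed here this is
       expected to report 4096 bytes. */
    void ShowPageSize(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        RETAILMSG(1, (L"Page size: %u bytes\r\n", si.dwPageSize));
    }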
Each process in Windows CE has only 32 MB of virtual address space. Your process, its dynamic-link libraries (DLLs), heaps, stacks, and virtual memory allocations all use this address space. By default, memory allocated by VirtualAlloc falls into the virtual address space of the process; however, you can request that the memory be allocated outside the process slot.

4.5.1.2 VirtualAlloc

This function reserves or commits a region of pages in the virtual address space of the calling process. Memory allocated by VirtualAlloc is initialized to zero.

    LPVOID VirtualAlloc(
        LPVOID lpAddress,
        DWORD  dwSize,
        DWORD  flAllocationType,
        DWORD  flProtect
    );

VirtualAlloc can commit a region of pages reserved by a previous call to the VirtualAlloc function. You can use VirtualAlloc to reserve a block of pages and then make additional calls to VirtualAlloc to commit individual pages from the reserved block. This enables a process to reserve a range of its virtual address space without consuming physical storage until it is needed.

Each page in the virtual address space of the process is in one of three states:

- Free, in which the page is neither committed nor reserved and is not accessible to the process. VirtualAlloc can reserve, or simultaneously reserve and commit, a free page.
- Reserved, in which the range of addresses cannot be used by other allocation functions, but the page is not accessible and has no physical storage associated with it. VirtualAlloc can commit a reserved page, but it cannot reserve it a second time. The VirtualFree function can release a reserved page, making it a free page.
- Committed, in which physical storage is allocated for the page and access is controlled by a protection code. The system initializes and loads each committed page into physical memory only at the first attempt to read or write that page. When the process terminates, the system releases the storage for committed pages. VirtualAlloc can commit an already committed page; this means you can commit a range of pages, regardless of whether they have already been committed, and the function will not fail. VirtualFree can decommit a committed page, releasing the page's storage, or it can simultaneously decommit and release a committed page.

If the lpAddress parameter is not NULL, the function uses the lpAddress and dwSize parameters to compute the region of pages to be allocated. The current state of the entire range of pages must be compatible with the type of allocation specified by the flAllocationType parameter; otherwise, the function fails and no pages are allocated. This compatibility requirement does not preclude committing an already committed page.

If you call VirtualAlloc with dwSize >= 2 MB, flAllocationType set to MEM_RESERVE, and flProtect set to PAGE_NOACCESS, it automatically reserves memory at the shared memory region. This preserves per-process virtual memory.

4.5.2 The Method of the Embedded Memory Configuration

In image.cfg (%WINCEROOT%\PLATFORM\Sandgateii_g\Src\Inc):

    #define IMAGE_DISPLAY_RAM_OFFSET 0x07CB8000

Figure 4.9 The memory layout of the Win CE platform:

    0x00000000  Boot loader stack          64 KB
    0x00010000  Boot loader RAM            64 KB
    0x00020000  Boot loader code           256 KB
    0x00060000  OEM logo                   256 KB
    0x000A0000  Reserved for future use    380 KB
    0x000FF000  ARG                        4 KB
    0x00100000  NK (OS image)              48 MB
    0x03100000  RAM                        79 MB
    0x07CB8000  IMAGE_DISPLAY_RAM_OFFSET (display memory)
    0x08000000  End of RAM
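Returning to the VirtualAlloc discussion in section 4.5.1.2: to make the reserve-then-commit pattern concrete, here is a minimal sketch. The region size and the use made of the committed page are arbitrary choices for illustration:

    #include <windows.h>

    /* Reserve 1 MB of address space, then commit only the first 4-KB page.
       The reserved-but-uncommitted pages consume no physical storage. */
    void ReserveThenCommit(void)
    {
        DWORD  dwRegion = 1024 * 1024;   /* 1 MB reservation */
        LPVOID pBase = VirtualAlloc(NULL, dwRegion,
                                    MEM_RESERVE, PAGE_NOACCESS);
        if (pBase == NULL)
            return;

        /* Commit the first page and make it readable/writable. */
        LPVOID pPage = VirtualAlloc(pBase, 4096,
                                    MEM_COMMIT, PAGE_READWRITE);
        if (pPage != NULL)
            *(DWORD *)pPage = 0x1234;    /* first touch loads the page */

        /* Decommit and release the whole region when finished. */
        VirtualFree(pBase, 0, MEM_RELEASE);
    }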
4.5.3 Quality Fine-Tuning

In Ldecod.cpp and JMPlayerDlg.cpp:

    input->R_decoder = 500000;   // Decoder rate
    input->B_decoder = 104000;   // Decoder buffer size
    input->F_decoder = 73000;    // Decoder initial delay

In the video decoder module, the timer setting:

    void CPJMPlayerDlg::OnButton1()
    {
        SetTimer(123, 20, NULL);   // set the decode timer to 20 ms
    }

During fine-tuning, the decode timer period was reduced step by step: 100 ms → 50 ms → 20 ms → 10 ms.

4.6 Windows CE Performance Monitor

4.6.1 Performance Profiling

With the remote tools for performance profiling, you can use a development workstation to remotely monitor performance criteria on a Microsoft® Windows® CE-based target device. Remote Performance Monitor is the remote tool that provides a variety of monitoring charts, logs, and viewers that allow you to measure performance on a target device.

Remote Performance Monitor is a graphical tool for measuring the performance of a Microsoft® Windows® CE-based OS design. With the tool, you can observe the behavior of performance objects such as CPUs, threads, processes, and system memory. Each performance object has an associated set of performance counters that provide information about device usage, queue lengths, and delays, as well as information used to measure throughput and internal congestion.

Remote Performance Monitor can track current activity on a target device and can also display data from a log file. To display information from a target device in Remote Performance Monitor, you must first connect to the target device with Platform Manager. A performance counter tracks a single metric on your target device. To get information on a performance object, call the function corresponding to the performance object; the data structure the function receives provides the statistics for that performance object [14].

4.6.2 Chart View Window

Remote Performance Monitor supports a Chart view that allows you to monitor performance in real time. The Chart View window lets you select the categories of statistics you want to monitor and the type of chart you want displayed. When you choose a category, the Remote Performance Monitor tool automatically updates the available options in that category.

4.6.3 Observe JMPlayer Performance

You can use the Remote Performance Monitor tool to display statistics that describe the performance of a target device in real time and to obtain information about resource use. You can also configure the tool to alert you when the target device meets a specified condition.

Step 1. Open the CEPB (Platform Builder) IDE.
Step 2. If you have not built your run-time image yet, build the run-time image with KITL mode.
Step 3. Establish a hardware connection between your CEPB and the target platform; then configure Platform Builder to download the run-time image (with KITL mode) to the target device over the established connection.
- If your target device is a CEPC, you must boot a run-time image on the CEPC.
- If your target platform is custom hardware, use the steps for booting a run-time image on a CEPC as a model for establishing a connection to your custom hardware.
Figure 4.10 Win CE platform monitor environment (host PC running WinCE Platform Builder, Windows eMbedded VC++, and the Windows CE Remote Performance Monitor, connected through a hub to the target running WinCE V5.0 in KITL mode on Intel PXA270 & 2700G with the JMPlayer H.264 player)

For more information, see section 4.1, Win CE Platform Development Environment.

Step 4. Download the run-time image to the target platform.
Step 5. Open the Remote Performance Monitor tool and configure the connection from the tool to the target platform. When you configure the connection, perform the following steps:
- In the Transport box, choose KITL Transport for Windows CE and then choose Configure.
- In the Named connection box, choose the named connection you used in the CEPB IDE to connect to the target device.
- In the Startup Server box, choose CESH Server for Windows CE and then choose Configure.
- Choose Directory containing last image booted by Platform Builder.
Step 6. Connect the Remote Performance Monitor tool to the target device.
Step 7. Open the Chart view window, which Remote Performance Monitor uses to display data from the target device in real time.
Step 8. Configure the Chart view window to display one or more statistics from the target device. For example, to display a process's processor use, do the following:
- In the Object box, choose CE Process Statistics.
- From the Counter list, choose % Processor Time.
Note: % Processor Time is the percentage of elapsed time that all the threads of this process used the processor to execute instructions.

Figure 4.11 Add "JMPlayer.exe % Processor Time" to the chart

Step 9. After you add a statistic to the chart, if you are not satisfied with the appearance of the line that plots the statistic, change the appearance of the line.
Step 10. If you are not satisfied with the appearance or behavior of the chart, modify the appearance or behavior.
Step 11. Open the Alert view window, which Remote Performance Monitor uses to notify you when a specific condition occurs on the target device.
Step 12. Configure the Alert view window to notify you when the specified conditions for JMPlayer.exe occur on the target platform.

4.7 Results

After the configuration in section 4.6, measuring the JMPlayer.exe processor time via the Windows CEPB remote tool (zoomed-in performance monitor) gave the results shown in Figures 4.12 and 4.13.

Figure 4.12 The monitor result of JMPlayer ver. 1.0b (72 sec)

Figure 4.13 The monitor result of JMPlayer ver. 1.0c (65 sec)

4.7.1 The Resource/Memory Use of the JMPlayer

Table 4.1 The resource/memory use of the JMPlayer

    JMPlayer version    Memory/resource use of the embedded platform
    V1.0b               96%
    V1.0c               98%

As Table 4.1 shows, JMPlayer ver. 1.0b used 96% of the memory/resources of the embedded platform, while JMPlayer ver. 1.0c used 98%.

4.7.2 The Quality of the JMPlayer

We played the same H.264/AVC video sequence, "foreman", on both versions.

Table 4.2 The quality of the JMPlayer

    JMPlayer version    Quality of the JMPlayer
    V1.0b               Lag
    V1.0c               Smooth

CHAPTER 5
CONCLUSIONS

Video conferencing via the Internet is becoming more widely used and may gain further acceptance with increases in processor and connection performance. Application trends fall into two areas:
1. Very low power, very low bandwidth video for embedded systems.
2. High bandwidth, high quality video coding.
As Table 4.2 shows, JMPlayer ver. 1.0b gave laggy playback and took 72 seconds, while JMPlayer ver. 1.0c gave smooth playback and took 65 seconds.

The same "foreman" video sequence, in H.264 format, was only 330 KB in size. In this thesis, the H.264/AVC video decoder application (JMPlayer) uses the Baseline profile, which is appropriate for video conferencing. On the PXA270/2700G platform, JMPlayer.exe V1.0c played the same H.264 "foreman" sequence with smooth quality, but consumed more of the embedded platform's memory/resources. On other embedded platforms, such as VIA C7, TI, or MIPS, the same methods could also be used to obtain good performance (high playback quality) from the H.264/AVC video decoder.

REFERENCES

[1] Joint Video Team of ITU-T and ISO/IEC JTC 1, "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)," Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, Mar. 2003.
[2] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560-576, July 2003.
[3] G. J. Sullivan and T. Wiegand, "Video compression - from concepts to the H.264/AVC standard," Proceedings of the IEEE, vol. 93, no. 1, pp. 18-31, Jan. 2005.
[4] J. Golston and A. Rao, "Video codecs tutorial: trade-offs with H.264, VC-1 and other advanced codecs," Texas Instruments White Paper, Mar. 2006.
[5] J. Maller, "FXScript Reference: RGB and YUV Color," http://www.joemaller.com
[6] S. Saha, "Image compression - from DCT to wavelets: a review," ACM Crossroads, http://www.acm.org/crossroads/xrds6-3/sahaimgcoding.html
[7] T. Hu and D. Wu, "Design of single scalar DSP based H.264/AVC decoder," Mar. 2005.
[8] I. E. G. Richardson, Video Codec Design: Developing Image and Video Compression Systems, John Wiley & Sons, Ltd., June 2002.
[9] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, John Wiley & Sons, Ltd., Dec. 2003.
[10] HHI Image Communication, "Scalable extension of H.264/AVC," http://ip.hhi.de/imagecom_G1/savce/index.htm
[11] ITU-T, Joint Video Team (JVT), http://www.itu.int/ITU-T/studygroups/com16/jvt
[12] TML, H.264/AVC reference software JM 10.1, http://iphome.hhi.de/suehring/tml/
[13] D. Boling, Programming Microsoft Windows CE .NET, 3rd ed., Microsoft Press, 2003.
[14] Help file of Platform Builder for Microsoft Windows CE 5.0, Microsoft Corp.
[15] Intel® XScale PXA27x Developer's Manual, Document No. 280000-002, Intel Corp., Apr. 2004.
[16] Intel® XScale PXA27x Design Guide, Document No. 280001-001, Intel Corp., May 2005.
[17] Intel® 2700G7 Multimedia Accelerator Datasheet, Document No. 304430-001, Intel Corp., Nov. 2004.
[18] Intel® 2700G Multimedia Accelerator Video Acceleration API Software Developer's Guide, Document No. 300950-001, Intel Corp., Jan. 2005.

APPENDIX - Program Flow [12]

Appendix 2 - Decode one frame (flow chart: setup → read_new_slice → frame/field decision → decode_frame_slice or decode_field_slice; slices are read and decoded until the next header is a start of picture (SOP); then deblock_frame → exit_frame outputs one YCbCr frame).
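In C terms, the Appendix 2 flow corresponds to a loop of roughly the following shape. This is only a sketch modeled on the structure of the JM reference decoder; the helper names follow the flow chart and are declared as stand-ins, not the exact JM API:

    /* Sketch of the "decode one frame" flow from Appendix 2. */
    typedef enum { SOS, SOP, EOS } HeaderType;  /* slice/picture/stream markers */

    extern HeaderType read_new_slice(void);     /* stand-in declarations */
    extern int  slice_is_frame_coded(void);
    extern void setup_frame(void);
    extern void decode_frame_slice(void);
    extern void decode_field_slice(void);
    extern void deblock_frame(void);
    extern void exit_frame(void);

    void decode_one_frame_sketch(void)
    {
        HeaderType header;
        setup_frame();
        do {
            header = read_new_slice();     /* parse the next slice header */
            if (slice_is_frame_coded())    /* frame/field decision        */
                decode_frame_slice();
            else
                decode_field_slice();
        } while (header != SOP);           /* stop at the next picture    */
        deblock_frame();                   /* in-loop deblocking filter   */
        exit_frame();                      /* output the YCbCr frame      */
    }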
Appendix 3 - Read new slice (flow chart: setup → GetAnnexbNALU if the stream is in Annex B format, otherwise GetRTPNALU → NALUtoRBSP → dispatch on the NALU type (IDR, DPA, DPB, DPC, SEI, PPS, SPS, PD, EOSEQ, EOSTREAM, FILL) to the corresponding read function → freeNALU → exit).

Appendix 4 - Decode one slice (flow chart: setup → set_ref_pic_num → start_mb → read_one_mb → dec_one_mb → exit_mb → exit).

Appendix 5 - Read macroblock (flow chart: setup → read_mb_mode → interpret the macroblock mode according to the slice type (P, I, B, SP, or SI); if the mode is P8x8, read the 8x8 sub-partition modes; select the read function according to the slice and entropy coding (I/SI slice, CABAC, or VLC non-intra); init_mb; handle DIRECT, COPY, and IPCM macroblocks; read the intra prediction modes or the motion information from the NAL; then read the IPCM samples or the CBP and coefficients from the NAL (readIPCMcoeffsFromNAL, readCBPandCoeffsFromNAL) → exit).
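As an illustration of the Appendix 3 dispatch, the following sketch follows the NAL unit type codes used by the JM reference software (which match the H.264 standard's nal_unit_type values); the handler actions are shown only as comments, since the real read functions take the decoder's state:

    /* Sketch of the "read new slice" NALU dispatch from Appendix 3. */
    typedef enum {
        NALU_TYPE_SLICE    = 1,   /* non-IDR coded slice        */
        NALU_TYPE_DPA      = 2,   /* data partition A           */
        NALU_TYPE_DPB      = 3,   /* data partition B           */
        NALU_TYPE_DPC      = 4,   /* data partition C           */
        NALU_TYPE_IDR      = 5,   /* IDR coded slice            */
        NALU_TYPE_SEI      = 6,   /* supplemental enhancement   */
        NALU_TYPE_SPS      = 7,   /* sequence parameter set     */
        NALU_TYPE_PPS      = 8,   /* picture parameter set      */
        NALU_TYPE_AUD      = 9,   /* access unit delimiter (PD) */
        NALU_TYPE_EOSEQ    = 10,  /* end of sequence            */
        NALU_TYPE_EOSTREAM = 11,  /* end of stream              */
        NALU_TYPE_FILL     = 12   /* filler data                */
    } NaluType;

    void dispatch_nalu(NaluType t)
    {
        switch (t)
        {
        case NALU_TYPE_IDR:      /* read the IDR slice            */ break;
        case NALU_TYPE_SLICE:    /* read a non-IDR slice          */ break;
        case NALU_TYPE_DPA:
        case NALU_TYPE_DPB:
        case NALU_TYPE_DPC:      /* read a data partition         */ break;
        case NALU_TYPE_SEI:      /* read the SEI message          */ break;
        case NALU_TYPE_SPS:      /* read the sequence param set   */ break;
        case NALU_TYPE_PPS:      /* read the picture param set    */ break;
        default:                 /* AUD, EOSEQ, EOSTREAM, filler  */ break;
        }
    }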