M - AMD
M-JPEG Decoding Using OpenCL on Fusion
Guillaume de Bailliencourt | Morgan Multimedia | Manager

Session Description
This session presents a project whose goal is to decode and display an M-JPEG video stream using the GPU part of an AMD Fusion APU as much as possible. An M-JPEG video stream is a sequence of JPEG images. To decode each JPEG image, the bitstream passes through five stages: Huffman decoding, inverse quantization, inverse DCT (Discrete Cosine Transform), pixel upsampling, and color conversion. The CPU part of the AMD Fusion APU handles the first decoding stage. The GPU part of the APU performs all the other decoding stages and displays the image. OpenCL and DirectX are used to code the GPU part in a DirectShow environment.

Overview
M-JPEG applications
M-JPEG “standard”
DirectShow graph
Sources
Morgan M-JPEG Decoder
Video Renderers
• Common DirectShow Video Renderers
• Morgan DirectX 10 Video Renderer (supporting D3D10 Interop)
Timing & Benchmark
PSNR & SSIM
Usage examples & benefits

M-JPEG applications
Past & Present

Past (90’s)
H/W Capture & Editing
• Fast Screen Machine (still images)
• Targa 1000 & 2000 (Mac & PC)
• Matrox Rainbow Runner
• Miro DC10, DC20 & DC30
• Iomega Buz
• Professional solutions (Avid, GrassValley, …)
All were using H/W codecs (Zoran, ST Micro, TI, …)

Present
Non-MP4 Digicam & DSLR in video mode
• Up to 1080p 30fps
• MOV or AVI container
High-end DSLR in burst mode
• JPEG file sequence
Webcams & HD Webcams (Microsoft, Logitech)
• Chip streams M-JPEG data to USB (decoded in driver)
• Up to 1080p 30fps
• Up to 2.5K x 2K 10fps
IP Cameras
• Stream M-JPEG to IP network
Video & Digital Cinema Editing
• Transcoding DSLR sources
• Low-res proxy

M-JPEG “standard”
There’s no real standard
Microsoft OpenDML AVI File Format Extensions
• ‘MJPG’ FourCC
• Missing Huffman tables
• Interlaced if height > 288 (2 JPEG per
frame)
• Not respected in many Digicam ‘MJPG’ AVI files (HD progressive, Huffman tables, …)
QuickTime File Format Specification (MOV)
• Photo-JPEG
• MJPEG-A & MJPEG-B
  - Missing Huffman tables, or not …
  - Missing JPEG markers, or not …
Others
• Old H/W bitstreams ‘TVMJ’, ‘FLJP’, …
  - Little endian
  - Missing JPEG tables and markers
Complex “universal” parsing & Huffman decoding

DirectShow graph
Source → Decoder → Renderer

Sources
File Source
• Needs demux
  - AVI
  - MOV
  - Other container (MKV, JPEG file sequence, …)
AVStream
• Webcam / USB 2.0
• Other devices / Other buses
NetStream
• HTTP
• RTP / RTSP
• Other protocol

Morgan M-JPEG Decoder with GPU off-loading

Overview
JPEG decoding & display diagram
CPU part : C++, ASM & SIMD
GPU part : OpenCL & AML (AMD Media Library)
Overlapping CPU decoding & GPU decoding
Avoid Memory Transfer (Zero Memory Copy on APU)
Multithreaded decoding
Output to mapped host mem | device mem | D3D10 Interop

JPEG decoding & display diagram
Parsing + Huffman + De-zigzag → iQ + iDCT → Upsampling + Color Conversion + Scaling → Display

CPU part : C++, ASM & SIMD
C++ for core decoder object wrapper (multithreaded)
C++ for JPEG parser
ASM for Huffman decoder
• Output to small temp buffer, fits in cache
SIMD for mem transfer between CPU & GPU parts
SIMD (16-bit signed integer) for CPU iQ & iDCT (benchmark GPU vs CPU)
CPU optimized decoder vs in-box M-JPEG decoder
• x3 faster / 1 core
• x6 faster / 2 cores
• x9 faster / 3 cores
• x11 faster / 4 cores

GPU part : OpenCL & AML (AMD Media Library)
“AML is a library of OpenCL-based kernels that allow codec developers to use many degrees of freedom in implementing an optimized set of video encoders, decoders, and transcoders that use combinations of CPU, GPU/APU shaders, and GPU/APU dedicated hardware.”
Mike Schmit - Sr Manager, Video Software - AMD
“JpegDecode” AML kernels
• Inputs
  - Buffer of raw DCT coefficients (16-bit signed), 3 components, planar or interleaved supported
  - Quantization tables
  - Buffer description
• Do iQ & iDCT (32-bit float precision)
• Output
  - YCbCr buffer (8-bit unsigned)

Overlapping CPU & GPU decoding
Double-buffered input for kernel
Enqueue OpenCL async commands once CPU decoding is finished
Requires async sources

Avoid Memory Transfer (Zero Memory Copy on APU)
Memory Copy
Zero Memory Copy
• CPU decoding : CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD
• GPU decoding : CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR

Multithreaded decoding
N cores => N frames / N CPU decoding threads running in parallel
1 context, 1 device
N command queues
N kernel instances
Thread synchronisation
Respect frame order (out-of-order execution to in-order delivery)
Requires async sources for performance boost (transcoding)
At nominal frame rate or with sync sources, balances load over N cores

Output to mapped host mem | device mem | D3D10 Interop
Output to mapped host mem when
• Downstream filter needs host memory (CPU encoder)
• Downstream filter is using DirectX 7 or 9 surfaces (no more D3D9 Interop, DXVA Sharing undocumented)
• On APU, small impact (Zero Memory Copy)
Output to device mem when
• Downstream filter supports OpenCL (GPU encoder, GPU processing)
• Using MEDIASUBTYPE_AMLV (AML)
Output to D3D10 Interop when
• Using Morgan DirectX 10 Video Renderer
• Writing custom D3D10 processing/rendering code

Video Renderers

Common DirectShow Video Renderers (in-box)
Video Mixing Renderer 7 (VMR7)
• It allocates DDRAW 7 surfaces
• Morgan M-JPEG Decoder side
  - Optimized Lock/Unlock using
AM_GBF_NODDSURFACELOCK flag on GetBuffer()
  - Query IVMRSurface on sample
  - Call IVMRSurface::GetSurface to get LPDIRECTDRAWSURFACE7
  - Lock LPDIRECTDRAWSURFACE7
  - Copy kernel output (mapped host memory) to locked surface
  - Unlock surface

Video Mixing Renderer 9 (VMR9)
• It allocates D3D9 textures
• Morgan M-JPEG Decoder side
  - Optimized Lock/Unlock using AM_GBF_NODDSURFACELOCK flag on GetBuffer()
  - Query IVMRSurface9 on sample
  - Call IVMRSurface9::GetSurface to get IDirect3DSurface9
  - Call IDirect3DSurface9::LockRect
  - Copy kernel output (mapped host memory) to locked surface
  - Unlock surface

Enhanced Video Renderer (EVR)
• It allocates D3D9 textures
• Morgan M-JPEG Decoder side
  - Optimized Lock/Unlock using AM_GBF_NODDSURFACELOCK flag on GetBuffer()
  - Query IMFGetService on sample
  - Call IMFGetService::GetService to get IDirect3DSurface9
  - Call IDirect3DSurface9::LockRect
  - Copy kernel output (mapped host memory) to locked surface
  - Unlock surface

Morgan DirectX 10 Video Renderer
Without D3D10 Interop
• It allocates D3D10 textures (D3D10_USAGE_DYNAMIC, D3D10_CPU_ACCESS_WRITE)
• Morgan M-JPEG Decoder side
  - Optimized Map/Unmap using AM_GBF_NODDSURFACELOCK flag on GetBuffer()
  - Query IVMRSurface10 on sample
  - Call IVMRSurface10::GetTexture to get ID3D10Texture2D
  - Call ID3D10Texture2D::Map
  - Copy kernel output (mapped host mem) to mapped texture
  - Unmap texture

With D3D10 Interop
• It allocates D3D10 textures (D3D10_USAGE_DEFAULT, D3D10_RESOURCE_MISC_SHARED)
• Morgan M-JPEG Decoder side
  - Set AM_GBF_NODDSURFACELOCK | AMD_MM_GPU_USE_D3D10_INTEROP flag on GetBuffer()
  - Query IVMRSurface10 on sample
  - Call IVMRSurface10::GetTexture to get ID3D10Texture2D
  - Call clCreateFromD3D10Texture2DKHR with ID3D10Texture2D (one time)
  - Call clEnqueueAcquireD3D10ObjectsKHR
  - Call
clEnqueueCopyBufferToImage (copy kernel output to D3D10 texture)
  - Call clEnqueueReleaseD3D10ObjectsKHR

GPU Processing
• Accepts inputs > 8 bpc
• One or two pass
• Uses D3D10 pixel shaders
• 32-bit float precision
• Upsampler / Scaler
• YUV to RGB
• RGB range
• Chromatic adaptation (optional)
• Output to 8, 10 or 16 bpc

Timing & Benchmark

Overview
System setup & Reference clip
Timing
• Output to device mem
• Output to mapped host mem
• Output to D3D10 Interop
Benchmark
• GPU vs CPU

System setup & Reference clip
System setup
• Quad core Llano APU (no L3 cache, 4x1MB L2 cache)
• “CPU” clock 24x100MHz = 2.4GHz
• TurboCore (800MHz–2400MHz, can be even higher than 2.4GHz)
• RAM 8GB DDR3 @ 667x2 = 1333MHz
• “GPU” HD 6550D @ 594MHz (BeaverCreek)
• APU set to “Performance” mode (TurboCore policy)
• Win 7 x64
• AMD APP Profiler, use clEnqueueMarker to mark key points in timeline (CPU start/stop, Deliver, …)
• GraphStudio x64 (GraphEdit like)
Reference clip
• Shot by DSLR (Panasonic GF1, “Customized” firmware)
• 1080p 30fps 4:2:0 37Mb/s
• MOV container (Photo-JPEG)
• Played at full speed for timing & benchmark

Timing
[Timeline figures, one per configuration; frame labels such as (Frame n), (n+1), (n-1) and the Deliver / Render markers are not reproduced]
• Output to device mem, 1 CPU decoding thread / 1 core (in all cases input is mapped device mem)
• Output to device mem, 2 CPU decoding threads / 2 cores
• Output to device mem, 3 CPU decoding threads / 3 cores
• Output to device mem, 4 CPU decoding threads / 4 cores
• Output to mapped host mem / copy to downstream filter input buffer, 1 CPU decoding thread / 1 core
• Output to D3D10 Interop / Morgan DirectX 10 Video Renderer, 1 CPU decoding thread / 1 core. So far this isn’t as efficient as the two other output methods.

Benchmark GPU vs CPU
CPU outputs to host mem
GPU outputs to device mem
Connected to Null Renderer
Reference clip played at full speed

[Charts; only the in-box decoder labels are recoverable]
• In-box MJPEG Decoder : 34.96 ms (lower is better)
• In-box MJPEG Decoder : 29 fps (higher is better)
• In-box MJPEG Decoder : 25 % total

GPU vs CPU
vs optimized decoder
• x2.18 / 1 core
• x2.08 / 2 cores
• x2.11 / 3 cores
• x1.77 / 4 cores
vs in-box decoder
• x6.69 / 1 core
• x12.52 / 2 cores
• x18.52 / 3 cores
• x18.97 / 4 cores

PSNR & SSIM

Overview
Test setup & definitions
CPU only vs CPU+GPU decoding
Comparing to reference decoder (IJG / integer mode)
• CPU only vs Reference decoding
• CPU+GPU vs Reference decoding

Test setup & definitions
Same system, same reference clip
PSNR (Peak Signal-to-Noise Ratio)
• Ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation
• > 50 dB => very good
SSIM (Structural Similarity Index Metric)
• A method for measuring the similarity between two images
• Designed to improve on traditional methods like PSNR
• Near 1 => very good

CPU only vs CPU+GPU decoding
[Image comparison: CPU only | CPU+GPU]
[Image comparison, iDCT output: CPU only | CPU+GPU]
[Image comparison, iDCT output: (CPU+GPU) - (CPU only) | (CPU+GPU) / (CPU only)]
PSNR
• Overall : 51.2535 dB
SSIM
• Average : 0.9967392

CPU only vs Reference decoding
PSNR
• Overall : 62.4848 dB
SSIM
• Average : 0.99965868

CPU+GPU vs Reference decoding
PSNR
• Overall : 51.3943 dB
SSIM
• Average : 0.9968016

Usage examples & benefits

Overview
Transcoding
HD Webcam
Security & Surveillance
Heavy image processing

Transcoding
Connect to OpenCL Scaler (MEDIASUBTYPE_AMLV)
Connect to MP4/H264 OpenCL Encoder (AML based)
Connect
to MP4 muxer & file writer
Produce video for YouTube, Media Player box, iPhone, iPad, …
Almost all transcoding done on GPU
Benefit : Fast

HD Webcam applications (All done on GPU)
Picture processing (denoise, luma & chroma correction, …)
Face/eyes/lips tracking
Feature detection
Fun decoration / augmented reality
• 3D domain
• Pixel shaders
• Alpha blending
Video conferencing (connected to MP4/H264 OpenCL encoder)
Benefit : Fast, saves CPU, even with 1080p @ 30fps

Security & Surveillance applications (All done on GPU)
Source : IP Camera, “JPEG” Security Camera, Webcam
Intrusion detection
Recognition (face, car plate, …)
Benefit : Allows heavier real-time processing, saves CPU

Heavy image processing applications (All done on GPU)
Medical imagery
Science & research
Benefit : Allows heavy processing, saves CPU

Thank You, Questions ?

Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.
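
Supplementary note on the PSNR figures
The PSNR & SSIM slides define PSNR as the ratio between peak signal power and noise power, and quote overall values above 50 dB. A minimal sketch of the textbook formula for 8-bit samples follows. The function names `psnr` and `mse_from_psnr` are hypothetical, and the slides do not say how the measurement tool aggregates the "Overall" figure across frames or color components, so this is an illustration of the definition rather than a reproduction of the measurement.

```python
import math


def psnr(reference, decoded, peak=255):
    """PSNR in dB between two equally sized sequences of samples.

    peak is the largest possible sample value (255 for 8-bit data).
    """
    if len(reference) != len(decoded) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10.0 * math.log10(peak * peak / mse)


def mse_from_psnr(psnr_db, peak=255):
    """Invert the formula: the mean squared error implied by a PSNR value."""
    return peak * peak / 10.0 ** (psnr_db / 10.0)


if __name__ == "__main__":
    ref = [16, 128, 235, 64]
    out = [17, 127, 236, 63]  # every sample is off by exactly one level
    print(round(psnr(ref, out), 2))
    # Mean squared errors implied by two of the values quoted in the slides.
    print(round(mse_from_psnr(51.2535), 3))
    print(round(mse_from_psnr(62.4848), 3))
```

Under this definition, the 51.25 dB quoted for CPU+GPU vs CPU only corresponds to a mean squared error of roughly 0.49 on 8-bit samples, and the 62.48 dB quoted for CPU only vs the reference decoder to roughly 0.04. In both cases the average error is well under one quantization level per sample.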