M - AMD

M-JPEG Decoding Using OpenCL on Fusion
Guillaume de Bailliencourt | Morgan Multimedia | Manager
Session Description
This session presents a project with the goal of decoding and displaying
an M-JPEG video stream using the GPU part of an AMD Fusion APU as
much as possible. An M-JPEG video stream is composed of a sequence of
JPEG images. To decode each JPEG image, the bitstream needs to pass
through five stages: Huffman decoding, inverse quantization, inverse DCT
(Discrete Cosine Transform), pixel upsampling, and color conversion. The
CPU part of the AMD Fusion APU handles the first decoding stage. The
GPU part of the APU performs all the other decoding stages and displays
the image. OpenCL and DirectX are used to code the GPU part in a
DirectShow environment.
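As an illustration of the pixel upsampling stage, here is a minimal Python sketch that expands a 4:2:0 chroma plane to full resolution by sample replication. The slides do not say which upsampling filter the decoder uses, so nearest-neighbour replication and the function name `upsample_420_nearest` are assumptions made for this example only.

```python
def upsample_420_nearest(chroma, width, height):
    """Replicate each chroma sample over a 2x2 block of luma positions.

    `chroma` is a list of rows at half resolution (rounded up);
    the result has `height` rows of `width` samples.
    """
    return [
        [chroma[y // 2][x // 2] for x in range(width)]
        for y in range(height)
    ]


# A 2x2 chroma plane covering a 4x4 luma area.
cb = [[100, 120],
      [140, 160]]
full = upsample_420_nearest(cb, 4, 4)
```

A production decoder would more likely interpolate between neighbouring samples; replication is simply the smallest correct instance of the stage.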
Overview
M-JPEG applications
M-JPEG “standard”
DirectShow graph
Sources
Morgan M-JPEG Decoder
Video Renderers
• Common DirectShow Video Renderers
• Morgan DirectX 10 Video Renderer (supporting D3D10 Interop)
Timing & Benchmark
PSNR & SSIM
Usage examples & benefits
M-JPEG applications Past & Present
Past (90’s)
H/W Capture & Editing
• Fast Screen Machine (still images)
• Targa 1000 & 2000 (Mac & PC)
• Matrox Rainbow Runner
• Miro DC10, DC20 & DC30
• Iomega Buz
• Professional solutions (Avid, GrassValley, …)
All were using H/W codec (Zoran, ST Micro, Ti …)
Present
Non MP4 Digicam & DSLR in video mode
• Up to 1080p 30fps
• MOV or AVI container
High-end DSLR in burst mode
• JPEG files sequence
Webcams & HD Webcams (Microsoft, Logitech)
• Chip streams M-JPEG data to USB (decoded in driver)
• Up to 1080p 30fps
• Up to 2.5K x 2K 10fps
IP Cameras
• Streams M-JPEG to IP network
Video & Digital Cinema Editing
• Transcoding DSLR sources
• Low-res proxy
M-JPEG “standard”
There’s no real standard
Microsoft OpenDML AVI File Format Extensions
• ‘MJPG’ FourCC
• Missing Huffman tables
• Interlaced if height > 288 (2 JPEG per frame)
• Not respected in many Digicam ‘MJPG’ AVI files (HD progressive, Huffman tables, …)
QuickTime File Format Specification (MOV)
• Photo-JPEG
• MJPEG-A & MJPEG-B
- Missing Huffman tables, or not …
- Missing JPEG markers, or not …
Others
• Old h/w bitstreams ‘TVMJ’, ‘FLJP’, …
- Little endian
- Missing JPEG tables and markers
Complex “universal” parsing & Huffman decoding
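One part of such "universal" parsing is deciding whether a frame carries its own Huffman tables. The Python sketch below walks the JPEG marker segments and reports whether a DHT segment (marker 0xFFC4) appears before the first start-of-scan. It assumes a frame that still has regular JPEG markers, so it would not apply as-is to the variants listed above that drop markers; the function name `has_huffman_tables` is hypothetical.

```python
import struct

SOI, DHT, SOS, EOI = 0xD8, 0xC4, 0xDA, 0xD9


def has_huffman_tables(jpeg: bytes) -> bool:
    """Return True if a DHT segment appears before the first SOS marker."""
    if len(jpeg) < 2 or jpeg[0] != 0xFF or jpeg[1] != SOI:
        raise ValueError("not a JPEG stream (missing SOI)")
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected marker at offset {pos}")
        marker = jpeg[pos + 1]
        if marker == 0xFF:          # fill byte before a marker, skip it
            pos += 1
            continue
        if marker == DHT:
            return True
        if marker in (SOS, EOI):
            return False
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        pos += 2 + length
    return False
```

When the function returns False, a decoder typically falls back to a default set of Huffman tables; which tables to use is outside the scope of this sketch.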
DirectShow graph
Source → Decoder → Renderer
Sources
File Source
• Need Demux
- AVI
- MOV
- Other container (MKV, JPEG file sequence, …)
AVStream
• Webcam / USB 2.0
• Other devices / Other buses
NetStream
• HTTP
• RTP / RTSP
• Other protocol
Morgan M-JPEG Decoder with GPU off-loading
Overview
JPEG decoding & display diagram
CPU part : C++, ASM & SIMD
GPU part : OpenCL & AML (AMD Media Library)
Overlapping CPU decoding & GPU decoding
Avoid Memory Transfer (Zero Memory Copy on APU)
Multithreaded decoding
Output to mapped host mem | device mem | D3D10 Interop
JPEG decoding & display diagram
Parsing + Huffman + De-zigzag → iQ + iDCT → Upsampling + Color Conversion + Scaling → Display
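The de-zigzag step at the start of this pipeline can be sketched in Python as follows. The scan order is generated by walking the anti-diagonals of the 8x8 block rather than typed in as a table; the function names are chosen for this example and are not taken from the decoder.

```python
def zigzag_order(n=8):
    """Return raster indices in zigzag scan order for an n x n block."""
    order = []
    for s in range(2 * n - 1):          # s = row + col, one anti-diagonal at a time
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)       # even diagonals run bottom-left to top-right
        order.extend(r * n + (s - r) for r in rows)
    return order


ZIGZAG = zigzag_order()


def dezigzag(coeffs):
    """Place 64 coefficients read in zigzag order back into raster order."""
    if len(coeffs) != 64:
        raise ValueError("expected 64 coefficients")
    block = [0] * 64
    for scan_pos, raster_pos in enumerate(ZIGZAG):
        block[raster_pos] = coeffs[scan_pos]
    return block
```

Doing this reordering on the CPU side means the later stages can read coefficients in plain row-major order.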
CPU part : C++, ASM & SIMD
C++ for core decoder object wrapper (multithreaded)
C++ for JPEG parser
ASM for Huffman decoder
• Output to small temp buffer, fits in cache
SIMD for mem transfer between CPU & GPU parts
SIMD (16-bit signed integer) for CPU iQ & iDCT (used for the GPU vs CPU benchmark)
CPU optimized decoder vs in-box M-JPEG decoder
• x3 faster / 1 core
• x6 faster / 2 cores
• x9 faster / 3 cores
• x11 faster / 4 cores
GPU part : OpenCL & AML (AMD Media Library)
“AML is a library of OpenCL-based kernels that allow codec developers to use
many degrees of freedom in implementing an optimized set of video encoders,
decoders, and transcoders that use combinations of CPU, GPU/APU shaders,
and GPU/APU dedicated hardware.” Mike Schmit - Sr Manager, Video Software - AMD
“JpegDecode” AML kernels
• Inputs
- Buffer of raw DCT coefficients (16-bit signed), 3 components, Planar or Interleaved supported
- Quantization Tables
- Buffer description
• Do iQ & iDCT (32-bit float precision)
• Output
- YCbCr buffer (8-bit unsigned)
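To make the iQ & iDCT step concrete, here is a reference-style Python sketch for a single 8x8 block: multiply each coefficient by its quantization step, apply the direct (non-factored) floating-point inverse DCT, then level-shift and clamp to 8 bits. It is not the AML kernel, whose source is not shown in the slides; it assumes both the coefficients and the quantization table are already in raster order, and the function name is hypothetical.

```python
import math


def idct_8x8_dequant(coeffs, qtable):
    """Dequantize a raster-order 8x8 block and apply a direct float iDCT.

    Returns 64 samples level-shifted by +128 and clamped to 0..255.
    """
    deq = [c * q for c, q in zip(coeffs, qtable)]
    # cos_tab[x][u] = cos((2x + 1) * u * pi / 16)
    cos_tab = [[math.cos((2 * x + 1) * u * math.pi / 16) for u in range(8)]
               for x in range(8)]
    scale = [1 / math.sqrt(2)] + [1.0] * 7
    out = []
    for y in range(8):
        for x in range(8):
            acc = 0.0
            for v in range(8):
                for u in range(8):
                    acc += (scale[u] * scale[v] * deq[v * 8 + u]
                            * cos_tab[x][u] * cos_tab[y][v])
            sample = round(acc / 4) + 128
            out.append(max(0, min(255, sample)))
    return out
```

A block that contains only a DC coefficient decodes to a flat block whose value is the dequantized DC divided by 8, plus 128. Real implementations use a factored iDCT instead of this four-level loop.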
Overlapping CPU & GPU decoding
Double buffered input for kernel
Enqueue OpenCL async commands once CPU decoding is finished
Requires async sources
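The double-buffering idea can be sketched without any GPU API. In the Python example below, one thread stands in for the CPU decoding stage and the calling thread stands in for the stage that enqueues GPU work; two buffers circulate between them so that one can be filled while the other is being consumed. The names `run_double_buffered`, `cpu_stage` and `gpu_stage` are illustrative and do not come from the decoder.

```python
import queue
import threading


def run_double_buffered(frames, cpu_stage, gpu_stage):
    """Run cpu_stage and gpu_stage on overlapping frames using two buffers."""
    free = queue.Queue()
    ready = queue.Queue()
    for _ in range(2):                  # two input buffers: one filling, one in flight
        free.put([None])
    results = []

    def producer():
        for frame in frames:
            buf = free.get()            # blocks while both buffers are in use
            buf[0] = cpu_stage(frame)
            ready.put(buf)
        ready.put(None)                 # end-of-stream sentinel

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        buf = ready.get()
        if buf is None:
            break
        results.append(gpu_stage(buf[0]))
        free.put(buf)                   # hand the buffer back to the CPU stage
    worker.join()
    return results
```

With only two buffers the CPU stage can run at most one frame ahead, which bounds memory use while still letting both stages work at the same time.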
Avoid Memory Transfer (Zero Memory Copy on APU)
Memory Copy
Zero Memory Copy
• CPU decoding : CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD
• GPU decoding : CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR
Multithreaded decoding
N Cores => N Frames / N CPU decoding Threads running in parallel
1 context, 1 device
N command queues
N kernel instances
Thread synchronisation
Respect frame order (Out of order execution to In order delivery)
Requires async sources for performance boost (transcoding)
At nominal frame rate or with sync sources, balances load over N cores
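Turning out-of-order completion into in-order delivery can be done with a small reorder buffer. The Python sketch below holds finished frames until every earlier frame has been delivered. It is a single-threaded illustration: a real decoder with N threads would need a lock around `complete`, and the class name `InOrderDelivery` is an assumption of this example.

```python
class InOrderDelivery:
    """Buffer frames finished out of order and release them in frame order."""

    def __init__(self, first_index=0):
        self._next = first_index        # index of the next frame to deliver
        self._pending = {}              # finished frames waiting for earlier ones

    def complete(self, index, frame):
        """Record a finished frame; return the frames now ready to deliver."""
        self._pending[index] = frame
        ready = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready


delivery = InOrderDelivery()
first = delivery.complete(1, "frame 1")    # frame 0 not done yet, nothing delivered
second = delivery.complete(0, "frame 0")   # releases frame 0 and frame 1 together
```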
Output to mapped host mem | device mem | D3D10 Interop
Output to mapped host mem when
• Downstream filter needs host memory (CPU encoder)
• Downstream filter is using DirectX 7 or 9 surfaces (no more D3D9 Interop, DXVA Sharing undocumented)
• On APU, small impact (Zero Memory Copy)
Output to device mem when
• Downstream filter supports OpenCL (GPU encoder, GPU processing)
• Using MEDIASUBTYPE_AMLV (AML)
Output to D3D10 Interop when
• Using Morgan DirectX 10 Video Renderer
• Writing custom D3D10 processing/rendering code
Video Renderers
Common DirectShow Video Renderers (in-box)
Video Mixing Renderer 7 (VMR7)
• It allocates DDRAW 7 Surfaces
• Morgan M-JPEG Decoder side
- Optimized Lock/Unlock using AM_GBF_NODDSURFACELOCK flag on GetBuffer()
- Query IVMRSurface on sample
- Call IVMRSurface::GetSurface to get LPDIRECTDRAWSURFACE7
- Lock LPDIRECTDRAWSURFACE7
- Copy kernel output (mapped host memory) to locked surface
- Unlock surface
Video Mixing Renderer 9 (VMR9)
• It allocates D3D9 Textures
- Query IVMRSurface9 on sample
- Call IVMRSurface9::GetSurface to get IDirect3DSurface9
- Call IDirect3DSurface9::LockRect
- Unlock surface
Enhanced Video Renderer (EVR)
• It allocates D3D9 Textures
- Query IMFGetService on sample
- Call IMFGetService::GetService to get IDirect3DSurface9
- Call IDirect3DSurface9::LockRect
- Unlock surface
Morgan DirectX 10 Video Renderer
Without D3D10 Interop
• It allocates D3D10 Textures (D3D10_USAGE_DYNAMIC, D3D10_CPU_ACCESS_WRITE)
- Optimized Map/Unmap using AM_GBF_NODDSURFACELOCK flag on GetBuffer()
- Call IVMRSurface10::GetTexture to get ID3D10Texture2D
- Call ID3D10Texture2D::Map
- Copy kernel output (mapped host mem) to mapped texture
- Unmap Texture
With D3D10 Interop
• It allocates D3D10 Textures (D3D10_USAGE_DEFAULT, D3D10_RESOURCE_MISC_SHARED)
- Set AM_GBF_NODDSURFACELOCK | AMD_MM_GPU_USE_D3D10_INTEROP flag on GetBuffer()
- Call IVMRSurface10::GetTexture to get ID3D10Texture2D
- Call clCreateFromD3D10Texture2DKHR with ID3D10Texture2D (one time)
- Call clEnqueueAcquireD3D10ObjectsKHR
- Call clEnqueueCopyBufferToImage (Copy kernel output to D3D10 Texture)
- Call clEnqueueReleaseD3D10ObjectsKHR
GPU Processing
• Accepts inputs > 8 bpc
• One or Two pass
• Uses D3D10 Pixel Shaders
• 32-bit float precision
• Upsampler / Scaler
• YUV to RGB
• RGB range
• Chromatic adaptation (optional)
• Output to 8, 10 or 16 bpc
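For reference, the YUV to RGB step for JPEG-style data can be written as the per-pixel Python function below, which uses the full-range BT.601 coefficients from the JFIF convention. The renderer does this in pixel shaders and its exact matrix and RGB range handling are not given in the slides, so treat the constants and the function name as assumptions for illustration.

```python
def ycbcr_to_rgb(y, cb, cr):
    """Convert one full-range 8-bit YCbCr sample to 8-bit RGB (JFIF / BT.601)."""
    def clamp(v):
        return max(0, min(255, round(v)))

    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)
```

Neutral chroma (Cb = Cr = 128) leaves the luma value unchanged in all three channels, which is a quick sanity check for any implementation of this stage.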
Timing & Benchmark
Overview
System setup & Reference clip
Timing
• Output to device mem
• Output to mapped host mem
• Output to D3D10 Interop
Benchmark
• GPU vs CPU
System setup & Reference clip
System setup
• Quad core Llano APU (no L3 cache, 4x1MB L2 cache)
• “CPU” clock 24x100MHz = 2.4GHz
• TurboCore (800MHz–2400MHz, can be even higher than 2.4GHz)
• RAM 8GB DDR3 @ 667x2 = 1333MHz
• “GPU” HD 6550D @ 594MHz (BeaverCreek)
• APU set to “Performance” mode (TurboCore policy)
• Win 7 x64
• AMD APP Profiler, use clEnqueueMarker to mark key points in timeline (CPU start/stop, Deliver, …)
• GraphStudio x64 (GraphEdit like)
Reference clip
• Shot by DSLR (Panasonic GF1, “Customized” firmware)
• 1080p 30fps 4:2:0 37Mb/s
• MOV container (Photo-JPEG)
• Played at full speed for timing & benchmark
Timing
[Timeline figure] Output to device mem (in all cases, input is mapped device mem), 1 CPU decoding thread / 1 core
[Timeline figure] Output to device mem, 2 CPU decoding threads / 2 cores
[Two further timeline figures; captions not recovered]
[Timeline figure] Output to mapped host mem / copy to downstream filter input buffer
[Timeline figure] Output to D3D10 Interop / Morgan DirectX 10 Video Renderer
• So far, this isn’t as efficient as the two other output methods
Benchmark
GPU vs CPU
CPU outputs to host mem
GPU outputs to device mem
Connected to Null Renderer
Reference clip played at full speed
[Chart, lower is better] In-box MJPEG Decoder : 34.96 ms
[Chart, higher is better] In-box MJPEG Decoder : 29 fps
[Chart] In-box MJPEG Decoder : 25 % Total
vs optimized decoder
• x2.18 / 1 core
• x2.08 / 2 cores
• x2.11 / 3 cores
• x1.77 / 4 cores
vs in-box decoder
• x6.69 / 1 core
• x12.52 / 2 cores
• x18.52 / 3 cores
• x18.97 / 4 cores
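The two in-box decoder figures quoted in the benchmark (34.96 ms and 29 fps) can be checked against each other with a short Python calculation. It assumes that the millisecond figure is the average time per frame, which the slides do not state explicitly; the helper names are hypothetical.

```python
def fps_from_frame_time(ms_per_frame):
    """Convert an average per-frame time in milliseconds to frames per second."""
    if ms_per_frame <= 0:
        raise ValueError("frame time must be positive")
    return 1000.0 / ms_per_frame


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


in_box_fps = fps_from_frame_time(34.96)   # in-box decoder figure from the slides
```

Under that assumption, 34.96 ms per frame corresponds to about 28.6 fps, which rounds to the 29 fps quoted.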
PSNR & SSIM
Overview
Test setup & definitions
CPU only vs CPU+GPU decoding
Comparing to reference decoder (IJG / integer mode)
• CPU only vs Reference decoding
• CPU+GPU vs Reference decoding
Test setup & definitions
Same system, same reference clip
PSNR (Peak Signal-to-Noise Ratio)
• Ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation
• > 50 dB => very good
SSIM (Structural Similarity Index Metric)
• A method for measuring the similarity between two images
• Designed to improve on traditional methods like PSNR
• Near 1 => very good
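Both metrics can be sketched in a few lines of Python. The PSNR function follows the usual definition for 8-bit samples. The SSIM function is a deliberately simplified variant that treats the whole image as a single window, whereas published SSIM implementations average over many local windows, so its values are not directly comparable with averaged SSIM figures. The function names and the flat-list image representation are assumptions of this example.

```python
import math


def psnr(a, b, peak=255.0):
    """PSNR in dB between two equal-length sequences of samples."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10 * math.log10(peak * peak / mse)


def ssim_global(a, b, peak=255.0):
    """SSIM over one window covering all samples (no sliding window)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2   # usual stabilising constants
    return ((2 * mean_a * mean_b + c1) * (2 * cov + c2)) / (
        (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2))
```

Identical inputs give an infinite PSNR and an SSIM of 1; any difference lowers both.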
CPU only vs CPU+GPU decoding (iDCT output)
[Figure panels: CPU only | CPU+GPU]
[Figure panels: (CPU+GPU) - (CPU only) | (CPU+GPU) / (CPU only)]
PSNR
• Overall : 51.2535 dB
SSIM
• Average : 0.9967392
CPU only vs Reference decoding
PSNR
SSIM
• Average : 0.99965868
CPU+GPU vs Reference decoding
PSNR
SSIM
• Average : 0.9968016
Usage examples & benefits
Overview
Transcoding
HD Webcam
Security & Surveillance
Heavy image processing
Transcoding
Connect to OpenCL Scaler (MEDIASUBTYPE_AMLV)
Connect to MP4/H264 OpenCL Encoder (AML based)
Connect to MP4 muxer & file writer
Produce video for YouTube, Media Player box, iPhone, iPad, …
Almost all transcoding done on GPU
Benefit : Fast
HD Webcam applications (All done on GPU)
Picture processing (denoise, luma & chroma correction, …)
Face/eyes/lips tracking
Feature detection
Fun decoration / augmented reality
• 3D domain
• Pixel shaders
• Alpha blending
Video conferencing (connected to MP4/H264 OpenCL encoder)
Benefit : Fast, saves CPU, even with 1080p @ 30fps
Security & Surveillance applications (All done on GPU)
Source : IP Camera, “JPEG” Security Camera, Webcam
Intrusion detection
Recognition (face, car plate, …)
Benefit : Allows heavier real-time processing, saves CPU
Heavy image processing applications (All done on GPU)
Medical imagery
Science & research
Benefit : Allows heavy processing, saves CPU
Thank You, Questions ?
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not
limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases,
product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT,
SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED
HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD
is not responsible for the content herein and no endorsements are implied.