Optimizing Memory Bandwidth in OpenVX Graph
Accelerating OpenVX Applications on Embedded Many-Core Accelerators

Giuseppe Tagliavini, DEI (University of Bologna)
Germain Haugou, IIS-ETHZ
Andrea Marongiu, DEI (University of Bologna) & IIS-ETHZ
Luca Benini, DEI (University of Bologna) & IIS-ETHZ

Outline
Introduction
OpenVX acceleration
Work in progress

OpenVX overview
Foundational API for vision acceleration
– Focus on mobile and embedded systems
Stand-alone or complementary to other libraries
Enables efficient implementations on different devices
– CPUs, GPUs, DSPs, many-core accelerators

Accelerator template
[Figure: accelerator template block diagram. Legend:]
Host: multi-core processor
Many-core accelerator: cluster-based design
L3: DDR3 memory
L2: cluster memory (optional)
SoC design (optional)
PE: MPMD processing elements
CC: cluster controller (optional)
L1: low-latency shared TCDM memory
DMA: DMA engine (L1 ↔ L3)
HWS: HW synchronizer

PULP
Parallel Ultra-Low-Power platform
[Figure: PULP block diagram. SRAM voltage domain (0.5V – 0.8V): banks SRAM #0 … SRAM #M-1 and SCM #0 … SCM #M-1, each with an RMU, behind a low-latency interconnect. Cluster voltage domain (0.5V – 0.8V): PE #0 … PE #N-1, each with an I$ on the instruction bus, plus DMA, bridges, and the cluster bus. SoC voltage domain (0.8V): L2 memory, peripherals, bridges, and the peripheral interconnect.]

OpenVX programming model
The OpenVX model is based on a directed acyclic graph of nodes (kernels), with data (images) as linkage.

    vx_image imgs[] = {
        vxCreateImage(ctx, width, height, VX_DF_IMAGE_RGB),
        /* virtual image: width and height of 0 are left for the graph to resolve */
        vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_U8),
        …
        vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8),
    };
    vx_node nodes[] = {
        vxColorConvertNode(graph, imgs[0], imgs[1]),
        vxSobel3x3Node(graph, imgs[1], imgs[2], imgs[3]),
        vxMagnitudeNode(graph, imgs[2], imgs[3], imgs[4]),
        vxThresholdNode(graph, imgs[4], thresh, imgs[5]),
    };
    vxVerifyGraph(graph);
    vxProcessGraph(graph);

Virtual images are not required to actually reside in main memory
– They define a data dependency between kernels, but they cannot be read or written
– They are the main target of our optimization efforts
An OpenVX program must be verified to guarantee some mandatory properties:
– Inputs and outputs are compliant with the node interface
– No cycles in the graph
– Only a single writer node is allowed for any data object
– Writes have higher priority than reads
– Virtual images must be resolved into concrete types

A first solution: using OpenCL to accelerate OpenVX kernels
OpenCL is a commonly used programming model for many-core accelerators
First solution: OpenVX kernel == OpenCL kernel
– When a node is selected for execution, the related OpenCL kernel is enqueued on the device
Main limiting factor: memory bandwidth

OpenCL bandwidth
Experiments performed on the STHORM evaluation board
[Figure: bar chart, logarithmic axis in MB/s (1 to 10000). Series: OpenCL, Available BW.]

Our solution
We built an OpenVX framework for many-core accelerators that couples a tiling approach with algorithms for graph partitioning and scheduling
Main goals:
– Reduce the memory bandwidth
– Maximize the accelerator efficiency
Several steps are required:
– Tile size propagation
– Graph partitioning
– Node scheduling
– Buffer allocation
– Buffer sizing

Common access patterns for image processing kernels
(A) POINT OPERATORS
Compute the value of each output point from the corresponding input point
Support: Basic tiling
(B) LOCAL NEIGHBOR OPERATORS
Compute the value of a point in the output image from the corresponding input window
Support: Tile overlapping
(C) RECURSIVE NEIGHBOR OPERATORS
Like the previous ones, but also consider the previously computed values in the output window
Support: Persistent buffer
(D) GLOBAL OPERATORS
Compute the value of a point in the output image using the whole input image
Support: Host exec / Graph partitioning
(E) GEOMETRIC OPERATORS
Compute the value of a point in the output image using a non-rectangular input window
Support: Host exec / Graph partitioning
(F) STATISTICAL OPERATORS
Compute any statistical function of the image points
Support: Graph partitioning

Example (1)
[Figure: example graph annotated with memory domains (L3, L1). Labels: nested graph, accelerator sub-graph, adaptive tiling, host node.]

Example (2)
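As a rough illustration of the double buffering used in this example, the sketch below processes an image in row-stripe tiles with two input buffers, so that the transfer of the next tile can be issued before the current one is computed. It is not the framework's code: `dma_copy` is a hypothetical stand-in that uses a blocking `memcpy` where a real platform would issue an asynchronous DMA transfer, and `TILE_ROWS`, `MAX_WIDTH`, and `threshold_tiled` are names invented for this sketch. Only the input side is double-buffered here, and the kernel is a simple point operator (threshold).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TILE_ROWS = 4 };   /* hypothetical tile height, in rows */
enum { MAX_WIDTH = 64 };  /* hypothetical widest row that fits the L1 buffers */

/* Hypothetical stand-in for an L3 <-> L1 DMA transfer. A real platform
   would start an asynchronous transfer here and wait for it later. */
static void dma_copy(uint8_t *dst, const uint8_t *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Point operator applied to one tile resident in L1. */
static void threshold_tile(const uint8_t *in, uint8_t *out, size_t n, uint8_t t)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (in[i] > t) ? 255 : 0;
}

/* Rows available in the tile that starts at `row`. */
static size_t tile_rows(size_t row, size_t height)
{
    return (height - row < TILE_ROWS) ? height - row : TILE_ROWS;
}

/* Thresholds a width x height 8-bit image tile by tile.
   Returns 0 on success, -1 if a row does not fit the L1 buffers. */
int threshold_tiled(const uint8_t *src, uint8_t *dst,
                    size_t width, size_t height, uint8_t t)
{
    static uint8_t l1_in[2][TILE_ROWS * MAX_WIDTH]; /* double-buffered input */
    static uint8_t l1_out[TILE_ROWS * MAX_WIDTH];   /* single output buffer */

    if (width > MAX_WIDTH)
        return -1;

    size_t ntiles = (height + TILE_ROWS - 1) / TILE_ROWS;
    if (ntiles == 0)
        return 0;

    /* Prologue: fetch the first tile into input buffer 0. */
    dma_copy(l1_in[0], src, tile_rows(0, height) * width);

    for (size_t i = 0; i < ntiles; ++i) {
        size_t row = i * TILE_ROWS;
        size_t rows = tile_rows(row, height);

        /* Prefetch the next tile into the buffer not used by this iteration. */
        if (i + 1 < ntiles) {
            size_t next = row + TILE_ROWS;
            dma_copy(l1_in[(i + 1) & 1], src + next * width,
                     tile_rows(next, height) * width);
        }

        /* Compute on the current tile, then write the result back to L3. */
        threshold_tile(l1_in[i & 1], l1_out, rows * width, t);
        dma_copy(dst + row * width, l1_out, rows * width);
    }
    return 0;
}
```

On a real accelerator the prefetch and the computation would run concurrently; here they run back to back, so the sketch shows the buffer rotation rather than the overlap in time.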
[Figure: execution timeline. Rows: PEs, CC, DMAin, DMAout, L1 buffers B0 to B5, L3 access. Horizontal axis: time.]
L1 memory buffers B0 and B5 use double buffering

Bandwidth reduction
[Figure: bar chart, logarithmic axis in MB/s (1 to 10000). Series: OVX, OpenCL, Available BW.]

Speed-up w.r.t. OpenCL
[Figure: bar chart of speed-up per benchmark, axis from 0.00 to 12.00, bar labels between 1.00 and 9.61. Benchmarks: Random graph, Edge detector, Object detection, Super resolution, FAST9, Disparity, Pyramid, Optical, Canny, Retina preproc., Disparity S4.]

OpenVX + Virtual Platform
[Figure: ADRENALINE. Test applications are mapped onto a virtual platform (PE0 … PEn, memory). Labels: application mapping, platform configuration, run-time support, run-time policies.]
http://www-micrel.deis.unibo.it/adrenaline/

THANKS!!!
Work supported by EU-funded research projects