INVITED KEY TALKS
Connected Components Revisited on Kepler
Gernot Ziegler – NVIDIA
Locating connected regions in images and volumes is a fundamental
building block of image and volume processing pipelines. We demonstrate
how the connected-components problem strongly benefits from a new
feature of the Kepler architecture: direct thread-to-thread data exchange
via the SHUFFLE instruction.
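As an illustration of the idiom only (a plain-Python model of the hardware behaviour, not the CUDA intrinsic itself): `shfl_down` below is a hypothetical stand-in for `__shfl_down`, and the minimum operation stands in for the label-minimisation step typical of connected-components labelling.

```python
WARP_SIZE = 32

def shfl_down(values, delta):
    # Model of the SHUFFLE instruction: lane i reads the register of lane
    # i + delta; lanes past the warp end keep their own value.
    return [values[i + delta] if i + delta < WARP_SIZE else values[i]
            for i in range(WARP_SIZE)]

def warp_reduce_min(values):
    # log2(32) = 5 shuffle rounds; no shared memory is touched, which is
    # exactly what makes the Kepler idiom attractive.
    offset = WARP_SIZE // 2
    while offset > 0:
        neighbour = shfl_down(values, offset)
        values = [min(a, b) for a, b in zip(values, neighbour)]
        offset //= 2
    return values[0]  # lane 0 ends up holding the warp-wide minimum label
```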
GCN Architecture, HSA Platform Evolution and the AMD
Developer Ecosystem
Benjamin Coquelle – AMD
This presentation will introduce the Graphics Core Next architecture and
how one can achieve maximum performance on such an architecture.
To that end we will go through actual OpenCL examples and show
how OpenCL is mapped to AMD GPUs. Most of the hardware components
will be covered (scheduler, compute unit architecture, cache access, etc.)
through different examples. We will also briefly talk about AMD's tool for
debugging and profiling code, CodeXL. Finally, we will present how you can
use HSA to easily take advantage of a heterogeneous platform.
One Can Simply Walk Into Heterogeneous Parallelism
Alex Voicu – Microsoft
“Parallel programming is hard!” has become a seldom-questioned truism
– but what lies beyond this statement? This presentation focuses on
clarifying exactly what it takes to write clean, efficient code which can run
on various classes of accelerators that can be subsumed under a common
abstract machine model. Starting from two fundamental parallel
programming primitives, reduce and scan, we introduce what we regard as
the minimal yet sufficient programming model for heterogeneous
parallelism. Along the way, we try to establish the currently dominant
“schools of thought”, outlining their medium- and long-term
shortcomings. Upon concluding, we should be in a position to re-assess
the introductory truism, and dismiss it as, at best, temporary.
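For concreteness, the two primitives named above can be sketched in a few lines of plain Python. This is a sequential model of the data-parallel Hillis-Steele inclusive scan (each list comprehension below corresponds to one parallel step on an accelerator), with reduction obtained as the last element of the scan.

```python
def inclusive_scan(xs, op):
    # Hillis-Steele inclusive scan: O(log n) rounds of pairwise combines.
    n = len(xs)
    step = 1
    while step < n:
        # Every position i >= step combines with its neighbour at i - step;
        # on a GPU this whole line is one parallel step.
        xs = [xs[i] if i < step else op(xs[i - step], xs[i]) for i in range(n)]
        step *= 2
    return xs

def reduce_(xs, op):
    # A reduction is simply the final element of an inclusive scan.
    return inclusive_scan(xs, op)[-1]
```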
SYCL: An Abstraction Layer for Leveraging C++ and OpenCL
Alastair Murray – Codeplay Software Ltd., Edinburgh
The performance and power advantages of parallel hardware have
caused it to be adopted in domains ranging from embedded systems to
data centres. This means that increasingly complex software must be run
on parallel hardware, which in turn has led to an increase in the desire
from developers for more powerful but simpler parallel programming
languages and models.
The SYCL abstraction layer is a solution to this problem. It adds the power
and flexibility of C++ to the performance and efficiency of the existing
OpenCL ecosystem. SYCL uses a shared source design to allow various
powerful C++ features, such as templates or inheritance, on
heterogeneous devices without the need for a separate device language.
This talk will describe the SYCL abstraction layer and how it can be used
to create complex programs or libraries. Codeplay's own implementation
will also be discussed, including looking at the use of OpenCL SPIR to
provide portability.
SPECIAL TOPICAL TALKS
Methodology for Heterogeneous Programming on OpenCL
and Its Applications
Olga Sudareva, Pavel Bogdanov – Institute of System Research,
Russian Academy of Sciences, Moscow (NIISI RAS)
During the last few years, hybrid computers based on massively parallel
accelerators have gained great popularity. The most common massively
parallel accelerators are GPGPUs (general-purpose graphics processing
units). At the same time, Intel MIC appears promising. In the latest Top500
list (November 2014), first place is held by Tianhe-2 [1] with 55 PFlops,
based on Intel MIC, and second place by Cray Titan [2] with 27 PFlops,
based on NVidia Tesla K20X.
However, there is still no general programming model which could be
used to program all processors of one node. Therefore, there is no general
model to program the entire distributed system. Investigations of such
models are being conducted. A classification of current heterogeneous
programming models, such as StarPU [3], OpenACC [4], OmpSs [5],
OpenMP 4.0 [6], VexCL [7], SYCL [8], etc., will be given.
It is notable that, in contemporary scientific and applied engineering
software development, one can seldom find theoretical estimates
of the computational complexity of algorithms and their expected
performance on the target hardware platforms.
We propose a method for estimating the expected performance of a task
prior to the implementation stage, which allows one to decide whether
solving the task on such systems is sensible. We thereupon propose a
technique for analysis of the original problem for a compute node model
and the distributed system as a whole within the formalized notation of the
OpenCL standard. Models are defined by a set of parameters, such as
the number of massively parallel coprocessors, the memory size of each
one, memory bandwidth, peak performance, etc. Each computational task
can be analyzed within the models' frameworks and the expected
performance can be deduced as a formula depending on the parameters.
We demonstrate that a number of well-known task classes can be
implemented efficiently on heterogeneous parallel systems. Moreover, we
present our in-house infrastructure for heterogeneous computing, which
is implemented as a command scheduler based on the OpenCL model
and is able to exchange data with other schedulers via MPI. A program for
this scheduler is a dependency graph of commands executed on compute
devices. The general way to program the entire distributed system using
this infrastructure is the following: one device is controlled by one
scheduler, the synchronization of schedulers on one node is done by
adding dependencies between their commands, and the nodes
communicate via MPI calls.
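The dependency-graph execution model described above can be sketched as a toy scheduler. This is plain Python for illustration only: the command names are invented, and the in-order `run` loop stands in for what the real infrastructure does with OpenCL command queues and MPI exchange.

```python
from collections import deque

class Scheduler:
    """Toy command scheduler: a program is a dependency graph of commands."""

    def __init__(self):
        self.deps = {}      # command -> set of unfinished prerequisites
        self.actions = {}   # command -> callable to execute
        self.children = {}  # command -> commands that depend on it

    def add(self, name, action, after=()):
        self.deps[name] = set(after)
        self.actions[name] = action
        self.children.setdefault(name, set())
        for prereq in after:
            self.children.setdefault(prereq, set()).add(name)

    def run(self):
        order = []
        ready = deque(c for c, d in self.deps.items() if not d)
        while ready:
            cmd = ready.popleft()   # on a real node: submit to a device queue
            self.actions[cmd]()
            order.append(cmd)
            for child in self.children[cmd]:
                self.deps[child].discard(cmd)
                if not self.deps[child]:
                    ready.append(child)
        return order
```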
The methodology was applied to a wide range of problems, including
running the HPL [9] tests and the NAS Parallel Benchmarks [10] (FFT, MG,
CG), solving a model hydrodynamic problem, modelling toroidal plasma
evolution, and 3D modelling of the transient dynamics of perturbations in
astrophysical disc flows. All related codes were implemented and
launched on distributed heterogeneous systems. All necessary compute
kernels were written and general algorithms were developed which allow
these problems to be solved on distributed systems, utilizing all OpenCL
devices. As a side effect we obtained a heterogeneous BLAS version,
which can speed up an application on OpenCL devices via a simple re-link
with the new library. Scalability of up to eight accelerators in one node
was achieved [11].
Test launches were performed on a wide range of processor architectures
and node configurations: one node with 2 Intel Xeon Sandy/Ivy Bridge
CPUs and 8 accelerators (AMD Radeon/FirePro GPUs, NVidia TITAN
GPUs), the K100 supercomputer (64 nodes with 2 Intel Xeon X5690 CPUs
and 3 NVidia Tesla M2050 GPUs per node) [12], the K10 supercomputer
(6 nodes with 2 Intel Xeon E5-2620 CPUs and 3 NVidia Tesla 2090 GPUs
per node) [13], and prospective mini-supercomputers configured at ISR RAS.
Implementation of any task consists of three stages: estimating theoretical
performance, writing compute kernels in OpenCL C and developing the
upper logic (dependency graph) as a scheduler program. In our
presentation we will not go into details of writing compute kernels (which
are a subject for a separate discussion), but rather focus on the technique
for working out theoretical estimates of expected performance, writing
corresponding dependency-graph programs, discussing the obtained
results and further prospects of the approach.
The simplicity and transparency of the programming model, the relative
ease of development of real-world codes and their excellent scalability
prove the viability of the chosen approach. In addition, one of the major
achievements is the portability of the code to all currently known hardware
platforms. In some cases, of course, critical OpenCL C compute kernels
require special porting to a particular accelerator. However, the upper
logic level remains the same for all hardware platforms. In the future, we
plan to expand the package of applied software employing the program
infrastructure for heterogeneous computing.
1. J. J. Dongarra: Visit to the National University for Defense Technology Changsha,
China. Technical report, Oak Ridge National Laboratory, 18 p., 2013.
2. http://www.olcf.ornl.gov/titan/
3. http://runtime.bordeaux.inria.fr/StarPU/
4. http://www.openacc-standard.org/
5. http://pm.bsc.es/ompss
6. http://openmp.org/wp/openmp-specifications/
7. https://github.com/ddemidov/vexcl
8. https://www.khronos.org/opencl/sycl
9. http://www.netlib.org/benchmark/hpl/
10. http://www.nas.nasa.gov/publications/npb.html
11. http://devgurus.amd.com/thread/159457
12. http://www.kiam.ru/MVS/resourses/k100.html
Efficient Large Scale Simulation of Stochastic Lattice Models
on GPUs
Jeffrey Kelling – Helmholtz-Zentrum Dresden-Rossendorf
With the growing importance of nano-patterned surfaces and nano-composite
materials in many applications, from energy technologies to
nano-electronics, a thorough understanding of the self-organized evolution
of nano-structures needs to be established. Modelling and simulations of
such processes can help in this endeavor and provide predictions for the
outcome of manufacturing processes.
In this talk GPGPU-enabled implementations of two stochastic lattice
models will be discussed, shedding light on the complications which arise
when simulations of stochastic processes are to make efficient use of
massively parallel GPU architectures.
A single-GPU implementation of the (2+1)-dimensional roof-top model
allows very precise large-scale studies of surface growth processes in the
Kardar-Parisi-Zhang universality class [1]. Furthermore, a multi-GPU
enabled version of the 3d kinetic Metropolis lattice Monte Carlo method
[2] provides the capability to study the evolution of nano-structures both
towards and out of equilibrium, at the spatio-temporal scales of
experiments, using only small to medium-sized GPU clusters.
[1] J. Kelling, G. Ódor: Extremely large-scale simulation of a Kardar-Parisi-Zhang
model using graphics cards, Physical Review E 84, 061150 (2011)
[2] J. Kelling, G. Ódor, F. Nagy, H. Schulz, K. Heinig: Comparison of different parallel
implementations of the 2+1-dimensional KPZ model and the 3-dimensional KMC
model, The European Physical Journal - Special Topics 210, 175-187 (2012)
GPU computing at ELI-ALPS
Sándor Farkas – Extreme Light Infrastructure - Attosecond Light Pulse Source,
Szeged (ELI-ALPS)
ELI-ALPS, the Attosecond Light Pulse Source, will provide the possibility
of experiments with extremely short light pulses at an outstandingly high
repetition rate. Data are produced on-line at the laser diagnostics benches,
secondary sources and experimental end stations at rates of 10 Hz to
100 kHz. The predicted peak data rate can reach 1 Tb per second
and the peak data volume tens of petabytes per year.
Scientific computing engineers are designing an efficient storage system
together with a sound processing solution that will be able to receive this
big data flow, perform the processing and finally store the data. Several
state-of-the-art technologies have been investigated and evaluated for
high-performance computing, cost-effective scale-out data storage, and
robust virtualization and management: Ceph, OpenStack, Torque and others.
One important aspect of the data processing chain is the access to and
efficient utilization of the local GPU cluster from the virtual machines
managed by OpenStack. PCI pass-through, as the most promising
technology, is being evaluated, tested and benchmarked with real physical
simulation (PIC) codes. The current status, results and future plans of
the project will be presented.
CONTRIBUTED TALKS
Comparison of GPUs and Embedded Vision Coprocessors for
Automotive Computer Vision Applications
Zoltán Prohászka – Kishonti Informatics Ltd., Budapest
Current trends in the ADAS (Advanced Driver Assistance Systems)
functionality of production passenger vehicles and the corresponding
regulations forecast that new vehicles in the upcoming years will require
tera-operations-per-second-class computing hardware. This talk compares
GPUs and more specialised DSP-like accelerators on the following
aspects: estimated transistor count and utilization of arithmetic units,
data-feed problems and data access patterns, functional safety,
programming model, and portability of algorithms resulting from
academic/private research. A case study will be presented on our
autonomous car project, demonstrating vision-only driverless cruising.
PERSEUS: Image and Video Compression in the Age of
Massive Parallelism
Balázs Keszthelyi – V-NOVA Ltd., Cambridge
This presentation introduces V-Nova’s Perseus technology,
covering its background, use-cases and the most important, distinctive
design decisions driving its evolution. Perseus is a (currently)
software-based family of image and video compression codecs, targeting
both the contribution and distribution markets, as well as OTT. The
foundations of Perseus are analogous to the hierarchical nature of human
vision, and in this way it offers a seamless, multi-scale experience. Using
global operators, massive parallelism could be maintained without
compromising compression efficiency. V-Nova’s engineers have
successfully made great use of high-end GPUs as well as their low-power
counterparts, balancing between GPU utilization/throughput and the
low latency required by contribution applications.
Fisher Information-based Classification on GPUs
Bálint Daróczy – Institute for Computer Science and Control, Hungarian Academy of
Sciences, Budapest (SZTAKI)
With kernel methods we can solve classification and regression
problems with linear methods by projecting the data into a
high-dimensional space, where the linear surface can have a complicated
shape when projected back into the original space. However, kernel
methods in their general form are hard to implement on GPUs because of
memory limitations.
The talk has two parts. On the one hand, we show that with the help of a
model based on Fisher information we can transform objects with
complicated similarity structures into linear ones, for example time series
from multiple sensors. On the other hand, we present the details of the
GPU implementation of our Fisher information-based method.
A GPU-based program suite for numerical simulations of
planetary system formation
Emese Forgács-Dajka, Áron Süli, László Dobos – Eötvös University, Budapest (ELTE)
An important stage of planetary system formation is the growth of
planetesimals via collision. For a realistic model, numerous other effects
beyond the gravitational force need to be taken into account: gas
drag, migrational forces resulting from the gravitational interaction of
the gas disk and larger solid bodies, electromagnetic forces, etc.
Collisions among bodies are of paramount importance in the final
architecture of the emerging planetary system. For the sake of simplicity,
however, collisions are usually described as perfectly inelastic, and since
close multi-body interactions are extremely rare, only two-body collisions
are considered, with perfect conservation of momentum. Our software
package implements Runge-Kutta integrators of various orders for
scalable precision and can adaptively switch between GPU and CPU
execution. This becomes important at later stages of planetary system
formation, when the number of bodies drops below a certain limit and
CPU-based execution becomes more efficient. Our software is primarily
designed for integrating systems of interacting planetesimals but is also
applicable to other n-body problems that require precise integration of the
gravitational force, such as the motion of small objects of the Solar
System.
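As a minimal sketch of one such integrator step, here is the classical fourth-order Runge-Kutta scheme in plain Python. This is our own illustration of the general technique: the package's actual integrators, force models and GPU/CPU switching logic are more elaborate.

```python
def rk4_step(f, t, y, h):
    # One classical fourth-order Runge-Kutta step for y' = f(t, y),
    # where y is a flat list of state variables (positions, velocities).
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    # Weighted combination of the four slope estimates.
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
```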
Semi-analytic modelling of galaxy spectra using GPUs
Dezső Ribli – Eötvös University, Budapest (ELTE)
Large sky surveys have measured the high-resolution spectra of hundreds
of thousands of galaxies. These measurements provide the possibility to
infer the physical properties of very distant galaxies. The most frequently
used framework to interpret galaxy spectra is the stellar population
synthesis (SPS) model. In SPS modelling, galaxy spectra are constructed
as the superposition of spectra of stellar populations with different ages.
Usually, in SPS modelling, only integrated spectral properties (Lick
indices) are used for parameter inference, or non-rigorous fitting methods
are used (fitting by hand, MOPED). We constructed a command-line
application (SPS-FAST) which uses a Markov chain Monte Carlo method
to fit the SPS model parameters. Exploring the parameter likelihood space
with MCMC, instead of simply finding the best-fitting parameters, is
particularly useful in the case of SPS, because the model suffers from
serious degeneracies and other uncertainties. The application is written in
C++, with the crucial parts written in OpenCL. It can run on both CPU and
GPU, and it is able to fit a galaxy spectrum in 10-20 seconds.
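For illustration, the likelihood-space exploration described above can be reduced to a generic random-walk Metropolis sampler. This plain-Python sketch shows the general technique, not the SPS-FAST code; `log_likelihood`, the step size and the parameter vector are all illustrative stand-ins for the spectrum-fitting problem.

```python
import math
import random

def metropolis(log_likelihood, theta0, step, n_samples, rng=random):
    # Random-walk Metropolis: explores the likelihood surface rather than
    # only locating the single best-fitting parameter vector, so parameter
    # degeneracies show up as correlated directions in the chain.
    theta = list(theta0)
    logp = log_likelihood(theta)
    chain = []
    for _ in range(n_samples):
        proposal = [t + rng.gauss(0.0, step) for t in theta]
        logp_new = log_likelihood(proposal)
        # Accept with probability min(1, exp(logp_new - logp)).
        if logp_new - logp > math.log(rng.random()):
            theta, logp = proposal, logp_new
        chain.append(list(theta))
    return chain
```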
Optimization possibilities of inhomogeneous cosmological
simulations on massively parallel architectures
Gábor Rácz, István Csabai, László Dobos – Eötvös University, Budapest (ELTE)
Cosmological n-body simulations are done on large scales to understand
the evolution of the distribution of dark and ordinary matter in the universe.
The largest simulations can reproduce observations fairly well even
though they all make a non-obvious assumption: the homogeneous
expansion of space. While we know that the distribution of matter, and
hence the expansion of space, is not homogeneous, the approximation is
necessary due to the lack of numerical techniques for the direct solution of
Einstein’s equations. Nevertheless, toy models can be constructed that
account for the inhomogeneous expansion of space and can be integrated
over time to determine the average of the scale factor as a function of
time. As large-scale simulation codes are written with the homogeneous
expansion encoded into their core, a brand-new n-body code had to be
developed for our toy model. We show some preliminary results from our
test runs and discuss the role of parallelization in a simulation where
brute-force n-body force kernels need to be combined with complex
algorithms like Voronoi tessellation.
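A brute-force force kernel of the kind mentioned above can be sketched in plain Python as follows. This is our own illustration, not the code discussed in the talk; the Plummer-style softening parameter `eps` is an assumed detail, and the doubly nested pair loop is the part a GPU maps onto one thread per body.

```python
import math

def accelerations(pos, mass, G=1.0, eps=1e-3):
    # Brute-force O(N^2) gravitational accelerations with softening; each
    # iteration of the outer loop is independent, hence trivially parallel.
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + eps ** 2
            f = G * mass[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += f * d[k]
    return acc
```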
GPUs in a global Earth-based space weather monitoring
network
Dávid Koronczay – Geodetic and Geophysical Institute, Research Centre for Astronomy
and Earth Sciences, Hungarian Academy of Sciences, Sopron (CSFKI GGI)
AWDANet (Automatic Whistler Detector and Analyzer Network) is a
real-time plasmasphere monitoring tool based on the whistler-mode
propagation of ELF/VLF impulses through the plasmasphere. The high
data rate and low bandwidth necessitate on-site processing at each
monitoring station, and real-time results are achieved through GPUs. We
present this network and its nodes.
Forecasting Gamma-Ray Bursts with gravitational wave
detectors
Gergely Debreczeni – Wigner Research Centre for Physics,
Hungarian Academy of Sciences, Budapest (Wigner RCP)
Modern gravitational wave (GW) detectors are hunting for GWs originating
from various sources, among others from binary neutron star (BNS)
coalescence. It is assumed that in some cases it is the coalescence
(merger) of such BNS systems which is responsible for the creation of
gamma-ray bursts (GRBs), routinely detected by electromagnetic
observatories. Since the binary system emits GWs well before merging,
during the inspiral phase, analysis groups of GW detectors perform
in-depth searches for such events around the time window of known,
already detected GRBs. These joint analyses are very important in
increasing the confidence of a possible GW detection.
It is widely expected that the first direct detection of GWs will happen in
the next few years, and, as a matter of fact, the sensitivity of the
next generation of GW detectors will allow us to 'see' a few hundred
seconds of the inspiral of the binary system before the merger, for a
specific mass parameter range.
From the two facts above it naturally follows that one can (and should!)
turn the logic around and use the GWs emitted during the inspiral phase
of a BNS coalescence to predict, in advance, the time and sky location of
a GRB, and to set up constraints on the physical parameters of the system.
Given the limited sensitivity of the detectors and the high computational
power required, no such prediction algorithm exists as of today.
Although it is not yet feasible to use this new method with the current GW
detectors, it will be of utmost importance in the late Advanced
LIGO/Virgo era and definitely for the Einstein Telescope.
The goal of the research presented in this talk is to develop the
above-described zero-latency BNS coalescence 'forecasting' method and
to set up and organize the associated alert system, to be used by the next
generation of gravitational wave detectors and collaborating EM
observatories.
NIIF HPC services for research and education
Tamás Máray – National Information Infrastructure Development Institute, Budapest
(NIIF)
The National Information Infrastructure Development Institute (NIIF)
operates the largest supercomputers in Hungary, serving the HPC needs
of all kinds of Hungarian research projects since 2001. Currently 6 running
supercomputers with more than 7000 CPU cores and more than 200
GPUs provide nearly 300 Tflop/s of total computing performance for
the benefit of users at universities and academic research institutes.
By the end of 2015 this performance will be doubled according to the
current development plans. The NIIF HPC infrastructure is part of the
European HPC ecosystem, PRACE. In line with international trends, a
growing portion of our computing capacity is provided by GPGPUs. That
is why deep knowledge of GPUs and GPU applications, as well as the
ability to write efficient code for the accelerators, has become of crucial
importance for our user community. The presentation briefly introduces
the HPC infrastructure and services of NIIF and gives information about
how users can get access to the supercomputers.
GPU-based Iterative Computed Tomography Reconstruction
Zsolt Balogh – Mediso Ltd., Budapest
In the last decades, iterative 3D image reconstruction methods have
become an intensively developed research area of medical image
processing. In general, these methods require more computing capacity
than other reconstruction methods, but they can reduce the noise level and
increase the quality of the reconstructed image. The algorithms developed
in this area are very complex, but they can be made more efficient and
faster using GPU parallel computing.
At Mediso Ltd. we have started to develop a GPU-based iterative
Computed Tomography reconstruction algorithm. To harness the full
potential of the GPUs we use CUDA, a parallel computing platform that
enables using a GPU for general-purpose computation. In my talk I would
like to present some techniques and problems that occur during the
implementation process.
Fast patient-specific blood flow modelling on GPUs
Gábor Závodszky – Budapest University of Technology and Economics, Department of
Hydrodynamic Systems, Budapest (BME HDS)
Pathologic vessel malformations represent a significant portion of
cardiovascular diseases (the leading cause of death in modern societies).
The pathogenesis of these diseases, as well as their treatment methods,
are strongly related to the properties of the blood flow emerging inside the
concerned vessels. Thus, our main objective is to acquire the emergent
flow field accurately using a patient-specific geometry within a clinically
relevant time-frame.
For carrying out the computations we employed an implementation of the
lattice Boltzmann method (LBM). The key advantage of this technique lies
in its highly parallel nature. Because the statistical description of the
dynamics of the particle ensembles holds more information at every space
coordinate compared to the usual macroscopic description of the flow,
more computation can be done on a single numeric cell without requiring
information from the other cells; thus many parts of the computation can
be split into perfectly parallelizable chunks, making the method nearly
ideal to run on highly parallel hardware. Programmable video cards
(GPUs) are a great match as highly parallel hardware capable of
delivering raw computational performance previously only available on
supercomputers. Using this approach the typical computational time-frame
can be reduced from hours to minutes on a desktop machine, which opens
the possibility of integrating computational fluid dynamics (CFD) based
examinations into medical workstations.
Pattern Generation for Retina Stimulation Experiments
László Szécsi – Budapest University of Technology and Economics, Department of
Control Engineering and Information Technology, Budapest (BME IIT)
The mechanisms of how the cells in the retina process light are varied,
complex, and under heavy research. Retina cells are studied by placing
retinas surgically removed from test animals on multi-electrode arrays,
while projecting carefully designed light patterns on them. This talk
elaborates on the process of light pattern generation. We study the system
requirements arising from the measurement setup and classify the kinds
of visual stimuli widely used in research. To provide the image-processing
capabilities that researchers need, we introduce a GPU framework and
propose filtering methods for the required stimuli.
QCD on the lattice
Ferenc Pittler – Hungarian Academy of Sciences - Eötvös University Lendület Lattice
Gauge Theory Group, Budapest
In numerical simulations of quantum chromodynamics we use the GPU
cluster at Eötvös Loránd University. Most of the simulation time is spent
solving equations of the form 𝐴𝑥 = 𝑏 with a sparse matrix 𝐴. The commonly
used technique for solving such equations is some version of the conjugate
gradient algorithm. In the present talk we discuss an alternative algorithm
for overlap fermions, which is based on a general idea: preconditioning
with a cheaper fermionic discretization, augmented with a domain
decomposition multigrid method.
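For reference, the basic (unpreconditioned) conjugate gradient iteration can be sketched as follows. This plain-Python version is dense and serial for clarity; in lattice QCD the matrix is sparse and the matrix-vector product is the GPU kernel that dominates the runtime.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    # Solve A x = b for a symmetric positive-definite matrix A
    # (given here as a dense list of rows).
    n = len(b)
    x = [0.0] * n
    r = b[:]              # residual r = b - A x, with x = 0 initially
    p = r[:]              # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x
```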
Hierarchical Bayesian Method with MCMC Integration on
GPUs
János Márk Szalai-Gindl – Eötvös University, Department of Complex Systems,
Budapest (ELTE)
We present a general hierarchical Bayesian method for estimating the
(latent) characteristics of each object in a population along with
population-level parameters. The observed data are measurements of the
characteristics with some noise. This method can be useful, for example,
when the first goal is the inference of the distribution of population
parameters based on the observed data and the other goal is the
computation of the conditional expected value for each object
characteristic. A posterior sampling approach can be employed for these
purposes, where MCMC algorithms can be used. The next state of the
Markov chain of an estimated characteristic can be computed on GPU
cores in a parallel manner for each object, because they are independent.
The presentation will delineate the models and applied methods.
Efficient parallel generation of random p-bits for multispin
MC simulation
István Borsos – Institute for Technical Physics and Materials Science, Centre for
Energy Research, Hungarian Academy of Sciences, Budapest (EK)
Multispin MC simulations of certain stochastic models often require
decisions with the same probability p for many simultaneous state
transitions. As an example, consider the deterministic and stochastic
versions of the various cellular automata rules of Wolfram. Usually the
deterministic version is easily and very efficiently simulated by multispin
coding exploiting the bit parallelism in long computer words. In the
stochastic version, however, each of the deterministically computed state
changes, before being accepted, is subjected to a random acceptance
decision of fixed probability p. The straightforward standard solution is to
serialize this part of the simulation by making decisions one by one on the
deterministically computed transitions. This approach, however, loses the
speed gains of multispin coding. To solve this, we present here an
algorithm to generate vectors of bits, each bit set with probability p,
efficiently (in time) and economically (in terms of its use of fair random
bits). These vectors can be directly and simply combined with the
multispin deterministic part of the simulation, maintaining the advantages
of bit-parallelism. The algorithm needs very few hardware resources, so it
is usable in resource-limited GPU cores as well.
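A minimal sketch of one standard construction consistent with the goal described above (not necessarily the authors' algorithm): write p in binary, p = 0.b1 b2 ... bk, and combine one fair random word per digit with OR (digit 1) or AND (digit 0), processing digits from the least significant upward. Each call then yields a whole word of independent p-bits.

```python
import random

def pbit_word(p, precision=16, word_bits=64, rng=random):
    # Each bit of the returned word is 1 with probability ~p (quantised to
    # `precision` binary digits), at a cost of one fair random word per digit:
    #   acc <- (w OR acc) if b_i = 1 else (w AND acc), applied b_k ... b_1.
    bits = [(int(p * (1 << precision)) >> k) & 1 for k in range(precision)]
    acc = 0
    for b in bits:                      # LSB of p first, MSB last
        w = rng.getrandbits(word_bits)
        acc = (w | acc) if b else (w & acc)
    return acc
```

Correctness follows from P(w OR x) = 1/2 + P(x)/2 and P(w AND x) = P(x)/2, which reproduce the binary expansion of p digit by digit.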
Accelerated Monte-Carlo Particle Generators for the LHC
Gergely Gábor Barnaföldi – Wigner Research Centre for Physics,
Hungarian Academy of Sciences, Budapest (Wigner RCP)
AliROOT is a Monte Carlo-based event generator and simulation
framework of the CERN LHC ALICE experiment that plays a central role
in theoretical investigations and detector design simulations. As the
simulation and reconstruction of particle tracks consume large amounts of
computing power, any acceleration is very welcome in this field. One of the
central parts of the Monte Carlo simulation is the pseudo-random number
generator (PRNG). In this work we ported the Mersenne Twister algorithm
to GPUs and added it as a new selectable generator in the AliROOT
framework. This makes it possible to utilize the GPUs in the LHC
Computing Grid system.
Accelerating the GEANT Particle Transport System with
GPUs
Gábor Bíró – Eötvös University, Budapest (ELTE) / Wigner Research Centre for
Physics, Hungarian Academy of Sciences, Budapest (Wigner RCP)
High Energy Physics (HEP) needs a huge amount of computing
resources. In addition, data acquisition, transfer and analysis require a
well-developed infrastructure too. In order to probe new physics, it is
necessary to increase the luminosity of the accelerator facilities, by which
we can produce more and more data in future experimental detectors.
Both the testing of new theories and detector R&D are based on complex
simulations. Today we have already reached the level where Monte Carlo
detector simulation takes much more time than real data collection. This
is why speeding up the calculations and simulations has become
important in the HEP community.
The Geant Vector Prototype (GeantV) project aims to optimize the
most-used particle transport code by applying parallel computing and
exploiting the capabilities of modern CPU and GPU architectures alike.
With maximized concurrency at multiple levels, GeantV is intended to be
the successor of the Geant4 particle transport code that has been used
successfully for two decades. Here we present our latest results on
GeantV test performance, comparing the CPU/GPU-based vectorized
GeantV geometry code to the standard Geant4 version.
Code-Generation for Differential Equation Solvers
Dániel Berényi – Wigner Research Centre for Physics,
Hungarian Academy of Sciences, Budapest (Wigner RCP)
In modern HPC, code generation is becoming a ubiquitous tool for
minimizing program development time while maximizing the effectiveness
of hardware utilization. In this talk we present a framework under
development at the Wigner GPU Lab for generating parallelized
numerical solver codes targeting GPUs. One part of the research targets
the representation and manipulation of the symbolic set of equations
given by the user; the other focuses on the abstract representation of the
numerical solver program. The GPU code generation part currently
supports C++/OpenCL as its back-end. We review some implementation
considerations and early applications in the area of statistical physics.