Benchmarking Different Parallel Sum Reduction Algorithms in CUDA over Varied GPU Architectures
Munesh Singh Chauhan
Information Technology
College of Applied Sciences
Ibri, Sultanate of Oman
[email protected]
Abstract— SIMD architecture has become the main architecture for fine-grain parallelism with the introduction of GPUs (Graphical Processing Units). Many serial applications have now become the focus of GPU computing in order to enhance performance and attain higher throughput. A new initiative focuses on parallel algorithms that can easily be coupled with GPUs to minimize the software development cycle, both in terms of time and cost. CUDA (Compute Unified Device Architecture) is a parallel computing framework from NVIDIA Corporation that is used to program NVIDIA-manufactured GPUs. Parallel algorithms and techniques such as stencil operations, prefix sum reduction, sorting and scans provide optimal to near-optimal performance enhancements. Sum reduction using CUDA is analyzed, and different versions are surveyed as well as benchmarked. The benchmarking process uses bandwidth and execution time metrics to differentiate between the various versions of the reduction algorithms.
Keywords—Graphical Processing Unit; Compute Unified Device Architecture; Parallel Sum Reduction
I. INTRODUCTION
GPUs provide phenomenal computing power at commodity prices. This was not true a decade ago, when supercomputers provided the bulk of High Performance Computing (HPC) at exorbitant prices. With the introduction of GPU computing to the masses, HPC is now within the reach of the common programmer [1]. In order to parallelize legacy applications, a thorough understanding of the conventional parallel algorithms [2] along with their related optimizations is essential. Parallel algorithms provide essential and fast parallelization options to a developer, thus optimizing time and work. In most circumstances these parallel algorithms, often grouped in libraries such as "Thrust" [3], "CUBLAS" [4], etc., provide peer-reviewed logical and numerical solutions that have been vetted through a strict process over a period of time. This saves critical development time rather than re-inventing the wheel. There exist many categories of parallel algorithms; in this work, the important and widely used parallel sum reduction algorithm is implemented and benchmarked across different GPU architectures [5].
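As a brief illustration of the library route mentioned above, the whole sum reduction studied in this paper could be delegated to Thrust in a few lines. The sketch below is only illustrative and is not the benchmarking code used in this work; the helper name sum_with_thrust is an assumption.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int sum_with_thrust(const int *h_in, int n)
{
    // copy the input to the device and let Thrust perform the reduction
    thrust::device_vector<int> d_in(h_in, h_in + n);
    return thrust::reduce(d_in.begin(), d_in.end(), 0, thrust::plus<int>());
}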
II. PARALLEL REDUCTION ANALYSIS
Parallel reduction [6] plays a key role in summing up vectors, arrays and multi-dimensional matrices. The reduction operation is widely used in various branches of engineering and science. As a result, an efficient and fast reduction that can be processed in parallel can become a key factor in attaining optimum performance gain in diverse applications. Various parallel reductions using interleaved addressing have been proposed [7], but they lack benchmarking across different GPU architectures. The GPU architecture plays a crucial role in performance enhancement in the following contexts:
1. Number of Multiprocessors
2. Number of threads per block
3. Number of warps that can be executed concurrently
4. Dynamic memory allocation
Hence, it becomes imperative to evaluate each architecture against the different categories of reduction algorithms. (These architectural parameters can be queried at run time, as sketched below.)
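A minimal sketch of how the per-device parameters listed above might be queried with the CUDA runtime API is given below; the function name print_device_limits is an assumption, and error handling is omitted for brevity.

#include <cstdio>
#include <cuda_runtime.h>

void print_device_limits(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // factors 1-3 above: multiprocessors, threads per block, warp size
    printf("SM count            : %d\n", prop.multiProcessorCount);
    printf("Max threads/block   : %d\n", prop.maxThreadsPerBlock);
    printf("Warp size           : %d\n", prop.warpSize);
    printf("Shared memory/block : %zu bytes\n", prop.sharedMemPerBlock);
}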
A GPU kernel can be termed memory-bound or compute-bound, and profiling results can be interpreted on the basis of this distinction. A compute-bound kernel spends most of its time in computation and processing; as a result, the GFLOPS (Giga Floating Point Operations per Second) metric is often used to measure its performance. A memory-bound kernel, in contrast, spends most of its time in memory operations, so the GB/s (Gigabytes per Second) metric is used. In the present case, reductions have very low arithmetic intensity; thus GB/s is used as the relevant metric for comparing the various reduction algorithms.
III. REDUCTION 1: INTERLEAVED ADDRESSING
The Reduction1 algorithm [7] uses shared memory to perform the partial array sum with negligible memory latency, as shared memory runs at almost the same speed as the L1 cache. The threads of a block copy the portion of the array that is to be reduced into shared memory, with the shared array size equal to the thread block size (number of threads per block). The threads of the block wait for the entire copy operation to finish using the __syncthreads() function call. After the copy, the array elements in shared memory are reduced, with the final result stored in the first element of the shared memory array. This result represents the reduced sum of one block. Since multiple blocks are scheduled, all the block sums must then be accumulated. The final reduction of the block sums is not discussed further, as the emphasis here is on the cost of the per-block reduction; this last step is common to all categories of reduction algorithms.
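For completeness, a minimal host-side sketch of this common last step is given below. It assumes a kernel with the signature of kernel1 in Fig 1, device pointers d_in and d_out that are already allocated, and an input size divisible by the block size; all of these names and simplifications are illustrative, not taken from the benchmarking harness used in this paper.

#include <vector>
#include <cuda_runtime.h>

int parallel_sum(const int *d_in, int *d_out, int n, int blockSize)
{
    int numBlocks = n / blockSize;               // assumes n % blockSize == 0
    size_t shmem  = blockSize * sizeof(int);     // dynamic shared memory per block
    kernel1<<<numBlocks, blockSize, shmem>>>(d_in, d_out);

    // each block has written one partial sum; accumulate them on the host
    std::vector<int> h_out(numBlocks);
    cudaMemcpy(h_out.data(), d_out, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);

    int total = 0;
    for (int b = 0; b < numBlocks; ++b)
        total += h_out[b];
    return total;
}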
__global__ void kernel1(const int *in, int *out)
{
    // each thread loads one element from
    // global to shared memory
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    sdata[tid] = in[i];
    __syncthreads();
    // do reduction in shared memory
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global memory
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
Fig 1: Reduction 1: Interleaved Addressing

Disadvantage of the Reduction1 algorithm
The threads become highly divergent on successive strides, which leaves many threads idle and does not fully utilize the GPU.

IV. REDUCTION 2: THREAD DIVERGENCE
The Reduction2 algorithm tries to remove the divergence of the threads. Divergence of the threads in a block leaves the majority of the threads idle and wastes precious GPU computing resources. Reduction2 therefore replaces the divergent modulo test with a strided, thread-based index.

__global__ void redux2(const int *in, int *out)
{
    // each thread loads one element from
    // global to shared memory
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    sdata[tid] = in[i];
    __syncthreads();
    // do reduction in shared memory
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    // write result for this block to global memory
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
Fig 2: Reduction 2: Thread Divergence

Disadvantages of the Reduction2 algorithm
The Reduction2 algorithm uses a strided index and successfully removes the problem of thread divergence. But it suffers from an additional problem related to shared memory bank conflicts, as different threads try to access the same memory bank.

V. REDUCTION 3: SHARED MEMORY BANK CONFLICT
The Reduction3 algorithm solves the shared memory bank conflict issue and provides sequential addressing, as each thread accesses contiguous locations in shared memory. The major change in Reduction3 compared to Reduction2 is the use of a reversed loop and thread-ID-based indexing in place of the strided index.

__global__ void redux3(const int *in, int *out)
{
    // each thread loads one element from
    // global to shared memory
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    sdata[tid] = in[i];
    __syncthreads();
    // do reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global memory
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
Fig 3: Reduction 3: Shared Memory Bank Conflict

Disadvantages of the Reduction3 algorithm
The major disadvantage of the algorithm is that half of the threads remain idle in each iteration; in fact, the very first iteration already has only half of the threads doing useful work.
VI. REDUCTION 4: LOAD WHILE INITIALIZING SHARED MEMORY
In this algorithm the number of blocks is halved, as a single per-thread add is performed during the first load into shared memory. This further increases performance, since an addition is embedded in the shared memory load.
__global__ void redux4(const int *in, int *out)
{
    // each thread loads and adds two elements from
    // global to shared memory
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = threadIdx.x + blockIdx.x * (blockDim.x * 2);
    sdata[tid] = in[i] + in[i + blockDim.x];
    __syncthreads();
    // do reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global memory
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
Fig 4: Reduction 4: Load while initializing shared memory

Advantages of the Reduction4 algorithm
Shifting the first level of "add" operations into the store to shared memory amortizes the cost of the add operations. This provides avenues for further speed-up.

Disadvantage of the Reduction4 algorithm
The number of blocks is halved, which can result in less parallelism, especially when the input array to be reduced is small.

VII. REDUCTION 5: LOOP UNROLLING
The previous reductions still contain instructions that cause unnecessary overhead. These ancillary instructions are not related to loads, stores or arithmetic computation; they are used only for iteration and addressing, and thus cause extra overhead. This overhead can be mitigated if part of the loop is unrolled. It is noticed that when s <= 32 (the number of active threads left in the block is less than or equal to 32), only the last warp remains and it need not be synchronized, thus exposing additional parallelism in the reduction example.

__global__ void redux5(const int *in, int *out)
{
    // each thread loads and adds two elements from
    // global to shared memory
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = threadIdx.x + blockIdx.x * (blockDim.x * 2);
    sdata[tid] = in[i] + in[i + blockDim.x];
    __syncthreads();
    // do reduction in shared memory until one warp remains
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // unroll the last warp; no __syncthreads() is needed within a warp
    // (shared memory is accessed through a volatile pointer here so the
    // compiler does not cache the intermediate values in registers)
    if (tid < 32) {
        volatile int *smem = sdata;
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid + 8];
        smem[tid] += smem[tid + 4];
        smem[tid] += smem[tid + 2];
        smem[tid] += smem[tid + 1];
    }
    // write result for this block to global memory
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
Fig 5: Reduction 5: Loop Unrolling

Advantage of the Reduction5 algorithm
Loop unrolling brings additional parallelism to the fore.

Disadvantage of the Reduction5 algorithm
The reduction in the number of blocks (halving) may slow the program down, especially when the input array is small. This factor is the same as the one elucidated for the Reduction4 algorithm.

VIII. PROFILING RESULTS
As pointed out earlier, a common myth is that one should run as many threads as possible on a multiprocessor. This notion [8] is shown to be flawed in this research work. According to Little's law:

Parallelism (number of operations per multiprocessor) = Latency x Throughput    (1)

The law ties the achievable parallelism directly to the number of cores per multiprocessor: 100% throughput is not achievable when fewer options for parallelism (a smaller number of cores per multiprocessor) are available.
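As an illustration with assumed numbers (not measured in this work): if an arithmetic instruction has a latency of about 20 cycles and a multiprocessor can sustain a throughput of 32 operations per cycle, Little's law requires roughly 20 x 32 = 640 independent operations to be in flight on that multiprocessor to keep it fully utilized; exposing fewer independent operations necessarily drops the achieved throughput below 100%.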
Effective bandwidth is used as the performance metric for the parallel sum reduction, as the algorithm is heavily memory-bound. The effective bandwidth calculation is outlined below:

Beff = ((Br + Bw) / 10^9) / T    (2)

where Beff = effective bandwidth in GB/s, Br = number of bytes read, Bw = number of bytes written, and T = execution time in seconds.
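As a worked example using the first row of Table I (Reduction1, block size 8, on the NVS 5400M): Br = 2^22 x 4 = 16,777,216 bytes, Bw = (2^22 x 4) / 8 = 2,097,152 bytes and T = 4.227936 ms, giving Beff = ((16,777,216 + 2,097,152) / 10^9) / 0.004227936 ≈ 4.464 GB/s, which matches the tabulated value.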
The profiling results are benchmarked over three different GPUs (two with the Fermi architecture and one with the Kepler architecture). Each of the five reduction algorithms is compared across different block sizes (number of threads per block) and the corresponding bandwidth (in gigabytes per second). The profiling results are listed in Appendix A (Tables I, II and III), one table per GPU device. Each of the five reduction algorithms is implemented, and the best bandwidths are compared graphically in Figs. 6, 7 and 8.
Fig 6: Best Bandwidth vs Block Size (NVS 5400M)
Fig 7: Best Bandwidth vs Block Size (Tesla C2050)
Fig 8: Best Bandwidth vs Block Size (GT 740M)
Issues in CUDA Device Synchronization while Profiling Results
1) Host-device synchronization must be considered before aggregating profiling results. A CUDA kernel launch from the CPU host is asynchronous: program control returns to the CPU immediately after the kernel is launched, and the CPU does not wait for the launched kernel to complete. In contrast, the CUDA memory copy (cudaMemcpy) functions are synchronous, and as such they wait for all previous CUDA work to finish before starting.
2) If CPU timers are used to measure performance, then cudaDeviceSynchronize() must be called after each kernel launch. Frequent use of this routine is discouraged, as it stalls the GPU pipeline and thus adversely affects performance.
3) A better option for profiling is to use CUDA event timers. cudaEventSynchronize() blocks the CPU only until the event time is recorded; this routine is much lighter than cudaDeviceSynchronize(). A sketch of event-based timing is given after this list.
4) The most accurate CUDA timings can be observed using the Nsight Visual Profiler. This tool is not explored in the present paper.
5) The maximum threads-per-block size is constrained for each GPU architecture. If more threads are requested per block than the limit allows, additional blocks are queued on the SMs (Streaming Multiprocessors), which may result in further latency-related delays.
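A minimal sketch of CUDA event-based timing is shown below. The kernel name, launch configuration and device pointers (redux3, numBlocks, blockSize, d_in, d_out) are assumed to have been set up beforehand and are illustrative rather than taken verbatim from the benchmarking harness used in this paper.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);          // enqueue the start event
redux3<<<numBlocks, blockSize, blockSize * sizeof(int)>>>(d_in, d_out);
cudaEventRecord(stop, 0);           // enqueue the stop event after the kernel
cudaEventSynchronize(stop);         // block the CPU only until stop is recorded

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);

The elapsed time in milliseconds can then be converted to seconds and substituted for T in equation (2).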
IX. CONCLUSION
The best overall bandwidth is achieved on the NVS 5400M (4.508841 GB/s), with block size 8 and the Reduction2 algorithm. This shows, as discussed earlier, that creating a large number of small blocks in flight can actually benefit the memory bandwidth output, often leading to a peak. This contradicts a common belief among parallel code developers that stuffing blocks with a large number of threads provides a performance gain; as shown here, that is not the case. Creating many blocks is the key. A Streaming Multiprocessor (SM) can run many blocks together at the hardware level. Inside each block the basic unit of execution is the warp (a group of 32 threads), which in most cases runs concurrently. As a result, if many blocks are running inside an SM, many warps can also be expected to be in simultaneous execution, greatly increasing the performance gains.
ACKNOWLEDGMENT
This work has been sponsored by TRC, Sultanate of Oman
and is part of the project that investigates the use of High
Performance Parallel GPUs for solving compute-intensive and
complex applications related to weather modeling and fractal
image compression.
REFERENCES
[1] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone and J. Phillips, "GPU Computing," Proc. IEEE, vol. 96, no. 5, pp. 879-899, 2008.
[2] S. Sengupta, M. Harris and M. Garland, "Efficient parallel scan algorithms for GPUs," NVIDIA, Santa Clara, CA, Tech. Rep. NVR-2008-003, pp. 1-17, 2008.
[3] Thrust, 2015. [Online]. Available: http://thrust.googlecode.com. [Accessed: 10-Apr-2015].
[4] NVIDIA, "CUBLAS library," NVIDIA Corporation, Santa Clara, CA, 2008.
[5] J. Nickolls, I. Buck, M. Garland and K. Skadron, "Scalable parallel programming with CUDA," Queue, vol. 6, no. 2, p. 40, 2008.
[6] Y. Zhang and J. D. Owens, "A quantitative performance analysis model for GPU architectures," in Proc. IEEE 17th Int. Symp. on High Performance Computer Architecture (HPCA), 2011, pp. 382-393.
[7] M. Harris, "Optimizing Parallel Reduction in CUDA," NVIDIA Developer Technology.
[8] V. Volkov, "Better Performance at Lower Occupancy," UC Berkeley, 22 September 2010.
[9] M. Harris, "Implementing Performance Metrics in CUDA C/C++," NVIDIA Developer Zone.
Appendix A

TABLE I. PROFILE DATA FOR REDUCTION ALGORITHMS (FERMI ARCHITECTURE: NVIDIA NVS 5400M)
For every configuration, Br (bytes read per kernel) = 2^22 x 4 and Bw (bytes written per kernel) = (2^22 x 4) / BLOCK_SIZE.

Algorithm     Block Size   Execution Time (ms)   Effective Bandwidth (GB/s)
Reduction1    8            4.227936              4.464204
Reduction1    32           4.203616              4.115862
Reduction1    128          36.506687             0.463156
Reduction1    512          43.082817             0.390178
Reduction1    1024         51.052639             0.328947
Reduction1    2048         4.203040              3.993635
Reduction1    4096         4.211136              3.984985
Reduction1    8192         4.210816              3.984801
Reduction2    8            4.186080              4.508841
Reduction2    32           4.547904              3.804281
Reduction2    128          24.314495             0.695400
Reduction2    512          28.322912             0.593512
Reduction2    1024         43.103970             0.389607
Reduction2    2048         4.488896              3.739318
Reduction2    4096         4.399776              3.814129
Reduction2    8192         4.230272              3.966474
Reduction3    8            4.208448              4.484876
Reduction3    32           4.484288              3.858250
Reduction3    128          23.401920             0.722517
Reduction3    512          26.102688             0.643994
Reduction3    1024         32.283329             0.520194
Reduction3    2048         4.362560              3.847605
Reduction3    4096         4.214208              3.982080
Reduction3    8192         4.237696              3.959525
Reduction4    8            4.177152              4.267451
Reduction4    32           4.350880              3.916302
Reduction4    128          14.031360             1.200365
Reduction4    512          15.659936             1.072393
Reduction4    1024         18.828575             0.891486
Reduction4    2048         4.372736              3.837714
Reduction4    4096         4.209664              3.985892
Reduction4    8192         4.215648              3.979990
Reduction5    8            4.412512              4.039829
Reduction5    32           4.559392              3.737200
Reduction5    128          14.301600             1.177683
Reduction5    512          15.412768             1.089590
Reduction5    1024         19.115744             0.878093
Reduction5    2048         4.426848              3.790804
Reduction5    4096         4.650528              3.608034
Reduction5    8192         4.350976              3.856201

Best effective bandwidth (GB/s): Reduction1 = 4.464204, Reduction2 = 4.508841, Reduction3 = 4.484876, Reduction4 = 4.267451, Reduction5 = 4.039829.
TABLE II. PROFILE DATA FOR REDUCTION ALGORITHMS (FERMI ARCHITECTURE: NVIDIA TESLA C2050)
For every configuration, Br (bytes read per kernel) = 2^22 x 4 and Bw (bytes written per kernel) = (2^22 x 4) / BLOCK_SIZE.

Algorithm     Block Size   Execution Time (ms)   Effective Bandwidth (GB/s)
Reduction1    8            6.451168              2.925729
Reduction1    32           6.475936              2.671661
Reduction1    128          11.808000             1.431935
Reduction1    512          12.770624             1.316301
Reduction1    1024         14.195264             1.183042
Reduction1    2048         6.529248              2.570803
Reduction1    4096         6.762432              2.481550
Reduction1    8192         6.836160              2.454487
Reduction2    8            7.271616              2.595622
Reduction2    32           6.632576              2.608565
Reduction2    128          9.582592              1.764480
Reduction2    512          9.883584              1.700798
Reduction2    1024         12.133664             1.384050
Reduction2    2048         6.539328              2.566840
Reduction2    4096         6.713440              2.499659
Reduction2    8192         6.685536              2.509786
Reduction3    8            7.137216              2.644500
Reduction3    32           6.512896              2.656499
Reduction3    128          10.225344             1.653567
Reduction3    512          10.293088             1.633133
Reduction3    1024         10.870112             1.544933
Reduction3    2048         6.699008              2.505656
Reduction3    4096         6.645376              2.525261
Reduction3    8192         6.653024              2.522051
Reduction4    8            7.227424              2.466410
Reduction4    32           6.630112              2.569996
Reduction4    128          8.590912              1.960531
Reduction4    512          8.638432              1.944057
Reduction4    1024         8.860736              1.894358
Reduction4    2048         7.021984              2.389825
Reduction4    4096         6.645088              2.525063
Reduction4    8192         7.297760              2.299095
Reduction5    8            7.006688              2.544111
Reduction5    32           6.622656              2.572889
Reduction5    128          8.507424              1.979771
Reduction5    512          8.143008              2.062334
Reduction5    1024         8.737728              1.921027
Reduction5    2048         7.122016              2.356259
Reduction5    4096         6.628576              2.531353
Reduction5    8192         7.257088              2.311980

Best effective bandwidth (GB/s): Reduction1 = 2.925729, Reduction2 = 2.608565, Reduction3 = 2.656499, Reduction4 = 2.569996, Reduction5 = 2.544111.
TABLE III. PROFILE DATA FOR REDUCTION ALGORITHMS (KEPLER ARCHITECTURE: NVIDIA GT 740M)
For every configuration, Br (bytes read per kernel) = 2^22 x 4 and Bw (bytes written per kernel) = (2^22 x 4) / BLOCK_SIZE.

Algorithm     Block Size   Execution Time (ms)   Effective Bandwidth (GB/s)
Reduction1    8            5.478656              3.445073
Reduction1    32           4.903104              3.528684
Reduction1    128          13.635552             1.240015
Reduction1    512          17.617887             0.954143
Reduction1    1024         18.840544             0.891354
Reduction1    2048         6.134880              2.736061
Reduction1    4096         4.196352              3.999024
Reduction1    8192         4.441664              3.777698
Reduction2    8            5.288512              3.568937
Reduction2    32           4.505120              3.840409
Reduction2    128          11.047904             1.530452
Reduction2    512          14.091296             1.192934
Reduction2    1024         16.617439             1.010601
Reduction2    2048         6.115392              2.744780
Reduction2    4096         4.476576              3.748694
Reduction2    8192         4.418048              3.797891
Reduction3    8            5.651808              3.339528
Reduction3    32           4.489312              3.853932
Reduction3    128          10.668032             1.584949
Reduction3    512          12.560352             1.338337
Reduction3    1024         13.891552             1.208907
Reduction3    2048         6.007520              2.794066
Reduction3    4096         4.665568              3.596842
Reduction3    8192         4.431072              3.786728
Reduction4    8            5.224032              3.412267
Reduction4    32           5.095104              3.344261
Reduction4    128          7.535552              2.235105
Reduction4    512          9.200608              1.825271
Reduction4    1024         9.025536              1.859769
Reduction4    2048         6.024928              2.785313
Reduction4    4096         4.526176              3.707161
Reduction4    8192         4.789728              3.502963
Reduction5    8            5.336128              3.340585
Reduction5    32           5.284000              3.224709
Reduction5    128          8.436736              1.996359
Reduction5    512          9.103328              1.844776
Reduction5    1024         9.831392              1.707328
Reduction5    2048         6.342688              2.645773
Reduction5    4096         4.499872              3.728831
Reduction5    8192         5.234880              3.205086

Best effective bandwidth (GB/s): Reduction1 = 3.999024, Reduction2 = 3.840409, Reduction3 = 3.853932, Reduction4 = 3.707161, Reduction5 = 3.728831.