Atomic Operations

Andrew V. Adinetz
May 6, 2014
Application: Histogram

- Assign each input element to a bin
- How many elements are in each bin?
- a_i — input data, 1 ≤ i ≤ n
- h_j = #{ a_i : bin(a_i) = j }
- often, bin(a_i) = a_i
Example Usage: Histogram Equalization

(figure: equalization example)
Histogram on CPU

// a histogram of byte values
void histo_cpu
(int *histo, const unsigned char *data, int n, int nbins) {
  // init bins with zeros
  for(int j = 0; j < nbins; j++)
    histo[j] = 0;
  // accumulate histogram counters
  for(int i = 0; i < n; i++)
    histo[data[i]]++;
}
“Histogram on GPU”

__global__ void histo_kernel
(int *histo, const unsigned char *data, int n) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if(i < n) {
    histo[data[i]]++;
  }
}

this does not work!
Why Doesn’t it Work?

// PTX for histo[data[i]]++:
ld.global.u32  %r7, [%rd9];  // load
add.s32        %r9, %r7, 1;  // add
st.global.u32  [%rd9], %r9;  // store

A non-atomic update compiled into multiple hardware instructions.
Why Doesn’t it Work?

histo is all zeroes initially; data = 0 1 0 1 2 3 4 5 10 5.
Two threads process the two elements with value 1 and race on histo[1]
(times t1 < t2 < t3):

time      thread 1                thread 2
t1        load  (histo[1] == 0)   load  (histo[1] == 0)
t2 > t1   add 1 (result == 1)     add 1 (result == 1)
t3 > t2   store (histo[1] = 1)    store (histo[1] = 1)

histo[1]: must be 2, but is 1
Atomic Operations

Safe updates in a multi-threaded environment:
- Instructions for atomic read-modify-write
  - atomicity guaranteed by hardware
  - atomicity scope = a single instruction
- For algorithms where multiple threads can write the same memory location
- Work on global and shared memory
- Updates are visible to all GPU threads
  - when read atomically or through volatile reads
Atomic Operations (API)

atomicOp(T *addr, T val)
- addr — shared or global memory address
- val — the second operand
- returns the old value (before the update)
- *addr Op= val is performed atomically
- T = int, unsigned int, unsigned long long
- Op = Add, Sub, And, Or, Xor, Min, Max, Inc, Dec, Exch
- for T = float: Op = Add, Exch
- example: atomicAdd(&counter, 1); a sketch using the return value follows
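The returned old value is what makes these more than counters: each caller gets a distinct value. A minimal sketch of assigning unique output slots this way (the kernel and its names are illustrative, not from the slides):

// Each thread reserves a unique slot by using the value that
// atomicAdd() returns, i.e., the counter *before* the increment.
__global__ void enumerate_k(int *out, int *counter, int n) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if(i < n) {
    int slot = atomicAdd(counter, 1); // old value, distinct per thread
    out[slot] = i;
  }
}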
Atomic Inc/Dec

Add/subtract 1 with wrap-around at a bound other than T_MAX/T_MIN
(ring-buffer sketch below):
- old — value of *addr before the update
- atomicInc(T *addr, T val)
  - *addr = old >= val ? 0 : old + 1
- atomicDec(T *addr, T val)
  - *addr = (old == 0 || old > val) ? val : old - 1
- slower than atomicAdd / atomicSub
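The wrap-around makes atomicInc a natural fit for a ring-buffer index; a minimal sketch (the buffer size and helper name are assumptions):

#define RING_SZ 1024

// Claim the next slot of a ring buffer with RING_SZ entries.
// atomicInc returns the old index and wraps *head back to 0
// after RING_SZ - 1, so indices cycle through 0 .. RING_SZ-1.
__device__ unsigned ring_slot(unsigned *head) {
  return atomicInc(head, RING_SZ - 1);
}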
Availability

- CC 1.1 — 32-bit integer atomics in global memory
- CC 1.2 — 32-bit integer atomics in shared memory, 64-bit in global memory
- CC 2.x (Fermi) — floating-point add; 64-bit integer atomics in shared memory
- CC 3.5 (Kepler) — 64-bit integer And, Or, Xor, Min, Max
Histogram with Atomics on GPU

__global__ void histo_kernel
(int *histo, const unsigned char *data, int n) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if(i < n) {
    atomicAdd(&histo[data[i]], 1);
  }
}
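A possible host-side launch for this kernel, one thread per element (the grid sizing and the cudaMemset for bin initialization are assumptions, not shown on the slide):

// zero the bins, then launch enough blocks to cover all n elements
cudaMemset(histo, 0, nbins * sizeof(int));
int bs = 256;
histo_kernel<<<(n + bs - 1) / bs, bs>>>(histo, data, n);
cudaDeviceSynchronize();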
Histogram Performance

(performance chart: global atomics improved in Kepler, now resolved in the L2 cache)
Shared Memory Atomics

- Faster than atomics on global memory
  - important on Fermi, less so on Kepler
- Fewer conflicts
  - fewer threads accessing (1 block)
  - several copies of each address (one per thread block)
- Usage (see the kernel below):
  - don't forget __syncthreads()
  - process several elements per thread
#define PER_THREAD 32
#define NCLASSES 256
#define BS 256

__global__ void histo_kernel(int *histo, const unsigned char *data, int n) {
  // shared memory histogram storage
  __shared__ int lhisto[NCLASSES];
  for(int i = threadIdx.x; i < NCLASSES; i += blockDim.x)
    lhisto[i] = 0;
  __syncthreads();
  // compute per-block histogram
  int istart = blockIdx.x * (BS * PER_THREAD) + threadIdx.x;
  int iend = min(istart + BS * PER_THREAD, n);
  for(int i = istart; i < iend; i += BS)
    atomicAdd(&lhisto[data[i]], 1);
  __syncthreads();
  // update global memory histogram
  for(int i = threadIdx.x; i < NCLASSES; i += blockDim.x)
    atomicAdd(&histo[i], lhisto[i]);
}
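Each block of this kernel consumes BS * PER_THREAD input elements, so a matching launch might look like this (a sketch, not from the slides):

// one block per BS * PER_THREAD elements of input
int per_block = BS * PER_THREAD;
int nblocks = (n + per_block - 1) / per_block;
histo_kernel<<<nblocks, BS>>>(histo, data, n);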
Multiple Elements per Thread

(performance chart) Otherwise, the global atomics are not amortized!
Filtering

- copy only the elements satisfying a predicate
- on the host:

nres = 0;
for(int i = 0; i < n; i++) {
  if(data[i] > 0)
    res[nres++] = data[i];
}

- on the GPU:

__global__ void filter_k
(int *res, int *nres, const int *data, int n) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if(i >= n) return;
  if(data[i] > 0)
    res[atomicAdd(nres, 1)] = data[i];
}

- what is the problem?
Simple Filtering (K20X)

(performance chart)
- too many atomics
- high degree of conflict
Filtering Aggregation

- Shared memory atomics
  - same number of atomics
  - still a high degree of conflict
- Warp-aggregated increment (implemented below)
  - select a leader
  - the leader performs the atomic operation
  - the leader broadcasts the result
  - each thread computes its own position
  - up to 32x fewer atomics
- Combine shared memory and warp aggregation
Warp Intrinsics (before CC 3.0)

- Reduction + sync across the warp
  - for active threads only
- CC 1.2 (__any/__all), CC 2.0 (__ballot)
- int __any(int v)
  - non-zero iff v is non-zero on any active thread
- int __all(int v)
  - non-zero iff v is non-zero on all active threads
- unsigned __ballot(int v)
  - mask: bit i is set iff v is non-zero for lane i
Bit Intrinsics

- unsigned __brev(unsigned v)
  - reverse the bits
- int __clz(int v)
  - number of consecutive high-order zero bits
- int __ffs(int v)
  - position of the least significant set bit, counting from 1
  - __ffs(0) = 0
- int __popc(unsigned v)
  - number of bits set to 1
- all intrinsics also exist with an ll suffix (e.g., __ffsll, __popcll)
  - for 64-bit integers
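Combining the warp and bit intrinsics already gives useful warp-level collectives; for example, counting the lanes for which a predicate holds, with no atomics at all. A minimal sketch (the predicate is an assumption):

// how many lanes of this warp see data[i] > 0?
// __ballot() gathers one bit per active lane; __popc() counts them
int pred = i < n && data[i] > 0;
int nactive = __popc(__ballot(pred));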
Warp Shuffle (CC 3.0+)

- Intra-warp "collective operation"
- int __shfl(int var, int lid)
  - read the value of var from lane lid (0 .. warpSize - 1)
  - lane lid must also call __shfl()
- Other intrinsics available (see the sketch below)

// read a value
int v = a[i];
// get the value from the lane to the right
int v_right = __shfl(v, (threadIdx.x + 1) % warpSize);
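Among the "other intrinsics" are __shfl_up, __shfl_down and __shfl_xor; a classic use of __shfl_xor is a warp-wide sum in log2(32) = 5 steps. A minimal sketch (the helper name is an assumption):

// butterfly reduction: after the loop, every lane of the warp
// holds the sum of v over all 32 lanes
__device__ int warp_sum(int v) {
  for(int d = warpSize / 2; d > 0; d /= 2)
    v += __shfl_xor(v, d);
  return v;
}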
Intra-Warp Broadcast

#define WARP_SZ 32
#define MAX_NWARPS 32

__device__ int lane_id(void) { return threadIdx.x % WARP_SZ; }
__device__ int warp_id(void) { return threadIdx.x / WARP_SZ; }

// broadcast v from lane leader to all lanes of the warp
__device__ int warp_bcast(int v, int leader) {
#if __CUDA_ARCH__ >= 300
  return __shfl(v, leader);
#else
  volatile __shared__ int vs[MAX_NWARPS];
  if(lane_id() == leader)
    vs[warp_id()] = v;
  return vs[warp_id()];
#endif
}
Warp-Aggregated Increment

__device__ int atomicAggInc(int *p) {
  int mask = __ballot(1);
  // select the leader: the lowest active lane
  int leader = __ffs(mask) - 1;
  // the leader increments the counter once for the whole warp
  int res;
  if(lane_id() == leader)
    res = atomicAdd(p, __popc(mask));
  // broadcast the leader's result
  res = warp_bcast(res, leader);
  // each thread adds the number of active lanes below its own
  return res + __popc(mask & ((1 << lane_id()) - 1));
} // atomicAggInc

// ...
if(data[i] > 0)
  res[atomicAggInc(nres)] = data[i];
// ...
Warp Aggregation and Shared Memory

- Warp aggregation
  - self-contained
  - can be used anywhere, even in deeply nested ifs
  - very specific use cases
- Combining with shared memory
  - same code
Warp-Aggregated Atomics (K20X)

(performance chart: 17x faster; filtering on K20X is cheap)
Warp-Aggregated Atomics (M2070)

(performance chart: 55x faster; on Fermi, shared memory + aggregation is best; filtering is not as cheap)
Atomic Exch/CAS

- Useful for:
  - synchronization primitives
  - list algorithms (see the push sketch below)
- atomicExch(T *addr, T val)
  - *addr = val, returns the old value
  - are we the first to write the new value?
- atomicCAS(T *addr, T cmp, T val)
  - *addr = old == cmp ? val : old, returns the old value
  - compare-and-swap: has *addr changed in between?
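As one instance of the "list algorithms" use case, a lock-free push onto a singly-linked list stored as index arrays; a minimal sketch (the index-based layout is an assumption, and ABA issues are ignored):

// push node onto the list head (indices, -1 == empty list)
__device__ void list_push(int *head, int *next, int node) {
  int old = *head, assumed;
  do {
    assumed = old;
    next[node] = assumed;                  // link node to current head
    old = atomicCAS(head, assumed, node);  // publish only if unchanged
  } while(old != assumed);
}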
Atomics for Other Types with atomicCAS

- double, 32-bit complex, etc.
  - either 32- or 64-bit
- Slow, but still somewhat faster than a critical section

__device__ double atomicAdd(double *address, double val) {
  unsigned long long *address_as_ull = (unsigned long long *)address;
  unsigned long long old = *address_as_ull, assumed;
  do {
    assumed = old;
    // attempt to swap in assumed + val; retry if *address changed
    old = atomicCAS(address_as_ull, assumed,
      __double_as_longlong(val + __longlong_as_double(assumed)));
  } while(assumed != old);
  return __longlong_as_double(old);
} // atomicAdd
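The same compare-and-swap loop works for any read-modify-write the hardware lacks; for example, a float maximum (atomicMaxFloat is a hypothetical helper, not a CUDA built-in):

__device__ float atomicMaxFloat(float *addr, float val) {
  int *addr_as_int = (int *)addr;
  int old = *addr_as_int, assumed;
  // stop early once the stored value is already >= val
  while(__int_as_float(old) < val) {
    assumed = old;
    old = atomicCAS(addr_as_int, assumed, __float_as_int(val));
    if(old == assumed) break; // our value was written
  }
  return __int_as_float(old);
}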
Atomics in OpenCL

- Extensions and built-ins
  - OpenCL 1.2: 32-bit built-in, 64-bit as extensions
  - separate extensions for global/local, 32/64-bit, base/extended
- atom[ic]_op(volatile qual T *p, T val)
  - T = int, unsigned, long, unsigned long
    - long, unsigned long — 64-bit
    - T = float only for op = xchg
  - op = add, sub, xchg, min, max, and, or, xor
  - qual = global, local
- atomic_cmpxchg(volatile qual T *p, T cmp, T val)
  - = CUDA atomicCAS
Questions?
Exercise 1

- computing the histogram of an "image"
  - = 3 histograms, one per color channel
- /home/gpu/Atomics/exercises/histo/
  - task: task/
  - solution: solution/
- TODO:
  - write a working kernel with global atomics
  - call the kernel correctly
Exercise 2

- partitioning an array into values > 0 and <= 0
  - = 2 filters done simultaneously
- /home/gpu/Atomics/exercises/partition/
  - task: task/
  - solution: solution/
- TODO:
  - write a working kernel with global atomics
  - optional: warp aggregation for better performance
Atomics with Sub-Word Integers

- char, short, or even a non-power-of-two number of bits
  - must fit entirely into a single 4-byte word
- Can save shared memory
  - but avoid overflows: a carry spills into the neighboring sub-word

__device__ void atomicAdd(short *addr, short val) {
  // address of the containing, aligned 4-byte word
  size_t up = (size_t)addr, upi = up / sizeof(int) * sizeof(int);
  // bit offset of the short within that word (little-endian)
  int sh = (int)(up - upi) * 8;
  int *pi = (int *)upi;
  // add val shifted into position with a full-word atomic
  atomicAdd(pi, (int)val << sh);
}
“GPU Implementation” of Critical Section (Mutex)

- 0 = free, 1 = locked
- the first thread to write 1 locks the mutex

// enter the critical section (lock)
while(atomicExch(&lock, 1));
// do some useful work
__threadfence();
// leave the critical section (unlock)
atomicExch(&lock, 0);

- warp threads execute in lock-step
- => this deadlocks if multiple threads of the same warp try to lock:
  the lane holding the lock cannot reach the unlock while its warp
  keeps re-executing the spin loop of the losing lanes
Correct Mutex Implementation

- the loop executes synchronously across the warp, so there is no deadlock

// try to lock (locks[d] is the lock for this thread's data)
int want_lock = 1;
while(__any(want_lock)) {
  if(want_lock && !atomicExch(&locks[d], 1)) {
    // do useful work
    __threadfence();
    // unlock
    atomicExch(&locks[d], 0);
    want_lock = 0;
  }
} // while(any wants to lock)
Performance on K20X: Multiple Approaches

(performance chart: "artificial atomics" and mutexes/locks are very slow)