Siyuan Fu


CUDA

Architectures

Occupancy

Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. Another way to view occupancy is as the percentage of the hardware's warp-processing capacity that is actively in use. Higher occupancy does not always result in higher performance; however, low occupancy always reduces the ability to hide latency, which degrades overall performance. A large discrepancy between the theoretical and the achieved occupancy during execution typically indicates a highly imbalanced workload.

Suppose 4 clock cycles are needed to dispatch the same instruction for all threads in a warp. If there is one global memory access every 4 instructions, how many warps are needed to fully tolerate 200-cycle memory latency?

The first warp runs for 4 × 4 = 16 cycles before it issues the memory access, after which it has to wait 200 cycles for the result. At that point the warp scheduler can switch to another warp and execute its first 4 instructions, which takes another 16 cycles. This repeats until the scheduler can switch back to the original warp once the 200-cycle latency has elapsed.

So the number of warps needed to cover the stall is ceil(200/16) = 13.

Memory

Host (CPU) data allocations are pageable by default. Transferring pageable data to the device is slow. To maximize bandwidth, use cudaHostAlloc or cudaHostRegister to get pinned (page-locked) host memory.
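A minimal sketch of both options (the size N and the usage comments are placeholders; error checking omitted). Pinned buffers are also what cudaMemcpyAsync needs in order to truly overlap copies with other work:

const size_t N = 1 << 20;  // example element count

// Allocate pinned host memory directly.
float *h_pinned;
cudaHostAlloc(&h_pinned, N * sizeof(float), cudaHostAllocDefault);
// ... use h_pinned as the source/destination of fast copies ...
cudaFreeHost(h_pinned);

// Or pin an existing pageable allocation in place.
float *h_data = new float[N];
cudaHostRegister(h_data, N * sizeof(float), cudaHostRegisterDefault);
// ... copies to/from h_data now run at pinned-memory bandwidth ...
cudaHostUnregister(h_data);
delete[] h_data;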

Stream

cudaMemcpyAsync

__host__ __device__ cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0)

cudaMemcpyAsync won't stall the CPU, whereas cudaMemcpy does.

If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and the stream is non-zero, the copy may overlap with operations in other streams.

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes,
                  cudaMemcpyHostToDevice, stream[i]);
}
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes,
                  cudaMemcpyDeviceToHost, stream[i]);
}

Event

An event is completed when all commands in the stream preceding it complete.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream1);
kernel<<<gridSize, blockSize, sharedMemSize, stream1>>>();
cudaEventRecord(stop, stream1);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
// cudaEventDestroy(...)

cudaEventSynchronize stalls the CPU. We can have the CPU do some extra work before calling cudaEventSynchronize.
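For example (a minimal sketch; doCpuWork() is a placeholder for any host-side computation):

cudaEventRecord(stop, stream1);
doCpuWork();                 // hypothetical CPU work that overlaps with the GPU
cudaEventSynchronize(stop);  // block only once the timing result is needed
cudaEventElapsedTime(&elapsedTime, start, stop);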

Synchronization and Atomics

Warp-Level Functions
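A minimal sketch of one such function, __shfl_down_sync, which lets threads in a warp read each other's registers directly; here it sums a value across a warp without shared memory:

// Warp-level reduction: each lane adds the value held by the lane
// `offset` positions above it; after log2(32) steps lane 0 holds the warp sum.
__device__ float warpReduceSum(float val) {
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;
}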

Matrix Multiplication

CPU
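A minimal sketch of the CPU reference version (assuming square, row-major matrices of width N):

void matMulCPU(const float *A, const float *B, float *C, int N) {
  for (int row = 0; row < N; ++row)
    for (int col = 0; col < N; ++col) {
      float sum = 0.0f;
      for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];
      C[row * N + col] = sum;
    }
}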

GPU
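A minimal sketch of a naive GPU kernel (one thread per output element; same matrix layout as above):

__global__ void matMulNaive(const float *A, const float *B, float *C, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < N && col < N) {
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
      sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
  }
}

A tiled version that stages sub-blocks of A and B in shared memory reduces global-memory traffic further.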

Matrix Transpose

Naive Way

Reads from the input matrix are coalesced, but writes to the transposed output are strided, which causes many cache misses.
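A minimal sketch of the naive kernel (assuming a square, row-major matrix of width N):

__global__ void transposeNaive(const float *in, float *out, int N) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (row < N && col < N)
    out[col * N + row] = in[row * N + col];  // coalesced read, strided write
}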

Improvement

Inside the kernel, stage a tile of the matrix in shared memory first, then write it out transposed, so that both the global-memory reads and the writes are coalesced.

Resolve Bank Conflicts

Instead of __shared__ float smem[BLOCK_SIZE][BLOCK_SIZE], use __shared__ float smem[BLOCK_SIZE][BLOCK_SIZE+1]. The extra padding column shifts each row by one bank, so threads reading a column of the tile no longer hit the same bank.

Everything else stays the same:

smem[threadIdx.y][threadIdx.x] = matA[i * BLOCK_SIZE + j];
__syncthreads();
matB[jj * BLOCK_SIZE + ii] = smem[threadIdx.x][threadIdx.y];
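Putting it together, a minimal sketch of the tiled transpose with the padded shared-memory array (TILE and N are assumptions of this sketch; the snippet above uses its own index variables):

#define TILE 32

__global__ void transposeShared(const float *in, float *out, int N) {
  __shared__ float smem[TILE][TILE + 1];  // +1 column avoids bank conflicts

  int col = blockIdx.x * TILE + threadIdx.x;
  int row = blockIdx.y * TILE + threadIdx.y;
  if (row < N && col < N)
    smem[threadIdx.y][threadIdx.x] = in[row * N + col];    // coalesced read

  __syncthreads();

  // Swap the block coordinates so the write is also coalesced.
  int tCol = blockIdx.y * TILE + threadIdx.x;
  int tRow = blockIdx.x * TILE + threadIdx.y;
  if (tRow < N && tCol < N)
    out[tRow * N + tCol] = smem[threadIdx.x][threadIdx.y]; // coalesced write
}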

Scan

Naive way:
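A minimal sketch of the naive (Hillis-Steele) scan for a single block (inclusive scan; n is assumed to equal the block size and be a power of two; launch with n * sizeof(float) dynamic shared memory):

__global__ void scanNaive(float *data, int n) {
  extern __shared__ float temp[];
  int tId = threadIdx.x;
  temp[tId] = data[tId];
  __syncthreads();
  // Each pass adds the element `stride` positions to the left: O(n log n) additions.
  for (int stride = 1; stride < n; stride *= 2) {
    float val = (tId >= stride) ? temp[tId - stride] : 0.0f;
    __syncthreads();
    temp[tId] += val;
    __syncthreads();
  }
  data[tId] = temp[tId];
}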

Efficient way:

The up-sweep is the same as a parallel reduction.

During the down-sweep (after the last element has been set to zero), the stride halves after each step, opposite to the up-sweep. In each step, copy the value on the right to the left, and write the sum of the two original values to the right.
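A minimal sketch of the work-efficient (Blelloch) scan for a single block (exclusive scan; n is assumed to equal the block size and be a power of two; launch with n * sizeof(float) dynamic shared memory):

__global__ void scanEfficient(float *data, int n) {
  extern __shared__ float temp[];
  int tId = threadIdx.x;
  temp[tId] = data[tId];

  // Up-sweep (reduce): build partial sums in place.
  for (int stride = 1; stride < n; stride *= 2) {
    __syncthreads();
    int idx = (tId + 1) * stride * 2 - 1;
    if (idx < n)
      temp[idx] += temp[idx - stride];
  }

  // Down-sweep: clear the last element, then push partial sums back down.
  if (tId == 0) temp[n - 1] = 0.0f;
  for (int stride = n / 2; stride > 0; stride /= 2) {
    __syncthreads();
    int idx = (tId + 1) * stride * 2 - 1;
    if (idx < n) {
      float t = temp[idx - stride];    // old left value
      temp[idx - stride] = temp[idx];  // copy right to left
      temp[idx] += t;                  // right = sum of both
    }
  }
  __syncthreads();
  data[tId] = temp[tId];
}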

Reduction

extern __shared__ float partialSum[];
// ... load into shared memory
for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
  __syncthreads();
  if (tId < stride)
    partialSum[tId] += partialSum[tId + stride];
}
if (tId == 0) output[bId] = partialSum[0];

Stream Compaction

  1. Create a mask array (1 for elements to keep, 0 otherwise).
  2. Do an exclusive scan on the mask array.
  3. For each kept element, the exclusive-scan result is its index in the compacted output, so the final step scatters the kept elements to those indices (see the sketch below).
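A minimal sketch of the scatter step, assuming mask and scanResult were computed in steps 1 and 2 (e.g., with a scan like the one sketched above):

__global__ void scatterCompacted(const float *in, const int *mask,
                                 const int *scanResult, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && mask[i])
    out[scanResult[i]] = in[i];  // scanResult[i] is this element's output index
}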

Radix Sort

For each bit (radix), starting from the least significant, perform a stable split using the same scan-based trick as stream compaction: keys whose current bit is 0 are scattered in front of keys whose bit is 1, preserving their relative order.
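A minimal sketch of one split pass (zerosBefore is the exclusive scan of the "bit is 0" flags and totalZeros is the total number of such keys; both are assumed to be computed beforehand, e.g. with the scan above):

__global__ void radixScatter(const unsigned int *in, unsigned int *out,
                             const int *zerosBefore, int totalZeros,
                             int bit, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  int isOne = (in[i] >> bit) & 1;
  // 0-bit keys go to their rank among the zeros; 1-bit keys follow after all zeros.
  int dst = isOne ? (totalZeros + i - zerosBefore[i]) : zerosBefore[i];
  out[dst] = in[i];
}

Repeating this pass for every bit, ping-ponging between in and out, leaves the array fully sorted.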


tags: Graphics