Assignment 02: GPU Architecture and cuTile

In this assignment you will explore GPU hardware characteristics and write tile-based kernels using cuTile. Place all code in src/, one file per task, and use FP16 tensors for all tasks.

We assume the following import conventions:

import cuda.tile as ct
import cupy as cp

Task 1: GPU Device Properties

CuPy exposes all CUDA device attributes through a single dictionary.

Your task is to iterate over cp.cuda.Device().attributes.items() and report the values of L2CacheSize, MaxSharedMemoryPerMultiprocessor, and ClockRate on the DGX Spark.
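A minimal host-side sketch of the lookup (the sample dict with zeroed values is a placeholder, not real DGX Spark numbers; on the machine itself you would pass cp.cuda.Device().attributes to the same helper):

```python
# Attribute names requested by the task.
WANTED = ("L2CacheSize", "MaxSharedMemoryPerMultiprocessor", "ClockRate")

def pick(attributes, names=WANTED):
    # Keep only the requested entries of the device-attribute dict.
    return {name: value for name, value in attributes.items() if name in names}

# Placeholder standing in for cp.cuda.Device().attributes;
# the zeros are illustrative only.
sample = {"L2CacheSize": 0, "MaxSharedMemoryPerMultiprocessor": 0,
          "ClockRate": 0, "WarpSize": 32}
for name, value in pick(sample).items():
    print(f"{name}: {value}")
```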


Task 2: Matrix Reduction Kernel

a) Your task is to write a cuTile kernel that reduces a 2D input matrix of arbitrary shape (M, K) along its last dimension (K), producing a 1D output vector of shape (M,) that contains the per-row sum.

Requirements:

  • Use either ct.reduce or ct.sum inside the kernel

  • Parallelize over the M dimension via the grid

  • Verify correctness by comparing the result to torch.sum(mat, dim=1) via torch.allclose.

  • Since tile dimensions must be powers of two, zero-padding inside the kernel may be necessary depending on the provided matrix shape.
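The padding requirement can be illustrated on the host with NumPy (a sketch of the idea only; inside the kernel the padding happens when loading a tile, and next_pow2/pad_rows are helper names chosen here, not part of the cuTile API):

```python
import numpy as np

def next_pow2(n):
    # Smallest power of two >= n.
    return 1 << max(n - 1, 0).bit_length()

def pad_rows(mat):
    # Zero-pad the last (K) dimension up to a power of two;
    # the added zeros do not change the per-row sum.
    m, k = mat.shape
    padded = np.zeros((m, next_pow2(k)), dtype=mat.dtype)
    padded[:, :k] = mat
    return padded

mat = np.ones((3, 5), dtype=np.float16)   # K = 5 is not a power of two
assert pad_rows(mat).shape == (3, 8)
assert np.allclose(pad_rows(mat).sum(axis=1), mat.sum(axis=1))
```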

b) Discuss, in theory, how increasing or decreasing the M and K dimensions affects the degree of parallelism and the workload per kernel program.


Task 3: 4D Tensor Elementwise Addition

a) Your task is to write a cuTile kernel that adds two 4D tensors A and B element-wise and stores the result in C. All three tensors have the same shape (M, N, K, L).

Implement the kernel twice using the following two approaches:

  1. Each kernel program computes one 2D output tile covering dimensions K and L

  2. Each kernel program computes one 2D output tile covering dimensions M and N

Parallelize over the remaining two dimensions in each respective case.

Verify both implementations against PyTorch’s native A + B via torch.allclose.
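Before writing the kernels, the two tiling schemes can be sketched on the host with NumPy loops, where each loop iteration stands in for one kernel program (a sketch under that analogy, not cuTile code):

```python
import numpy as np

M, N, K, L = 2, 3, 4, 8
rng = np.random.default_rng(0)
A = rng.random((M, N, K, L)).astype(np.float16)
B = rng.random((M, N, K, L)).astype(np.float16)

# Approach 1: one program per (m, n) index; each handles a (K, L) tile.
C1 = np.empty_like(A)
for m in range(M):
    for n in range(N):
        C1[m, n] = A[m, n] + B[m, n]

# Approach 2: one program per (k, l) index; each handles an (M, N) tile.
C2 = np.empty_like(A)
for k in range(K):
    for l in range(L):
        C2[:, :, k, l] = A[:, :, k, l] + B[:, :, k, l]

# Both schemes produce the full elementwise sum.
assert np.allclose(C1, A + B) and np.allclose(C2, A + B)
```

Note that in the actual kernels the two approaches touch memory very differently: for a contiguous tensor, a (K, L) tile is a contiguous block, while an (M, N) tile gathers strided elements.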

b) Benchmark both of your kernels and report their average runtimes using Triton’s triton.testing.do_bench function (https://triton-lang.org/main/python-api/generated/triton.testing.do_bench.html) or a benchmarking utility of your choice from torch or cupy.
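If you roll your own benchmark instead of do_bench, the basic warmup-then-average pattern looks like this (a CPU wall-clock sketch; for GPU kernels you must additionally synchronize the device before and after timing, which triton.testing.do_bench handles for you):

```python
import time

def bench(fn, warmup=10, iters=100):
    # Warm up first so one-time costs (allocation, JIT compilation)
    # are excluded, then average the runtime over many iterations.
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters  # seconds per call

avg_s = bench(lambda: sum(range(1_000)))
print(f"{avg_s * 1e6:.2f} us per call")
```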

Use dimension sizes $M = 16$, $N = 128$, $K = 16$, $L = 128$.

Report whether you observe any runtime differences and explain why.


Task 4: Benchmarking Bandwidth

a) Your task is to write a cuTile kernel that copies a 2D matrix of shape (M, N). Each kernel program is responsible for copying a 2D slice of size (tile_M, tile_N) of the input matrix.

Verify the correctness of your kernel.

b) Run the kernel on input matrices where the first dimension (M) is fixed at 2048 and the last dimension (N) ranges from 16 to 128. Use a fixed tile height of tile_M = 64 and set tile_N = N (the full width).

For each shape, measure the kernel runtime and compute the effective memory bandwidth:

bandwidth (GB/s) = 2 * M * N * sizeof(element) / (time_s * 1e9)
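The formula can be checked with a small helper (bandwidth_gbs is a name chosen here; the factor 2 accounts for one read and one write of every element, and FP16 elements are 2 bytes):

```python
def bandwidth_gbs(M, N, elem_bytes, time_s):
    # A copy reads M*N elements and writes M*N elements -> factor 2.
    return 2 * M * N * elem_bytes / (time_s * 1e9)

# Example: M = 2048, N = 128, FP16 (2 bytes), 10 microseconds.
print(bandwidth_gbs(2048, 128, 2, 1e-5))  # -> 104.8576 GB/s
```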

Plot your results.

Optional Task

Run one of your programs from Task 4 with the CUDA_TILE_LOGS=CUTILEIR environment variable set. Search the log output for assume_div_by hints generated by the compiler and report why the compiler uses these hints.

Hint: Look at the make_tensor_view call generated by the compiler to understand what the different variables are used for.