# Assignment 02: GPU Architecture and cuTile

In this assignment you will explore GPU hardware characteristics and write tile-based kernels using [cuTile](https://github.com/nvidia/cutile-python). All code should be written in `src/`, one file per task. **Use tensors with datatype FP16 for all tasks**.

We assume the following import conventions:

```python
import cuda.tile as ct
import cupy as cp
```

---

## Task 1: GPU Device Properties

CuPy exposes all CUDA device attributes through a single dictionary. **Your task** is to use `cp.cuda.Device().attributes.items()` to **report** the values of `L2CacheSize`, `MaxSharedMemoryPerMultiprocessor`, and `ClockRate` on the DGX Spark (a minimal query sketch is included at the end of this assignment).

---

## Task 2: Matrix Reduction Kernel

a) **Your task** is to write a cuTile kernel that reduces a 2D input matrix of arbitrary shape `(M, K)` along its **last** dimension (`K`), producing a 1D output vector of shape `(M,)` that contains the per-row sums.

**Requirements:**

- Use either `ct.reduce` or `ct.sum` inside the kernel.
- Parallelize over the `M` dimension via the `grid`.
- Verify correctness by comparing the result to `torch.sum(mat, dim=1)` via `torch.allclose`.
- Since tiles can only have dimension sizes that are powers of 2, **zero-padding** inside the kernel may be necessary depending on the provided matrix shape.

b) **Report** the theoretical impact on parallelization and on the per-kernel-program workload as the `M` and `K` dimensions increase or decrease.

---

## Task 3: 4D Tensor Elementwise Addition

a) **Your task** is to write a cuTile kernel that adds two 4D tensors `A` and `B` element-wise and stores the result in `C`. All tensors have the identical shape `(M, N, K, L)`.

**Implement the kernel twice using the following two approaches:**

1. Each kernel program is responsible for computing the sum of a 2D output tile that covers dimensions `K` and `L`.
2. Each kernel program is responsible for computing the sum of a 2D output tile that covers dimensions `M` and `N`.

Parallelize over the remaining two dimensions in each respective case. Verify both implementations against PyTorch's native `A + B` via `torch.allclose`.

b) **Benchmark** both of your kernels and report the average runtimes via Triton's [`triton.testing.do_bench`](https://triton-lang.org/main/python-api/generated/triton.testing.do_bench.html) function or a benchmark function of your choice provided by `torch` or `cupy`; a minimal `do_bench` sketch is included at the end of this assignment. Use dimension sizes $|M| = 16$, $|N| = 128$, $|K| = 16$, $|L| = 128$. **Report** whether you observe any runtime differences and explain why.

---

## Task 4: Benchmarking Bandwidth

a) **Your task** is to write a cuTile kernel that copies a 2D matrix of shape `(M, N)`. Each kernel program is responsible for copying a 2D slice of size `(tile_M, tile_N)` of the input matrix. **Verify** the correctness of your kernel.

b) Run the kernel on input matrices where the **first dimension (`M`) is fixed at 2048** and the **last dimension (`N`) ranges from 16 to 128**. Use a fixed tile height of `tile_M = 64` and set `tile_N = N` (the full width). For each shape, measure the kernel runtime and compute the effective memory bandwidth (the factor of 2 accounts for reading every element once and writing it once):

```
bandwidth (GB/s) = 2 * M * N * sizeof(element) / (time_s * 1e9)
```

**Plot your results.**

---

## Optional Task

Run one of your programs from Task 4 with the `CUDA_TILE_LOGS=CUTILEIR` environment variable. Search for `assume_div_by` hints generated by the compiler. **Report** why the compiler uses these hints.
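For example, assuming your Task 4 program lives at `src/task4.py` (the filename is hypothetical), the IR dump can be produced with:

```
CUDA_TILE_LOGS=CUTILEIR python src/task4.py
```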
**Hint:** Look at the `make_tensor_view` call that was generated by the compiler to understand what the different variables are used for.
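For reference, here is the minimal query sketch for Task 1 mentioned above. The three attribute names are taken from the task description; everything else relies only on CuPy's documented `Device.attributes` property:

```python
import cupy as cp

# Attribute dictionary of the current CUDA device,
# mapping attribute names to integer values.
attrs = cp.cuda.Device().attributes

# Report the three attributes requested in Task 1.
# The two sizes are in bytes; ClockRate is in kHz.
for name in ("L2CacheSize", "MaxSharedMemoryPerMultiprocessor", "ClockRate"):
    print(f"{name}: {attrs[name]}")
```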
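And a minimal benchmarking sketch for Tasks 3b and 4b, assuming Triton is installed. The timed `dst.copy_(src)` is only a stand-in for your own cuTile kernel launch:

```python
import torch
from triton.testing import do_bench

# Example shape from Task 4b (N would be swept from 16 to 128).
M, N = 2048, 128

src = torch.randn(M, N, dtype=torch.float16, device="cuda")
dst = torch.empty_like(src)

# do_bench repeatedly runs the callable and returns the average
# runtime in milliseconds. Replace the lambda with your own
# kernel launch.
ms = do_bench(lambda: dst.copy_(src))

# Effective bandwidth: every element is read once and written
# once (factor 2), and FP16 elements occupy 2 bytes each.
bandwidth_gbs = 2 * M * N * 2 / (ms * 1e-3 * 1e9)
print(f"{ms:.4f} ms, {bandwidth_gbs:.2f} GB/s")
```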