# Assignment 03: Matrix Multiplication with cuTile

In this assignment you will implement and optimize GPU matrix multiplication kernels using [cuTile](https://github.com/nvidia/cutile-python). You will explore the impact of numeric precision, tiling strategies, and memory access patterns on performance.

All code should be written in `src/`, one file per task.

We assume the following import conventions:

```python
import cuda.tile as ct
import cupy as cp
import torch
import triton
```

Unless mentioned otherwise, use the datatype FP16 for both input matrices and accumulate in FP32.

---

## Task 1: FP32 vs FP16 Performance

a) **Your task** is to implement two cuTile kernels that each compute a matrix multiplication $A \times B = C$ with `shape(A) = (64, 4096)`, `shape(B) = (4096, 64)` and `shape(C) = (64, 64)`. 

1. **`kernel_fp16`**: Inputs `A` and `B` are `FP16`, accumulator/output `C` is `FP32`
2. **`kernel_fp32`**: Inputs `A` and `B` are `FP32`, accumulator/output `C` is `FP32`

Both kernels should use `ct.mma` to perform the tile-level multiply-accumulate. Use a single block (1 block in the grid) per kernel launch and use the fixed tile shape of `(m_tile=64, n_tile=64, k_tile=64)`, meaning that each `mma` instruction performs a square matrix multiplication of shape `m = n = k = 64`. Loop over the contraction dimension inside the kernel to accumulate results in the output matrix.

**Verify** that both kernels compute correct results via `torch.matmul` using `torch.allclose`.

b) Use `triton.testing.do_bench` (or an equivalent benchmark function from `torch` / `cupy`) to measure the average kernel runtime for both variants. **Report** the measured runtimes and the resulting **speedup** of `kernel_fp16` over `kernel_fp32`.

---

## Task 2: Simple Matrix Multiplication Kernel

Write a cuTile kernel that computes `C = A @ B` for input matrices:

- `A` of shape `(M, K)` 
- `B` of shape `(K, N)`
- `C` of shape `(M, N)`

**Requirements:**

- Each kernel program is responsible for producing one output tile of shape `(m_tile, n_tile)`
- The kernel should work with tile sizes that are specified by the calling function
- Map block IDs (BIDs) in row-major order: `BID 0` covers the top-left output tile, `BID 1` the tile to its right, and so on, wrapping to the next row when the current row of tiles is exhausted
- **The kernel should support matrix shapes that are not powers of 2**
- Use `ct.mma` for the inner accumulation step. The tile sizes correspond to the shape of the `ct.mma` instruction.

**Verify** correctness by comparing `C` against `torch.matmul(A, B)` using `torch.allclose`.

---

## Task 3: Benchmarking the Matrix Multiplication Kernel

Use the kernel from Task 2 for all benchmarks in this task. Report all performance numbers in **TFLOPS**, computed as:

$$
\text{TFLOPS} = \frac{2 \cdot M \cdot N \cdot K}{t_s \cdot 10^{12}}
$$

where $t_s$ is the measured kernel runtime in seconds.

a) Benchmark your kernel with tile shapes `(64, 64, 64)` for square matrix multiplications of sizes:

$$M = N = K \in \{256,\ 512,\ 1024,\ 2048,\ 4096,\ 8192\}$$

**Plot** the achieved TFLOPS and **report** your observations.

b) Fix the matrix size at `2048 × 2048 × 2048`, as well as `512 × 512 × 512`, and benchmark all tile shape combinations (27 total):

$$m\_{tile},\ n\_{tile},\ k\_{tile}\ \in \{32,\ 64,\ 128\}$$

**Visualize** your results as a **heatmap** with `m_tile` on one axis and `n_tile` on the other, fixing `k_tile = 64`.

**Report** the best-performing tile shape combination.

---

## Task 4: L2 Cache Optimization via Block Swizzling

a) **Your task** is to implement a swizzled matrix multiplication kernel. The requirements are the same as in _Task 2_, except block IDs should not be mapped in row-major order. Swizzle them for L2 cache reuse. You can assume a contraction dimension size of `4096`.

**Report** how you choose to map the BIDs and why. **Verify** correctness of the swizzled kernel against `torch.matmul`.

b) Repeat the tile shape sweep from _Task 3b_ for your swizzled kernel and **report** the best performing tile shape combination. **Compare** the performance of your swizzled kernel to the performance of your kernel from _Task 2_ for a matrix multiplication of shape `8192 × 8192 × 4096` (`m × n × k`).