# Assignment 02: GPU Architecture and cuTile

In this assignment you will explore GPU hardware characteristics and write tile-based kernels using [cuTile](https://github.com/nvidia/cutile-python). All code should be written in `src/`, one file per task. **Use tensors with datatype FP16 for all tasks**.

We assume the following import conventions:

```python
import cuda.tile as ct
import cupy as cp
```

---

## Task 1: GPU Device Properties

CuPy exposes all CUDA device attributes through a single dictionary. **Your task** is to use `cp.cuda.Device().attributes.items()` to **report** the values of `L2CacheSize`, `MaxSharedMemoryPerMultiprocessor`, and `ClockRate` on the DGX Spark (a minimal query sketch is included at the end of this assignment).

---

## Task 2: Matrix Reduction Kernel

a) **Your task** is to write a cuTile kernel that reduces a 2D input matrix of arbitrary shape `(M, K)` along its **last** dimension (`K`), producing a 1D output vector of shape `(M,)` that contains the per-row sums.

**Requirements:**

- Use either `ct.reduce` or `ct.sum` inside the kernel.
- Parallelize over the `M` dimension via the `grid`.
- Verify correctness by comparing the result to `torch.sum(mat, dim=1)` via `torch.allclose`.
- Since tiles can only have dimension sizes that are powers of 2, **zero-padding** inside the kernel may be necessary depending on the provided matrix shape.

b) **Report** the theoretical impact on parallelization and on the per-kernel-program workload as the `M` and `K` dimensions increase or decrease.

---

## Task 3: 4D Tensor Elementwise Addition

a) **Your task** is to write a cuTile kernel that adds two 4D tensors `A` and `B` element-wise and stores the result in `C`. All tensors have the identical shape `(M, N, K, L)`.

**Implement the kernel twice using the following two approaches:**

1. Each kernel program is responsible for computing the sum of a 2D output tile that covers dimensions `K` and `L`.
2. Each kernel program is responsible for computing the sum of a 2D output tile that covers dimensions `M` and `N`.

Parallelize over the remaining two dimensions in each respective case. Verify both implementations against PyTorch's native `A + B` via `torch.allclose`.

b) **Benchmark** both of your kernels and report the average runtimes via Triton's [`triton.testing.do_bench`](https://triton-lang.org/main/python-api/generated/triton.testing.do_bench.html) function or a benchmark function of your choice provided by `torch` or `cupy`; a minimal `do_bench` sketch is included at the end of this assignment. Use dimension sizes $|M| = 16$, $|N| = 128$, $|K| = 16$, $|L| = 128$. **Report** whether you observe any runtime differences and explain why.

---

## Task 4: Benchmarking Bandwidth

a) **Your task** is to write a cuTile kernel that copies a 2D matrix of shape `(M, N)`. Each kernel program is responsible for copying a 2D slice of size `(tile_M, tile_N)` of the input matrix. **Verify** the correctness of your kernel.

b) Run the kernel on input matrices where the **first dimension (`M`) is fixed at 2048** and the **last dimension (`N`) ranges from 16 to 128**. Use a fixed tile height of `tile_M = 64` and set `tile_N = N` (the full width). For each shape, measure the kernel runtime and compute the effective memory bandwidth (the factor of 2 accounts for reading every element once and writing it once):

```
bandwidth (GB/s) = 2 * M * N * sizeof(element) / (time_s * 1e9)
```

**Plot your results.**

---

## Optional Task

Run one of your programs from Task 4 with the `CUDA_TILE_LOGS=CUTILEIR` environment variable. Search for `assume_div_by` hints generated by the compiler. **Report** why the compiler uses these hints.
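For example, assuming your Task 4 program lives at `src/task4.py` (the filename is hypothetical), the IR dump can be produced with:

```
CUDA_TILE_LOGS=CUTILEIR python src/task4.py
```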
**Hint:** Look at the `make_tensor_view` call that was generated by the compiler to understand what the different variables are used for.
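For reference, here is the minimal query sketch for Task 1 mentioned above. The three attribute names are taken from the task description; everything else relies only on CuPy's documented `Device.attributes` property:

```python
import cupy as cp

# Attribute dictionary of the current CUDA device,
# mapping attribute names to integer values.
attrs = cp.cuda.Device().attributes

# Report the three attributes requested in Task 1.
# The two sizes are in bytes; ClockRate is in kHz.
for name in ("L2CacheSize", "MaxSharedMemoryPerMultiprocessor", "ClockRate"):
    print(f"{name}: {attrs[name]}")
```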
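And a minimal benchmarking sketch for Tasks 3b and 4b, assuming Triton is installed. The timed `dst.copy_(src)` is only a stand-in for your own cuTile kernel launch:

```python
import torch
from triton.testing import do_bench

# Example shape from Task 4b (N would be swept from 16 to 128).
M, N = 2048, 128

src = torch.randn(M, N, dtype=torch.float16, device="cuda")
dst = torch.empty_like(src)

# do_bench repeatedly runs the callable and returns the average
# runtime in milliseconds. Replace the lambda with your own
# kernel launch.
ms = do_bench(lambda: dst.copy_(src))

# Effective bandwidth: every element is read once and written
# once (factor 2), and FP16 elements occupy 2 bytes each.
bandwidth_gbs = 2 * M * N * 2 / (ms * 1e-3 * 1e9)
print(f"{ms:.4f} ms, {bandwidth_gbs:.2f} GB/s")
```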