# Assignment 02: GPU Architecture and cuTile


The file `assignments/02_assignment/src/__main__.py` contains the main function that runs all the tasks for this assignment. Each task is implemented in a separate file in the same directory. The results of each task are printed to the console when the main function is executed.


## Task 1: GPU Device Properties


```{literalinclude} ../../assignments/02_assignment/src/task1.py
:language: python
```

**Output:**
```
CUDA Device Attributes:
        ClockRate: 2418000
        L2CacheSize: 25165824
        MaxSharedMemoryPerMultiprocessor: 102400
```


## Task 2: Matrix Reduction Kernel
a) cuTile kernel that reduces a 2D input matrix of arbitrary shape `(M, K)` along its **last** dimension (`K`)

```{literalinclude} ../../assignments/02_assignment/src/task2.py
:language: python
```

b) As `M` increases, the parallelization increases as more Streaming Multiprocessors (SM) are active. The theoratically performace maximum is when all SM's are used.
The per-kernel-process load increases with `K`, because every thread needs to load more data from the memory and performs more operations. The increase is non-linear due to the requirement that tile sizes must be powers of 2.
Any K that is not a power of 2 requires zero-padding to the next power of 2, which introduces computational overhead.


## Task 3: 4D Tensor Elementwise Addition

a) cuTile kernel that adds two 4D tensors `A` and `B` element-wise and stores the result in `C`. All tensors have identical shape and dimensions `(M, N, K, L)`. 

```{literalinclude} ../../assignments/02_assignment/src/task3.py
:language: python
```

b) **Benchmark** 
```{literalinclude} ../../assignments/02_assignment/src/task3_benchmark.py
:language: python
```
**Output:**
```
tensor_add_KL benchmark: 0.39 ms
tensor_add_MN benchmark: 0.67 ms
```

The `tensor_add_KL` kernel is faster than the `tensor_add_MN` kernel. This is because the data accessed per kernel program is continuous in memory.

---

## Task 4: Benchmarking Bandwidth

a) cuTile kernel that copies a 2D matrix of shape `(M,N)`

```{literalinclude} ../../assignments/02_assignment/src/task4.py
:language: python
```

b)
Benchmarking 
```{literalinclude} ../../assignments/02_assignment/src/task4_benchmark.py
:language: python
```

**Output:**
```{literalinclude} ../../assignments/02_assignment/src/task4_benchmark.out
```
![alt text](../../assignments/02_assignment/src/task4_benchmark.png)
Because the bandwidth was still increasing at `N=128`, we included a few more points up to `N=2048` to show the trend more clearly. The bandwidth increases as `N` increases, which is expected because larger matrices allow for better utilization of the GPU's memory bandwidth. However, the bandwidth reaches its peak at around `N=512` and then starts to decrease. Probably this is the point we reach hardware limits of the GPU.