Assignment 04: Tensor Contractions on GPUs
In this assignment, you will implement and optimize GPU tensor-contraction kernels using cuTile. You will explore the effects of parallelization strategies, primitive-size merging, and kernel fusion on performance.
All code should be written in src/, one file per task.
We assume the following import conventions:
import cuda.tile as ct
import cupy as cp
import torch
import triton
Use the FP16 data type for tensor inputs and outputs, and accumulate in FP32. We assume row-major order for all tensors.
Be careful when choosing dimension sizes for the following tasks. Before allocating anything, assert that all tensors together fit in memory (less than 32 GiB). If the machine runs out of memory, it crashes, which affects you and other students. If this happens, please notify us via the Matrix channel!
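For example, a quick sanity check before allocating anything (a minimal sketch; the helper name and the example sizes below are placeholders, FP16 means 2 bytes per element):

import math

def assert_fits_in_memory(shapes, bytes_per_element=2, limit_gib=32):
    """Assert that all FP16 tensors together stay below the memory limit."""
    total_bytes = sum(math.prod(shape) for shape in shapes) * bytes_per_element
    assert total_bytes < limit_gib * 2**30, (
        f"Tensors need {total_bytes / 2**30:.1f} GiB, limit is {limit_gib} GiB"
    )

# Example: shapes of A (eabklxy), B (ecklyz), and C (eabcxz) for some test sizes
e, a, b, c, k, l, x, y, z = 4, 8, 8, 8, 16, 16, 64, 64, 64
assert_fits_in_memory([(e, a, b, k, l, x, y), (e, c, k, l, y, z), (e, a, b, c, x, z)])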
Task 1: Tiled Contraction Kernel Variants
a) Classify all dimensions in the einsum string eabklxy, ecklyz -> eabcxz.
b) Implement a cuTile kernel that computes the contraction eabklxy, ecklyz -> eabcxz. Use dimensions xyz as your GEMM dimensions. Sequentialize all other K-dimensions and parallelize the remaining dimensions. The kernel should work with arbitrary dimension sizes; you can pass them to your kernel as function arguments.
c) Implement a cuTile kernel that computes the contraction eabklxy, ecklyz -> eabcxz. Use dimensions xyz as your GEMM dimensions. Sequentialize all other K-dimensions as well as the b dimension, and parallelize the remaining dimensions. The kernel should work with arbitrary dimension sizes; you can pass them to your kernel as function arguments.
Find one configuration (dimension sizes) where your kernel from b) performs better and one configuration where your new kernel from c) performs better.
d) Implement a cuTile kernel that computes the contraction eabklxy, ecklyz -> eabcxz. Use dimensions xyzl as your GEMM dimensions by permuting the input tiles of the ct.mma instruction and reshaping them so that y and l are merged into a single contraction dimension (see the reference sketch below).
Find one configuration (dimension sizes) where your kernel from b) performs better and one configuration where your new kernel from d) performs better.
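To illustrate what merging y and l means, here is a torch-only reference (a sketch of the math, not cuTile code; the sizes are placeholders): the contraction is unchanged if l is moved next to y and the pair is collapsed into one contraction dimension of size |l|*|y|.

import torch

e, a, b, c, k, l, x, y, z = 2, 3, 4, 5, 6, 7, 8, 9, 10
A = torch.randn(e, a, b, k, l, x, y)
B = torch.randn(e, c, k, l, y, z)
ref = torch.einsum("eabklxy,ecklyz->eabcxz", A, B)

# Move l next to y in A and collapse (l, y) into one dimension of size |l|*|y|.
A_m = A.permute(0, 1, 2, 3, 5, 4, 6).reshape(e, a, b, k, x, l * y)  # eabkx(ly)
B_m = B.reshape(e, c, k, l * y, z)                                   # eck(ly)z
out = torch.einsum("eabkxm,eckmz->eabcxz", A_m, B_m)
assert torch.allclose(ref, out, rtol=1e-4, atol=1e-4)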
e) Implement a cuTile kernel that computes the contraction eabklxy, ecklyz -> eabcxz. Use dimensions exyz as your GEMM dimensions, meaning that you perform a 3D ct.mma inside the kernel. Sequentialize all other K-dimensions and parallelize the remaining dimensions. The kernel should work with arbitrary dimension sizes.
Verify every kernel variant against torch.einsum().
Use triton.testing.do_bench (or a similar benchmark function provided by torch/cupy) for all benchmarks.
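A possible verification and benchmarking harness for all kernel variants (a sketch; launch_contraction is a placeholder for whichever variant you are testing, and torch tensors on the GPU are assumed):

import torch
import triton

def check_and_bench(launch_contraction, A, B):
    """Compare a kernel variant against torch.einsum and report its runtime in ms."""
    ref = torch.einsum("eabklxy,ecklyz->eabcxz", A.float(), B.float()).half()
    out = launch_contraction(A, B)
    torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)
    return triton.testing.do_bench(lambda: launch_contraction(A, B))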
Task 2: Kernel Fusion
a) Implement a cuTile kernel for the contraction eabklxy, ecklyz -> eabcxz that fuses an elementwise multiplication with a tensor D of shape eabcxz into the contraction. The contraction output may be overwritten by the multiplication.
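A torch reference for the fused result (a sketch with placeholder sizes; contraction in FP32, then elementwise scaling with D, cast back to FP16):

import torch

e, a, b, c, k, l, x, y, z = 2, 3, 4, 5, 6, 7, 8, 9, 10
A = torch.randn(e, a, b, k, l, x, y, dtype=torch.float16)
B = torch.randn(e, c, k, l, y, z, dtype=torch.float16)
D = torch.randn(e, a, b, c, x, z, dtype=torch.float16)
ref = (torch.einsum("eabklxy,ecklyz->eabcxz", A.float(), B.float()) * D.float()).half()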
b) Implement a kernel that computes the elementwise multiplication only. Compare the runtime of your fused kernel with sequentially calling the cuTile contraction kernel followed by the elementwise-multiplication kernel. Choose tensor sizes such that the FLOP count of the contraction is similar to that of a 2048x2048x2048 matrix multiplication.
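One way to pick such sizes (a sketch; the formula counts one multiply and one add per inner-product term, and the configuration below is only an example that happens to match the target exactly):

# FLOPs of an M x N x K matmul: 2*M*N*K. Target: a 2048^3 GEMM.
target_flops = 2 * 2048**3

def contraction_flops(e, a, b, c, k, l, x, y, z):
    # The output has e*a*b*c*x*z elements, each reduced over k*l*y terms.
    return 2 * (e * a * b * c * x * z) * (k * l * y)

cfg = dict(e=4, a=8, b=8, c=8, k=16, l=16, x=32, y=16, z=32)
print(contraction_flops(**cfg) / target_flops)  # -> 1.0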
Task 3: GEMM Dimension Size Sweep
a) Implement a cuTile kernel that computes the contraction ackm, bcnk -> abnm. Assume fixed dimension sizes |a| = 16, |b| = 16, and |c| = 32. The kernel should be able to handle arbitrary sizes for dimensions m, n, and k.
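A torch reference for correctness checks (a sketch using the fixed sizes from above and arbitrary, non-power-of-two m, n, k):

import torch

a, b, c = 16, 16, 32        # fixed dimension sizes
m, n, k = 100, 77, 53       # arbitrary, including non-power-of-two sizes
A = torch.randn(a, c, k, m, dtype=torch.float16)
B = torch.randn(b, c, n, k, dtype=torch.float16)
ref = torch.einsum("ackm,bcnk->abnm", A.float(), B.float()).half()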
b) Perform the following benchmarks, visualize your results and explain your findings:
Assume dimension sizes |k| = 64, |m| = 64. Benchmark a dimension size sweep for dimension n, ranging from 17 to 129 (non-power-of-two sizes included).
Assume dimension sizes |m| = 64, |n| = 64. Benchmark a dimension size sweep for dimension k, ranging from 17 to 129 (non-power-of-two sizes included).
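A sketch of the sweep harness, shown for the n sweep (the k sweep is analogous); launch_contraction is a placeholder for your Task 3a kernel call, and matplotlib is assumed for the visualization:

import torch
import triton
import matplotlib.pyplot as plt

def sweep_n(launch_contraction, k=64, m=64, n_range=range(17, 130)):
    """Benchmark the ackm,bcnk->abnm kernel over a sweep of the n dimension."""
    a, b, c = 16, 16, 32
    sizes, times = [], []
    for n in n_range:
        A = torch.randn(a, c, k, m, dtype=torch.float16, device="cuda")
        B = torch.randn(b, c, n, k, dtype=torch.float16, device="cuda")
        times.append(triton.testing.do_bench(lambda: launch_contraction(A, B)))
        sizes.append(n)
    plt.plot(sizes, times)
    plt.xlabel("|n|")
    plt.ylabel("runtime [ms]")
    plt.savefig("sweep_n.png")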