Session

Kraken: Hackable Triton Kernels for Computation and Multi-GPU Communication Fusion

Modern GPUs are so fast that reaching peak performance demands kernel fusion, not just across compute operations but also between computation and multi-GPU communication inside a single kernel. Achieving this requires efficient in-kernel messaging at the tile/threadblock level and easy integration with existing compute kernels.

We introduce Kraken, a collection of hackable Triton kernels that overlap computation and communication using symmetric memory-style in-kernel communication. Kraken delivers state-of-the-art performance compared to AsyncTP-style fusion ops, while providing full flexibility for both intra-node (NVLink) and inter-node (GPUDirect RDMA) peer-to-peer transfers.

Rather than a rigid framework, Kraken is a hands-on tutorial: developers can embed its techniques into xformers, FlashAttention, TorchInductor-generated kernels, or any custom Triton code. Kraken preserves CUDA graph compatibility and makes prologue/epilogue fusion far more flexible. Though Kraken currently targets NVIDIA-specific APIs, it is designed for future expansion to heterogeneous hardware across the Triton ecosystem.

Surya Subramanian

Meta, Software Engineering Intern. CS @ Georgia Tech.
