FlashQLA releases TileLang linear-attention kernels with 2–3x forward speedups
Alibaba Qwen introduced FlashQLA, a TileLang-based linear-attention kernel stack that Qwen says delivers 2–3x faster forward passes and roughly 2x faster backward passes. The release gives edge and long-context deployments a new optimization lever below the model layer itself.

TL;DR
- Alibaba Qwen shipped FlashQLA on GitHub, a TileLang-based linear-attention kernel library that Alibaba_Qwen's launch thread says delivers 2 to 3 times faster forward passes and roughly 2 times faster backward passes for GDN chunked prefill.
- The main trick, according to Alibaba_Qwen's launch thread and the FlashQLA README, is automatic intra-card context parallelism that kicks in for tensor-parallel setups, long sequences, and small head counts.
- Qwen also says in Alibaba_Qwen's benchmark thread that FlashQLA beats both FLA Triton and FlashInfer in many long-context forward tests, with the biggest gains showing up in TP-heavy and smaller-model configurations.
- The release is narrowly targeted: the official repository lists SM90-or-newer GPUs, CUDA 12.8+, and PyTorch 2.8+ as requirements, which makes this a Hopper-class optimization story, not a generic drop-in kernel for every deployment.
You can browse the repo and inspect the raw H200 benchmark file, and the screenshots in Alibaba_Qwen's benchmark thread make the target pattern obvious: long context, chunked prefill, and Qwen-family head layouts. The README also ties FlashQLA directly to Gated Delta Network (GDN) chunked prefill, with TileLang handling the low-level code generation underneath.
Automatic context parallelism
FlashQLA is built for the Gated Delta Network attention path used in Qwen-family models, and the repo frames its biggest win as better SM utilization during chunked prefill on long sequences. Qwen's claim is that the gate structure itself lets the kernel trigger intra-card context parallelism automatically instead of requiring a separate manual decomposition step.
The implementation details in the README boil down to three changes:
- Gate-driven automatic intra-card context parallelism.
- An algebraic reformulation that cuts Tensor Core, CUDA Core, and SFU overhead.
- TileLang-built fused kernels with manual warpgroup specialization.
That combination is unusually specific for an open kernel drop. Qwen is not just publishing another attention primitive; it is publishing a Qwen-shaped optimization stack tuned around GDN chunked prefill and Hopper scheduling behavior.
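For readers who have not touched GDN before, the recurrence underneath all of this is small. The sketch below is a naive per-token PyTorch reference using one common formulation of the gated delta rule; it is not FlashQLA's kernel, and the shapes and gate conventions are assumptions, but it shows what the chunked-prefill kernels compute block-by-block with far better hardware utilization.

```python
import torch

def gated_delta_rule_reference(q, k, v, alpha, beta):
    """Naive per-token reference for the gated delta rule.

    One common formulation from the linear-attention literature, not
    FlashQLA's kernel; shapes and gate conventions here are assumptions.
      q, k:  (T, d_k) queries and keys
      v:     (T, d_v) values
      alpha: (T,) decay gate in (0, 1)
      beta:  (T,) update-strength gate in (0, 1)
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_v, d_k, dtype=q.dtype, device=q.device)  # running state
    out = torch.empty(T, d_v, dtype=q.dtype, device=q.device)
    for t in range(T):
        k_t, v_t = k[t], v[t]
        # Decay the old state, then nudge it toward v_t along k_t (delta rule).
        S = alpha[t] * (S - beta[t] * torch.outer(S @ k_t, k_t)) \
            + beta[t] * torch.outer(v_t, k_t)
        out[t] = S @ q[t]
    return out, S  # S is the state a chunked kernel would carry between blocks
```

Chunked prefill amounts to evaluating this recurrence a block of tokens at a time while carrying the state matrix between blocks, which is exactly where SM utilization and intra-card context parallelism become interesting.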
Two kernels, not one giant fusion
One of the more interesting tradeoffs sits in the launch copy itself. Qwen says it deliberately did not fuse the entire GDN flow into a single kernel, because that made context parallelism and backward efficiency harder to optimize on constrained hardware.
Instead, Alibaba_Qwen's launch thread says FlashQLA splits the flow into two specialized kernels, with an auto-triggered CP preprocess stage between them:
- Fused Solve.
- CP Preprocess.
- Fused Gated Delta Rule.
The diagram in the launch image shows CP preprocess as an auto-triggered middle stage, and the text adds the cost: more memory I/O overhead at large batch sizes versus a fully fused path. Qwen's bet is that the trade is worth it for edge devices, small models, and long-context runs, where occupancy and memory pressure hurt more than an extra write.
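As a way to picture that split, here is a minimal structural sketch. The stage names, signatures, and trigger thresholds are all hypothetical stand-ins for the launch diagram, not FlashQLA code; the real kernels are TileLang-generated, and the CP decision is described as being derived from the gate structure automatically.

```python
# Placeholder stages standing in for the launch-diagram boxes; the real
# kernels are TileLang-generated and do actual work.
def fused_solve(x):            return x   # kernel 1: per-chunk solve
def cp_preprocess(x):          return x   # auto-triggered CP middle stage
def fused_gated_delta_rule(x): return x   # kernel 2: gated delta rule

def gdn_prefill(x, seq_len: int, heads_per_card: int):
    """Structural sketch of the staged flow (hypothetical, not FlashQLA's API).

    Tensor parallelism shards heads across cards, so a long sequence with few
    local heads is where splitting the sequence inside one card helps keep SMs
    busy, at the cost of extra memory I/O for the intermediate tensors."""
    x = fused_solve(x)
    if seq_len >= 8192 and heads_per_card <= 4:   # illustrative thresholds only
        x = cp_preprocess(x)
    return fused_gated_delta_rule(x)
```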
The backward pass got its own custom pipeline. According to Alibaba_Qwen's launch thread, the team built a 16-stage warp-specialized pipeline under tight on-chip memory limits to get the reported 2x kernel-level speedup.
Benchmark shape
The benchmark file gives the launch numbers more texture than the headline tweet. On an H200, the official benchmark table compares FlashQLA against FLA Triton and FlashInfer across Qwen3.5 and Qwen3.6-style head layouts from TP1 to TP8.
A few patterns stand out:
- Against FLA Triton, forward speedups regularly land around 2x or higher, and reach 4.74x on a 397B or 122B TP8 run at sequence length 8192.
- Against FlashInfer, FlashQLA is often faster on single long-sequence cases, but not universally faster on smaller batched cases. In the same table, some TP1 and TP2 multi-batch rows show FlashInfer ahead.
- The README says the gains are especially pronounced in pretraining and edge-side agentic inference, which matches the charts in Alibaba_Qwen's benchmark thread more than a blanket "faster everywhere" claim.
That caveat matters because the release is fairly honest about where the optimization lives. This is a chunked-prefill kernel story with model-family-specific head assumptions, not a universal attention benchmark crown.
Hardware envelope and API
The repo sets a tight deployment envelope: SM90+, CUDA 12.8+, and PyTorch 2.8+. The benchmark notes also pin specific library versions, including FLA 0.5.0, FlashInfer 0.6.9, and TileLang 0.1.8.
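If you want a quick sanity check before trying it, those requirements translate into a few lines of PyTorch introspection. This snippet is a convenience sketch, not something shipped with FlashQLA, and it assumes a CUDA build of PyTorch with a visible GPU:

```python
import torch

def meets_flashqla_requirements() -> bool:
    """Check the envelope from the repo: SM90+ GPU, CUDA 12.8+, PyTorch 2.8+."""
    sm = torch.cuda.get_device_capability()                        # (9, 0) on Hopper
    cuda = tuple(int(x) for x in torch.version.cuda.split(".")[:2])
    pt = tuple(int(x) for x in torch.__version__.split(".")[:2])
    return sm >= (9, 0) and cuda >= (12, 8) and pt >= (2, 8)

print(meets_flashqla_requirements())
```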
The API surface is small. The README exposes a high-level chunk_gated_delta_rule call plus lower-level forward and backward entrypoints, with support for optional initial state and variable-length sequences through cu_seqlens.
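Here is what that surface looks like in practice, as a hedged sketch: the chunk_gated_delta_rule name, the optional initial state, and cu_seqlens all come from the README description above, but the import path, gate argument names, and tensor layout below are assumptions rather than the documented signature, so check the repo before copying.

```python
import torch
from flashqla import chunk_gated_delta_rule  # assumed import path

B, T, H, Dk, Dv = 1, 8192, 4, 128, 128
q = torch.randn(B, T, H, Dk, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, T, H, Dk, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, T, H, Dv, device="cuda", dtype=torch.bfloat16)
g = torch.rand(B, T, H, device="cuda").log()       # assumed log-decay gate
beta = torch.rand(B, T, H, device="cuda")          # assumed update-strength gate

# Two packed sequences of 4096 tokens each, marked by cumulative offsets.
cu_seqlens = torch.tensor([0, 4096, 8192], device="cuda", dtype=torch.int32)

out = chunk_gated_delta_rule(
    q, k, v, g, beta,
    initial_state=None,     # optional carried recurrent state (per the README)
    cu_seqlens=cu_seqlens,  # variable-length support (per the README)
)
```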
That makes FlashQLA feel less like a research artifact and more like a swappable kernel package for teams already running GDN-based Qwen models on Hopper-class hardware.