AI Primer
release

DeepSeek releases Tile Kernels with Engram, mHC, and FP4/FP8 ops for SM90 and SM100 GPUs

DeepSeek published Tile Kernels, an open-source TileLang repo covering Engram, mHC, MoE routing, and FP4/FP8 kernels, with claims that some are already used in internal training and inference. That matters because it exposes reusable low-level performance work behind DeepSeek’s stack instead of keeping the kernels fully private.


TL;DR

  • DeepSeek published TileKernels on GitHub, an MIT-licensed TileLang repo of optimized LLM GPU kernels; scaling01's repo screenshot highlights DeepSeek's claim that some kernels were already used in internal training and inference.
  • The repo is not a grab bag of toy ops: LLMpsycho's summary calls out MoE routing, FP8 and FP4 quantization, gating, and hyper-connection kernels, with support aimed at SM90 and SM100 GPUs.
  • Early community reads centered on Engram and mHC, but eliebakouch's follow-up says the V4 stack uses mHC and not Engram, which makes the release more like a partial kernel dump than a direct architectural recipe.
  • One day later, lmsysorg's DeepSeek V4 on SGLang post tied the same low-level work into serving and RL infrastructure, including reused TileLang mHC kernels, Flash Compressor, Lightning TopK, and MegaMoE.

You can inspect the repo README, skim the SGLang DeepSeek V4 cookbook, and check the DeepSeek V4 model cards that landed alongside the broader release. The fun bit is that DeepSeek shipped kernels first, then the next wave of posts and docs started showing where pieces of that work fit: mHC appears again in the V4 serving stack, while Engram seems to be in the open repo without being part of V4's actual architecture.

TileKernels

DeepSeek's README says the project contains optimized GPU kernels for LLM operations built with TileLang, and adds two notable caveats: most kernels are near hardware limits for compute intensity and memory bandwidth, and the code does not represent best practices yet.

That combination is catnip for systems people. DeepSeek is publishing kernels it says were used internally while also saying the code quality and docs are still being cleaned up, a much rarer posture than the usual polished benchmark repo.

SM90 and SM100 coverage

The public summaries point to four buckets in the repo:

  • MoE routing kernels
  • FP8 and FP4 quantization kernels
  • Gating kernels
  • Hyper-connection kernels such as mHC

The same inventory says the repo targets SM90 and SM100 GPUs, so this is squarely Hopper and Blackwell-era plumbing rather than a portability-first release.
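The repo's actual kernels are TileLang GPU code, but the math behind one of those buckets is easy to show on its own. As a rough NumPy sketch of what a per-block FP8 quantization step computes (function names, the 128-element block size, and the float32 stand-in for real FP8 bytes are all illustrative, not taken from the repo):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_fp8_blockwise(x: np.ndarray, block: int = 128):
    """Per-block scaling into the FP8 E4M3 range (illustrative sketch).

    Real kernels do this on-GPU and emit packed FP8 bytes; here we only
    compute the per-block scales and the scaled values in float32.
    """
    x = x.reshape(-1, block)
    # One scale per block, so an outlier in one block does not crush
    # the precision of every other block.
    scales = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(x / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.astype(np.float32), scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q * scales

vals = (np.random.randn(4, 128) * 10).astype(np.float32)
q, s = quantize_fp8_blockwise(vals.ravel())
restored = dequantize(q, s).reshape(vals.shape)
```

The per-block scale is the part that matters for kernel design: it is what the GPU code has to compute with a block-level max reduction before the elementwise scaling pass.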

Engram versus mHC

Early reactions treated Engram and mHC as the headline primitives in the drop, and teortaxesTex's early post framed the repo around both. A day later, eliebakouch's architecture note argued that DeepSeek V4 uses mHC as a residual-mixing mechanism and does not use Engram.

That matters mostly as a reading aid. The repo exposes multiple internal ingredients, but it does not map one-to-one to the architecture DeepSeek just shipped elsewhere.

Where the kernels show up next

SGLang's day-zero V4 post is the clearest clue that TileKernels is part of a larger open stack, not an isolated code drop. The graphic and thread tie V4 serving and RL training to several concrete components:

  • TileLang mHC plus split-k kernels reused for Miles RL training
  • Flash Compressor, described as 10x faster than naive implementations
  • Lightning TopK, listed at 15 microseconds for 1M context indexing
  • MegaMoE and MXFP4 MoE kernels in the serving path
  • ShadowRadix and HiSparse for long-context KV handling
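At its core, the MoE routing and gating work those components fuse is a top-k selection over expert logits plus a renormalized softmax. A minimal NumPy sketch of that computation (expert count, k, and function names are made up for illustration; the fused GPU kernels avoid materializing the full softmax):

```python
import numpy as np

def route_topk(logits: np.ndarray, k: int = 2):
    """Pick the top-k experts per token and renormalize their gates.

    logits: [tokens, experts] router scores.
    Returns (expert indices [tokens, k], gate weights [tokens, k]).
    """
    # Indices of the k largest logits per token (order within k unsorted).
    topk_idx = np.argpartition(-logits, k, axis=1)[:, :k]
    topk_logits = np.take_along_axis(logits, topk_idx, axis=1)
    # Softmax over only the selected k, so gate weights sum to 1.
    e = np.exp(topk_logits - topk_logits.max(axis=1, keepdims=True))
    gates = e / e.sum(axis=1, keepdims=True)
    return topk_idx, gates

rng = np.random.default_rng(0)
idx, gates = route_topk(rng.normal(size=(4, 8)), k=2)
```

Each token's hidden state would then be dispatched to the experts in `idx` and recombined with the weights in `gates`, which is the scatter/gather pattern MoE kernels spend their effort optimizing.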

That gives the TileKernels release a second read: not just reusable low-level code, but a public slice of the kernel layer under DeepSeek's newer serving and training stack.
