DeepSeek removes visual-primitives repo hours after publishing 90-KV vision details
DeepSeek briefly published a paper and threads on point-and-bbox reasoning, about 90 KV entries per 800² image, and RL-trained vision experts, then removed the repo and related mentions. The technique looked like a low-token path to computer use and multimodal reasoning in V4-Flash, but availability and reproducibility are now unclear.

TL;DR
- DeepSeek briefly published an official Thinking with Visual Primitives repository, and teortaxesTex's early post captured the core claim: the model interleaves points and bounding boxes directly inside the reasoning trace.
- According to teortaxesTex's follow-up and scaling01's chart screenshot, DeepSeek's V4-Flash vision stack compresses an 800×800 image down to roughly 90 KV cache entries, which is the number that made engineers immediately read this as a cheap computer-use play.
- nrehiew_'s thread opener and nrehiew_'s architecture note describe a simple high-level stack: a ViT encoder, spatial token compression with window size 9, then concatenated image and text tokens.
- The training recipe was unusually detailed for a DeepSeek paper. nrehiew_'s data summary, nrehiew_'s reward-model breakdown, and nrehiew_'s post-training note point to web-crawled image data, specialized bbox and point experts, RL without extra reasoning supervision, then on-policy distillation back into the base model.
- Hours later, teortaxesTex's repo screenshot and teortaxesTex's deletion post said the GitHub repo and mentions of the vision paper had disappeared, leaving the method documented mainly through mirrors, screenshots, and community notes.
You can still read Exa's cached view of the official GitHub repo, which said DeepSeek planned to release in-house benchmarks, a subset of cold-start data, and eventually integrate the weights into its foundation model.
There was already a visible product surface too: testingcatalog's beta screenshot showed a new Vision tab in DeepSeek Chat, while a TechNode report described limited testing on web and app.
For background on why the 1M-context angle mattered so much here, Jia-Bin Huang's DeepSeek V4 attention breakdown and AlphaSignal's V4 cache analysis both frame V4 as a deployment-first long-context architecture.
Vision beta
Before the paper appeared, DeepSeek had already started surfacing vision in product. testingcatalog's beta screenshot showed a Vision tab next to Instant and Expert in DeepSeek Chat, and niallohiggins' beta note described image recognition starting to roll out.
The cached official repo said the weights would be integrated into DeepSeek's foundation model rather than shipped as a separate standalone release. That matches the product direction more than a one-off research demo.
Visual primitives
The paper's core move was to make pointing part of thought, not just part of the final answer. In teortaxesTex's capture of the repo intro, DeepSeek described the problem as a "Reference Gap," where language is too ambiguous to track dense spatial layouts.
The mechanism splits into two primitives:
- Bounding boxes for grounded object references and spatial relations, per nrehiew_'s summary.
- Point coordinates for path tracing, topology, and fine-grained spatial steps, per the same summary.
That sounds abstract until you look at the tasks.
One of the shared path-tracing examples shows the model tracing a line as a sequence of coordinates, rather than narrating the route in prose first and hoping the language stays anchored.
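The exact trace syntax is no longer public, so the structure below is a purely illustrative sketch of what an interleaved trace could look like; the keys, coordinate conventions, and wording are assumptions, not DeepSeek's format.

```python
# Purely illustrative: a reasoning trace that interleaves text with points
# and bounding boxes, in the spirit described above. The keys ("point",
# "bbox") and pixel conventions are assumptions, not DeepSeek's schema.
trace = [
    {"text": "Locate the start of the line."},
    {"point": (112, 640)},                    # assumed (x, y) pixel coordinate
    {"text": "Follow it through the junction."},
    {"point": (305, 512)},
    {"text": "It ends at the blue button."},
    {"bbox": (560, 180, 690, 230)},           # assumed (x1, y1, x2, y2)
    {"text": "Answer: the path terminates at the blue button."},
]

for step in trace:
    print(step)
```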
90 KV entries
The number everybody latched onto was the image budget. teortaxesTex's follow-up said an 800² image lands at about 90 KV cache entries, and scaling01's chart screenshot showed DeepSeek's model at roughly 90 entries versus far larger token footprints for Gemini-3-Flash, Claude-Sonnet-4.6, GPT-5.4, and Qwen3-VL on the same 800×800 input.
At the architecture level, nrehiew_'s architecture note described a patch size 14 ViT encoder plus spatial token compression with window size 9. The repo highlight Exa captured adds one more useful detail: DeepSeek said it compresses every 4 visual tokens into a single KV entry in a system built on V4-Flash.
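Those three numbers roughly reconcile with the 90-entry figure. A back-of-the-envelope check, assuming the window of 9 means 9 ViT tokens merged into one visual token and that counts round down, lands almost exactly on 90:

```python
# Back-of-the-envelope token budget for an 800x800 image, using only the
# figures quoted above. Assumptions (mine, not the paper's): the window of
# size 9 merges 9 ViT tokens into one visual token, and counts round down.
image_side = 800
patch = 14            # ViT patch size (quoted)
window = 9            # ViT tokens per visual token (assumed reading)
tokens_per_kv = 4     # visual tokens per KV entry (quoted)

patches_per_side = image_side // patch        # 57
vit_tokens = patches_per_side ** 2            # 3249
visual_tokens = vit_tokens // window          # 361
kv_entries = visual_tokens // tokens_per_kv   # 90

print(vit_tokens, visual_tokens, kv_entries)  # 3249 361 90
```

That the arithmetic lands on 90 is suggestive, but it is a reading of the numbers, not confirmation that this is how the two compression stages actually compose.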
This sat on top of a model family already optimized for long contexts. AlphaSignal's V4 cache analysis says V4-Flash gets 1M context with about 7% of V3.2's KV cache, while Jia-Bin Huang's video breakdown walks through the hybrid attention design behind that memory drop. Put together, the vision paper looked like DeepSeek finding a way to bolt image reasoning onto a deployment stack that already cared obsessively about cache size.
RL pipeline
The most useful part of the short-lived rollout was how much training detail it exposed. nrehiew_'s data summary says most data was web-crawled rather than synthetic, with trillions of image tokens, plus programmatic tasks like 460K generated mazes and path tracing along Bézier curves.
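DeepSeek's actual generators were not released, but a minimal sketch shows how cheap this kind of programmatic data is to produce; everything here beyond "cubic Bézier curve, sampled coordinates" is an assumption, including the sample structure and point count.

```python
import random

def bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (round(x), round(y))

def make_path_tracing_sample(size=800, n_points=8):
    """One synthetic path-tracing sample: random control points plus the
    ordered pixel coordinates a model should emit when tracing the curve."""
    controls = [(random.randint(0, size - 1), random.randint(0, size - 1))
                for _ in range(4)]
    trace = [bezier_point(*controls, t=i / (n_points - 1)) for i in range(n_points)]
    return {"controls": controls, "trace": trace}

print(make_path_tracing_sample())
```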
The post-training recipe breaks into a pretty clean pipeline:
- Train two specialists, one box-based and one point-based, according to nrehiew_'s post-training thread.
- Run RL on both experts with no extra reasoning supervision, per the same thread.
- Score rollouts with task-specific reward models, including format checks, LLM judges, soft counting rewards, maze validity checks, and path-tracing accuracy, per nrehiew_'s reward list.
- Distill those expert rollouts back into the base model, then use on-policy distillation because the unified model still lagged the specialists, per nrehiew_'s distillation note.
One detail worth bookmarking lives inside teortaxesTex's reward-model screenshot: DeepSeek explicitly mentions trying to catch reward hacking, including cases where the model fabricates a fake ground truth to match its own prediction.
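None of the reward code shipped, but a toy verifier makes the shape of that pipeline concrete: a format check on the rollout, a task-specific score, and ground truth that only ever comes from the task generator, which is the simplest guard against the fabricated-ground-truth failure mode. The function names, regex, tolerance, and scoring below are assumptions, not DeepSeek's implementation.

```python
import math
import re

POINT_RE = re.compile(r"\((\d+),\s*(\d+)\)")

def parse_points(rollout: str):
    """Format check: the rollout must contain (x, y) points to be scored at all."""
    return [(int(x), int(y)) for x, y in POINT_RE.findall(rollout)]

def path_tracing_reward(rollout: str, ground_truth, tol=15.0):
    """Toy reward: fraction of ground-truth points matched within tol pixels.
    Ground truth comes from the task generator, never from the rollout, which
    is the simplest guard against a model fabricating its own ground truth."""
    predicted = parse_points(rollout)
    if not predicted:
        return 0.0  # failed the format check
    hits = sum(
        1 for gx, gy in ground_truth
        if any(math.hypot(px - gx, py - gy) <= tol for px, py in predicted)
    )
    return hits / len(ground_truth)

# Example: score a rollout against generator-provided ground truth.
print(path_tracing_reward("trace: (100, 640) (305, 512)",
                          [(102, 638), (300, 515), (500, 400)]))
```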
Benchmarks and blind spots
The paper did publish strong-looking results on the tasks it cared about.
The reported benchmark numbers show DeepSeek ahead of or competitive with other models on counting, spatial reasoning, and topology-style evaluations, and nrehiew_'s caveat post immediately noted that those are specialized benchmarks rather than a proof of broad multimodal competence.
That caveat matters because the task mix was narrow by design:
- Counting and fine-grained counting.
- Spatial reasoning and VQA.
- Maze navigation.
- Path tracing.
Those are exactly the kinds of subproblems that show up in browser and desktop agents, which is why nrehiew_'s computer-use speculation read the paper as a computer-use blueprint. But teortaxesTex's hands-on reaction and the left-right follow-up also flagged that sharper visual grounding does not erase ordinary reasoning errors.
What disappeared
Then DeepSeek yanked the paper trail. teortaxesTex's repo screenshot said the GitHub repository disappeared, and teortaxesTex's deletion post said DeepSeek staff had removed the repo and mentions of the vision paper.
That leaves three concrete facts standing even after the deletion wave:
- The cached repo page existed long enough to state DeepSeek's release plan.
- testingcatalog's product screenshot showed a real Vision beta surface in chat.
- teortaxesTex's retweet of a paper mirror indicates copies of the paper were already circulating once the repo went down.
The awkward bit is reproducibility. The official repo promised future benchmark and data releases, but by the end of the day the public artifact was gone, so the most precise record of DeepSeek's vision stack was living in screenshots, cached pages, and a few very attentive threads.