AI Primer

DeepSeek removes visual-primitives repo after 90-KV vision details

DeepSeek briefly published a paper and threads on point-and-bbox reasoning, about 90 KV entries per 800² image, and RL-trained vision experts, then removed the repo and related mentions. The technique looked like a low-token path to computer use and multimodal reasoning in V4-Flash, but availability and reproducibility are now unclear.


TL;DR

You can still read Exa's cached view of the official GitHub repo, which said DeepSeek planned to release in-house benchmarks, a subset of cold-start data, and eventually integrate the weights into its foundation model. There was already a visible product surface too: testingcatalog's beta screenshot showed a new Vision tab in DeepSeek Chat, while a TechNode report described limited testing on web and app. For background on why the 1M-context angle mattered so much here, Jia-Bin Huang's DeepSeek V4 attention breakdown and AlphaSignal's V4 cache analysis both frame V4 as a deployment-first long-context architecture.

Vision beta

Before the paper appeared, DeepSeek had already started surfacing vision in product. testingcatalog's beta screenshot showed a Vision tab next to Instant and Expert in DeepSeek Chat, and niallohiggins' beta note described image recognition starting to roll out.

The cached official repo said the weights would be integrated into DeepSeek's foundation model rather than shipped as a separate standalone release. That matches the product direction more than a one-off research demo.

Visual primitives

The paper's core move was to make pointing part of thought, not just part of the final answer. In teortaxesTex's capture of the repo intro, DeepSeek described the problem as a "Reference Gap," where language is too ambiguous to track dense spatial layouts.

The mechanism splits into two primitives:

  • Bounding boxes for grounded object references and spatial relations, per nrehiew_'s summary.
  • Point coordinates for path tracing, topology, and fine-grained spatial steps, per the same summary.
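The two primitives can be pictured as tiny data types interleaved into a reasoning trace. This is an illustrative sketch only; the names, types, and trace layout are assumptions, since DeepSeek's actual token format was described in threads but never released.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    """A single (x, y) coordinate emitted mid-reasoning, e.g. one path step."""
    x: int
    y: int

@dataclass(frozen=True)
class BBox:
    """An axis-aligned box grounding an object reference."""
    x0: int
    y0: int
    x1: int
    y1: int

    def contains(self, p: Point) -> bool:
        return self.x0 <= p.x <= self.x1 and self.y0 <= p.y <= self.y1

# A hypothetical trace mixing prose with grounded primitives, so spatial
# claims stay pinned to pixels instead of drifting in language:
trace = [
    "the red mug sits on the shelf",
    BBox(120, 40, 210, 130),                       # grounds "the red mug"
    "trace the cable from the mug to the socket",
    Point(165, 130), Point(180, 260), Point(300, 410),
]
points = [t for t in trace if isinstance(t, Point)]
print(len(points))  # 3 coordinate steps in the trace
```

The point of the frozen dataclasses is only to make the idea concrete: references become checkable geometry ("is this point inside that box?") rather than prose.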

That sounds abstract until you look at the tasks.

One captured example shows the model tracing a line as a sequence of coordinates, not narrating the route in prose first and hoping the language stays anchored.

90 KV entries

The number everybody latched onto was the image budget. teortaxesTex's follow-up said an 800² image lands at about 90 KV cache entries, and scaling01's chart screenshot showed DeepSeek's model at roughly 90 entries versus far larger token footprints for Gemini-3-Flash, Claude-Sonnet-4.6, GPT-5.4, and Qwen3-VL on the same 800×800 input.

At the architecture level, nrehiew_'s architecture note described a patch size 14 ViT encoder plus spatial token compression with window size 9. The repo highlight Exa captured adds one more useful detail: DeepSeek said it compresses every 4 visual tokens into a single KV entry in a system built on V4-Flash.
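One way to sanity-check the quoted budget: if the patch-14 grid, the window-9 compression, and the 4-tokens-per-KV-entry ratio compose multiplicatively (an assumption; the cached repo did not spell out the exact pipeline, and treating window size 9 as a 9x token reduction is this sketch's reading), the arithmetic lands almost exactly on 90.

```python
# Back-of-envelope check on the ~90-entry budget, assuming the quoted
# numbers compose multiplicatively (an assumption, not confirmed):
image_side = 800      # 800x800 input
patch = 14            # ViT patch size, per nrehiew_'s architecture note
window = 9            # spatial token compression, read here as a 9x reduction
tokens_per_kv = 4     # 4 visual tokens -> 1 KV entry, per the cached repo

patch_tokens = (image_side // patch) ** 2     # 57 * 57 = 3249 patch tokens
visual_tokens = patch_tokens / window         # ~361 tokens after compression
kv_entries = visual_tokens / tokens_per_kv    # ~90 KV cache entries
print(round(kv_entries))  # 90
```

That the three quoted numbers multiply out to the headline figure is at least consistent, even if the real pipeline differs in detail.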

This sat on top of a model family already optimized for long contexts. AlphaSignal's V4 cache analysis says V4-Flash gets 1M context with about 7% of V3.2's KV cache, while Jia-Bin Huang's video breakdown walks through the hybrid attention design behind that memory drop. Put together, the vision paper looked like DeepSeek finding a way to bolt image reasoning onto a deployment stack that already cared obsessively about cache size.

RL pipeline

The most useful part of the short-lived release was how much training detail it exposed. nrehiew_'s data summary says most data was web-crawled rather than synthetic, with trillions of image tokens, plus programmatic tasks such as 460K generated mazes and path tracing along Bézier curves.
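Programmatic path-tracing data of that kind is easy to picture with a small generator. This is a sketch in the spirit of the description, not DeepSeek's actual pipeline; the curve parameters and sampling density are arbitrary.

```python
# Sketch of programmatic path-tracing supervision: sample a target label
# (a coordinate sequence) along a cubic Bezier curve.
def cubic_bezier(p0, p1, p2, p3, n=8):
    """Sample n+1 integer points along a cubic Bezier curve from p0 to p3."""
    pts = []
    for i in range(n + 1):
        t = i / n
        u = 1 - t
        x = u**3 * p0[0] + 3*u**2*t * p1[0] + 3*u*t**2 * p2[0] + t**3 * p3[0]
        y = u**3 * p0[1] + 3*u**2*t * p1[1] + 3*u*t**2 * p2[1] + t**3 * p3[1]
        pts.append((round(x), round(y)))
    return pts

# Endpoints of a cubic Bezier are its first and last control points.
path = cubic_bezier((0, 0), (100, 400), (700, 400), (800, 0))
print(path[0], path[-1])  # (0, 0) (800, 0)
```

Because the curve is generated, the ground-truth coordinate sequence is known exactly, which is what makes path-tracing accuracy a cheap, automatable reward signal.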

The post-training recipe breaks into a pretty clean pipeline:

  1. Train two specialists, one box-based and one point-based, according to nrehiew_'s post-training thread.
  2. Run RL on both experts with no extra reasoning supervision, per the same thread.
  3. Score rollouts with task-specific reward models, including format checks, LLM judges, soft counting rewards, maze validity checks, and path-tracing accuracy, per nrehiew_'s reward list.
  4. Distill those expert rollouts back into the base model, then use on-policy distillation because the unified model still lagged the specialists, per nrehiew_'s distillation note.
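A maze-validity check like the one named in step 3 fits in a few lines. The grid encoding, function name, and binary reward here are assumptions, since the actual reward models were not released.

```python
# Hedged sketch of a maze-validity reward: 1.0 if the proposed path stays
# on open cells, moves one 4-adjacent step at a time, and connects start
# to goal; 0.0 otherwise. Grid encoding ('#' wall, '.' open) is assumed.
def maze_reward(grid, path, start, goal):
    """grid: list of strings; path: [(row, col), ...]."""
    if not path or path[0] != start or path[-1] != goal:
        return 0.0
    for (r, c) in path:
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == "#":
            return 0.0  # off the board or through a wall
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return 0.0  # teleporting or diagonal moves disallowed
    return 1.0

grid = ["..#",
        ".##",
        "..."]
good = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(maze_reward(grid, good, (0, 0), (2, 2)))  # 1.0
```

Checks like this pair naturally with the coordinate primitives: because the model emits explicit (row, col) steps, validity is a mechanical test rather than an LLM-judge call.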

One detail worth bookmarking lives inside teortaxesTex's reward-model screenshot: DeepSeek explicitly mentions trying to catch reward hacking, including cases where the model fabricates a fake ground truth to match its own prediction.
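The simplest defense against that hack is to never let the rollout supply its own ground truth. A minimal sketch, with the function name, signature, and binary reward all assumed rather than taken from DeepSeek:

```python
# Illustrative guard against the fabricated-ground-truth hack described in
# the screenshot: score only against the dataset label, and zero out any
# rollout that asserts a conflicting "ground truth" of its own.
def judge_counting(rollout_answer: int, rollout_claimed_gt, dataset_gt: int) -> float:
    if rollout_claimed_gt is not None and rollout_claimed_gt != dataset_gt:
        return 0.0  # model invented a ground truth to match its own prediction
    return 1.0 if rollout_answer == dataset_gt else 0.0

# An honest wrong answer scores 0; a hacked "matching" ground truth also scores 0.
print(judge_counting(7, None, 9), judge_counting(7, 7, 9))  # 0.0 0.0
```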

Benchmarks and blind spots

The paper did publish strong-looking results on the tasks it cared about.

The reported results show DeepSeek ahead of or competitive with rivals on counting, spatial reasoning, and topology-style evaluations, though nrehiew_'s caveat post immediately noted that those are specialized benchmarks rather than proof of broad multimodal competence.

That caveat matters because the task mix was narrow by design:

  • Counting and fine-grained counting.
  • Spatial reasoning and VQA.
  • Maze navigation.
  • Path tracing.

Those are exactly the kinds of subproblems that show up in browser and desktop agents, which is why nrehiew_'s computer-use speculation read the paper as a computer-use blueprint. But teortaxesTex's hands-on reaction and the left-right follow-up also flagged that sharper visual grounding does not erase ordinary reasoning errors.

What disappeared

Then DeepSeek yanked the paper trail. teortaxesTex's repo screenshot said the GitHub repository disappeared, and teortaxesTex's deletion post said DeepSeek staff had removed the repo and mentions of the vision paper.

Even after the deletion wave, a handful of concrete facts still stand.

The awkward bit is reproducibility. The official repo promised future benchmark and data releases, but by the end of the day the public artifact was gone, so the most precise record of DeepSeek's vision stack was living in screenshots, cached pages, and a few very attentive threads.
