Skip to content
AI Primer
workflow

V4-Pro users report 700K-context misses in multi-document retrieval tests

Two practitioner writeups found long-context prompts still missed multi-document questions, so retrieval stayed in the loop while reranking got looser. Keep RAG in place and test top-50 recall into V4-Pro instead of assuming longer context replaces retrieval.

4 min read
V4-Pro users report 700K-context misses in multi-document retrieval tests
V4-Pro users report 700K-context misses in multi-document retrieval tests

TL;DR

  • Deannaoliver's LLMDevs post reports that a 700K-token V4-Pro prompt handled single-fact lookups but missed one of three documents on a multi-document outage comparison, matching the long-context drop shown in DeepSeek's V4 announcement.
  • The recovery pattern in that same LLMDevs writeup was not "delete RAG," but keep retrieval, widen recall to top-50 chunks, and let V4-Pro do more of the final filtering inside the prompt.
  • A smaller practitioner example from Low_Edge7695 on Reddit found that three lines of cross-encoder reranking plus a score threshold moved average retrieval relevance from -0.28 to +3.80 across 10 queries.
  • the HN discussion summary and the main HN thread both point to the same split in early usage: V4-Flash is getting the "cheap and fast" praise, while Pro is the model people test for long-context and coding-agent workloads.
  • Multi-turn API behavior also changed, because Deannaoliver's migration note says V4-Pro requires reasoning_content to be passed back on later turns, a wrapper bug that caused 400 errors in LiteLLM and Roo Code.

You can read DeepSeek's launch post, skim the main HN thread, and compare two very different practitioner failures: a 700K-token long-context miss and a tiny RAG pipeline that fixed hallucinations by reranking harder. The useful weird bit is that both end up at the same place: retrieval stays, but the strict reranker is getting demoted. Another buried gotcha is the reasoning_content turn-state requirement, which surfaced first in the LLMDevs post rather than the launch chatter.

700K context

r/LLMDevs

We tried deleting our RAG pipeline after V4-Pro shipped. Two weeks later we put most of it back.

11 comments

Y
Hacker News

DeepSeek v4

2.1k upvotes · 1.6k comments

According to Deannaoliver's LLMDevs writeup, V4-Pro was fine on direct lookups over a roughly 3M-token internal corpus, then failed when the query asked for a comparison across multiple documents and a new recommendation. The miss was not random, because the post ties it to DeepSeek's own long-context chart in DeepSeek's announcement, which says MRCR accuracy stays above 0.82 through 256K tokens and drops to 0.59 at 1M.

That lines up with the practical claim in the same post that a 700K-token prompt sits inside the advertised window but already past the useful range for multi-hop work. impact_sy's HN summary also frames V4 as a model engineers are testing in long-context and agentic workflows, which makes this gap between headline context length and usable retrieval behavior the real story here.

Top-50 recall

r/AI_Agents

Your RAG is hallucinating because of garbage retrieval — here's the 3-line fix (with real scores)

13 comments

r/LLMDevs

We tried deleting our RAG pipeline after V4-Pro shipped. Two weeks later we put most of it back.

11 comments

The two practitioner posts disagree on scale, but not on failure mode. In Low_Edge7695's smaller RAG example, the model hallucinated because noisy chunks made it into context, and a cross-encoder threshold of 1.5 fixed the run by filtering out low-score results. In Deannaoliver's larger V4-Pro test, the opposite problem showed up: strict filtering threw away recall too early, so the replacement was retrieve top-50, skip the strict reranker, and let V4-Pro sort through a broader set.

The useful structure from those two posts is:

  1. Retrieval still decides whether the right evidence enters the prompt at all, per Deannaoliver's conclusion.
  2. Reranking still helps when the candidate set is noisy, per Low_Edge7695's before-and-after scores.
  3. Long context changes the tradeoff, because the V4-Pro test argues the reranker is now the most negotiable layer, not retrieval itself.

That is close to the HN practitioner split in the discussion summary, where Flash gets praise for speed and availability while Pro gets used for larger-context, more agentic runs. The model window got bigger, but the retrieval tax did not disappear.

Cache economics

r/LLMDevs

We tried deleting our RAG pipeline after V4-Pro shipped. Two weeks later we put most of it back.

11 comments

One of the sharper details in Deannaoliver's post is cost behavior under caching. A cache-miss prefill on a 700K-token V4-Pro prompt was reported at $0.305 per query, but repeated queries on the same prefix dropped to $0.0025 with a 92% hit rate after warmup.

That produces a very different result depending on workload shape. In the same post, long context was cheaper than RAG when most of the prompt prefix stayed stable across calls, and more expensive when context changed every query and forced full prefill again.

reasoning_content

r/LLMDevs

We tried deleting our RAG pipeline after V4-Pro shipped. Two weeks later we put most of it back.

11 comments

The migration gotcha that kept surfacing in comments was not retrieval at all. Deannaoliver's post says V4-Pro requires reasoning_content to be sent back on each subsequent turn in thinking mode, while R1 rejected it, so wrappers that stripped reasoning blocks caused 400 errors.

That matters because the bug showed up inside normal OpenAI-compatible tooling rather than a custom DeepSeek stack. the LLMDevs report names LiteLLM issue #26395 and Roo Code issue #12177, and says the failure mode looked like a generic bad request until the reasoning block was preserved.

Further reading

Discussion across the web

Where this story is being discussed, in original context.

Share on X