Skip to content
AI Primer
breaking

Together AI ranks DeepSeek V4 Pro #1 on Artificial Analysis latency and speed

Together AI said its DeepSeek V4 Pro deployment now leads Artificial Analysis on both output speed and latency. The claim matters because it turns V4 serving into an inference-systems story about KV cache reuse, prefix reuse, kernels, and endpoint profiles rather than model weights alone.

4 min read
Together AI ranks DeepSeek V4 Pro #1 on Artificial Analysis latency and speed
Together AI ranks DeepSeek V4 Pro #1 on Artificial Analysis latency and speed

TL;DR

  • togethercompute's post said Together AI's DeepSeek V4 Pro deployment now sits at the top of Artificial Analysis for both output speed and latency.
  • Artificial Analysis' provider page showed the same tradeoff table: Together at 171.3 output tokens per second, ahead of Lightning AI at 147.5 and Fireworks at 111.9, while DeepSeek itself remained the cheapest provider by blended price.
  • In Together's serving breakdown, the company framed V4 serving as a systems problem, not just a weights problem, with KV cache policy, prefix reuse, kernels, and endpoint selection all moving the result.
  • Together's launch post and quickstart add the deployment details the ranking tweet skipped: 512K context on Together, three reasoning modes, and cached-input pricing at $0.20 per 1M tokens.

You can open the Artificial Analysis provider table, read Together's serving writeup, and compare that with the quickstart caveat that the first 256K tokens are more reliable than the back half of the 512K window. The interesting bit is that the victory lap tweet names four tuning levers, while the docs quietly show the product knobs those optimizations are backing: reasoning effort, streaming, tool calling, and very cheap cached context.

Artificial Analysis

Artificial Analysis' provider page ranked Together first on both output speed and latency for DeepSeek V4 Pro. The page lists Together at 171.3 tokens per second, versus 147.5 for Lightning AI and 111.9 for Fireworks.

The same table keeps the win narrow and specific. DeepSeek's own endpoint was still the price leader by blended cost at $0.18 per 1M tokens, while the provider spread across the field reached roughly 10.9x.

Inference stack

Together's serving breakdown explains why the benchmark moved. DeepSeek V4's hybrid attention compresses context before KV storage, but Together argues the real gains only show up if the serving engine can manage cache layout, local-state recovery, batching, and endpoint profiles.

The post breaks the work into four layers, matching the shorthand in togethercompute's post:

  • KV cache policy: on HGX B200, an SWA-aware policy raised total KV-cache capacity from about 1.2M tokens to about 3.7M tokens.
  • Prefix reuse: V4 turns reuse into a policy problem across CSA, HCA, SWA, and uncompressed local state, rather than a simple single-cache hit.
  • Kernels: Together says new attention kernels are only part of the story, because decode still depends on memory bandwidth and long-context batching behavior.
  • Endpoint profiles: the company describes separate operating points for throughput-heavy and latency-sensitive traffic, instead of one universal best configuration.

That is catnip for inference nerds. The model ranking is being claimed as a systems benchmark as much as a model benchmark.

Product knobs

Together's April launch post fills in the endpoint details behind the leaderboard:

  • deepseek-ai/DeepSeek-V4-Pro
  • 512K context on Together, even though the model family advertises 1M-token context
  • Three reasoning modes: Non-Think, Think High, Think Max
  • Pricing: $2.10 per 1M input tokens, $4.40 per 1M output tokens, $0.20 per 1M cached input tokens
  • Deployment on serverless and reserved infrastructure

The quickstart doc adds two practical notes missing from the posts above: reasoning is on by default, and streaming is recommended because reasoning output can run long.

Context caveat

Together's quickstart also puts a limit on how literally to read the long-context headline. It says retrieval quality is not uniform across the 512K window, and that the first 256K tokens are more reliable than later tokens.

That is new context for the speed claim. Together may have the fastest DeepSeek V4 Pro endpoint on the public benchmark, but its own docs still treat the back half of the context window as a caveat, not a solved problem.

Further reading

Discussion across the web

Where this story is being discussed, in original context.

On X· 3 threads
TL;DR1 post
Artificial Analysis1 post
Inference stack3 posts
Share on X