AI Primer
TOPIC · 25 stories

Cost Optimization

Reducing inference spend and improving unit economics.

RELEASE · 13th May
Nous Research releases TST with 2-3x pretraining speedup at matched FLOPs

Nous Research introduced Token Superposition Training (TST), which trains on superposed "bags" of tokens early in pretraining before switching back to standard next-token prediction. The team says TST cuts wall-clock training time by 2-3x at matched FLOPs while leaving the deployed model unchanged.

NEWS · 10th May
Local users report DeepSeek V4 Flash, Qwen 3.6, and Gemma 4 at 40-200 tok/s on Macs and 3090s

Developers posted new local-model measurements for DeepSeek V4 Flash, Qwen 3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.

RELEASE · 9th May
OpenRouter launches Pareto Code with min_coding_score tiers and Nitro routing

OpenRouter released Pareto Code, which routes requests to the cheapest coding model above a chosen score threshold and can re-rank for speed with Nitro. Use the API to trade cost against latency with benchmark-based routing controls.
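
The routing controls described here can be sketched as a request body. This is a minimal sketch assuming an OpenAI-style chat schema; the `pareto-code` alias and the exact parameter names (`min_coding_score`, `nitro`) are assumptions based on the announcement, not a confirmed API shape.

```python
# Sketch of an OpenRouter-style request using the routing controls above.
# The routing alias and parameter names are assumptions, not a documented schema.
import json

def pareto_code_request(prompt: str, min_coding_score: float, nitro: bool = False) -> dict:
    """Build a request body that routes to the cheapest coding model above
    `min_coding_score`, optionally re-ranked for speed with Nitro."""
    return {
        "model": "openrouter/pareto-code",         # hypothetical routing alias
        "route": {
            "min_coding_score": min_coding_score,  # benchmark score floor
            "nitro": nitro,                        # re-rank candidates by speed
        },
        "messages": [{"role": "user", "content": prompt}],
    }

body = pareto_code_request("Write a binary search in Go.", min_coding_score=60.0, nitro=True)
print(json.dumps(body, indent=2))
```

Raising the score floor shrinks the candidate pool toward pricier models, so the threshold is the main cost-versus-quality dial.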

RELEASE · 1w ago
vLLM 0.20.1 fixes DeepSeek V4 TopK deadlocks and tool-call errors

The vLLM team shipped more than 10 DeepSeek V4 fixes as developers kept posting V4 Pro and Flash results from coding harnesses and local servers. Use the update if serving bugs, cache behavior, or tool-call reliability are blocking cheaper long-context agent runs.

NEWS · 1w ago
Developers report DeepSeek V4 Flash handles 32M-token coding runs for $0.25

Users reported moving long coding sessions from Claude to DeepSeek V4 Flash and seeing tens of millions of tokens cost only cents. Hacker News discussion also leaned toward Flash over Pro for day-to-day use, so teams should test whether the low published prices hold in their own workflows.

RELEASE · 2w ago
IBM releases Granite 4.1 30B/8B/3B open models under Apache 2.0

IBM released Granite 4.1 as three open instruct models under Apache 2.0, with third parties quickly publishing token-efficiency comparisons and deployment options. The update matters for teams evaluating smaller open models for agent workloads, where output-token burn and license openness both affect production cost.

NEWS · 2w ago
DeepSeek cuts input cache-hit price 90% to $0.003625 per 1M tokens

DeepSeek said cache-hit pricing across its API series is now one-tenth of launch levels, on top of the temporary V4-Pro discount through May 5. The cut lowers costs for cache-heavy long-context and agent workloads, so teams should recheck spend assumptions.
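
The arithmetic is easy to sanity-check from the announced number alone. In the sketch below, only the $0.003625 cache-hit rate comes from the announcement; the cache-miss price is a hypothetical placeholder.

```python
# Back-of-envelope input-cost math for the new cache-hit pricing.
# CACHE_HIT_PER_M is the announced rate; the miss price is an assumption.
CACHE_HIT_PER_M = 0.003625           # USD per 1M cache-hit input tokens (announced)
LAUNCH_PER_M = CACHE_HIT_PER_M * 10  # implied launch-level price (one-tenth claim)

def input_cost(tokens: int, hit_rate: float, miss_per_m: float) -> float:
    """Input cost in USD for a workload with the given cache-hit rate."""
    hits = tokens * hit_rate
    misses = tokens - hits
    return hits / 1e6 * CACHE_HIT_PER_M + misses / 1e6 * miss_per_m

# A 32M-token agent run at a 90% hit rate, assuming a hypothetical
# $0.07 per 1M cache-miss price:
print(round(input_cost(32_000_000, 0.9, 0.07), 3))
```

At a 100% hit rate, 32M input tokens cost about $0.12, which is in the same ballpark as the "32M tokens for $0.25" reports elsewhere on this page once misses and output tokens are added.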

NEWS · 4w ago
Claude Code raises Opus 4.7 subscriber limits after token burn increases

Anthropic raised Claude subscriber limits and shipped Claude Code 2.1.112 after Opus 4.7's adaptive thinking and tokenizer changes increased token use. Users still report fast quota depletion and inconsistent cache or effort behavior across web and CLI sessions.

RELEASE · 4w ago
Claude Code updates desktop app with side-by-side sessions and integrated terminal

Anthropic rebuilt Claude Code on desktop into a drag-and-drop multi-session workspace with file editing, HTML and PDF preview, and sidebar session management. The same rollout also shipped 2.1.108 features, including an optional 1-hour cache TTL, recap, and new built-ins that affect cost and session handoff.
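
The longer cache TTL is opted into per content block. Below is a minimal sketch of the request shape following Anthropic's documented prompt-caching format; the model id is a placeholder, and you should confirm which beta header your API version requires.

```python
# Sketch of a Messages API request opting into a 1-hour cache TTL on the
# system prompt. The cache_control shape follows Anthropic's prompt-caching
# format; the model id is a placeholder.
import json

def cached_system_block(text: str, ttl: str = "1h") -> dict:
    """A system content block marked for caching with an extended TTL."""
    return {
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral", "ttl": ttl},
    }

request = {
    "model": "claude-opus",  # placeholder model id
    "max_tokens": 1024,
    "system": [cached_system_block("You are a code-review assistant. " * 50)],
    "messages": [{"role": "user", "content": "Review this diff."}],
}
print(json.dumps(request)[:80])
```

The longer TTL only pays off when the cached prefix is reused within the window, so it suits long sessions with a stable system prompt.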

NEWS · 4w ago
Claude Code users report a 5-minute cache TTL and 5x Pro Max quota burn in 1.5 hours

Anthropic acknowledged a March 6 cache optimization change, and Pro Max users report that the shorter TTL plus hidden session context now burns through Claude Code quota much faster. Watch for 500 errors and stalled streams, and apply the 2.1.105 patch if your UI hangs.

NEWS · 1mo ago
OpenAI launches $100 ChatGPT Pro tier with 5x more Codex usage

OpenAI added a $100 ChatGPT Pro tier with 5x more Codex usage than Plus and kept the $200 tier as the highest-capacity option. The new tier resets Codex limits again and temporarily doubles Pro usage through May 31.

RELEASE · 1mo ago
Anthropic adds beta advisor tool to Messages API for Opus calls

Anthropic added a beta advisor tool to the Messages API so Sonnet or Haiku can call Opus mid-run inside a single request. Anthropic says Sonnet plus Opus scored 2.7 points higher on SWE-bench Multilingual while cutting per-task cost by 11.9%.
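
As a rough illustration of the mechanics, a request might declare the advisor as a tool entry on the cheaper model. Everything below, including the `advisor` tool type and its fields, is invented for illustration; Anthropic's real beta schema may differ.

```python
# Hypothetical sketch of an advisor-tool request: a cheaper model is the
# main model, and the tool entry names a stronger model it may consult
# mid-run. Tool type and field names are invented for illustration.
def with_advisor(model: str, advisor_model: str, prompt: str) -> dict:
    return {
        "model": model,               # cheap model that drives the run
        "max_tokens": 2048,
        "tools": [
            {
                "type": "advisor",    # assumed beta tool type
                "model": advisor_model,  # stronger model consulted mid-run
            }
        ],
        "messages": [{"role": "user", "content": prompt}],
    }

req = with_advisor("claude-sonnet", "claude-opus", "Fix the failing test in repo X.")
```

The cost logic is that Opus tokens are only paid for on the turns where the cheaper model actually asks for help.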

RELEASE · 1mo ago
Google releases Veo 3.1 Lite in Gemini API at $0.05 per second

Google released Veo 3.1 Lite in Gemini API and AI Studio with 720p and 1080p output, 4-8 second clips, and text-to-video plus image-to-video support. Watch the April 7 Veo 3.1 Fast pricing drop if you need lower video generation costs.
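
At a flat per-second rate, clip cost is simple multiplication within the supported duration range:

```python
# Cost math from the announced $0.05/second rate for Veo 3.1 Lite.
PER_SECOND = 0.05  # USD per second of generated video (announced)

def clip_cost(seconds: int) -> float:
    """Cost of one clip; the API supports 4-8 second clips."""
    assert 4 <= seconds <= 8, "Veo 3.1 Lite clips are 4-8 seconds"
    return seconds * PER_SECOND

print(clip_cost(4), clip_cost(8))  # shortest and longest clip costs
```

So a batch of one hundred 8-second clips runs about $40 at this rate, before any Fast-tier discount.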

NEWS · 1mo ago
TurboQuant updates 2.5-bit mixed precision with PyTorch and llama.cpp ports

New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and working PyTorch and llama.cpp implementations. The technique is moving from a research claim into deployable KV-cache compression with concrete porting details.
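
A quick memory calculation shows why low-bit KV-cache quantization matters at long context; the model dimensions below are illustrative, not any specific model's config.

```python
# Rough KV-cache sizing: K and V are stored per layer, per KV head, per
# head dimension, per token. Dimensions here are illustrative only.
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Total K+V cache size in GiB for one sequence."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # K and V tensors
    return values * bits_per_value / 8 / 2**30

fp16 = kv_cache_gib(60, 8, 128, 128_000, 16)   # fp16 baseline
q25  = kv_cache_gib(60, 8, 128, 128_000, 2.5)  # 2.5-bit mixed precision
print(round(fp16, 2), round(q25, 2), round(fp16 / q25, 1))
```

The 16/2.5 ratio gives a fixed 6.4x reduction, which is what turns 128K-token contexts from tens of GiB of cache into single digits.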

NEWS · 1mo ago
MiniMax introduces Token Plan for flat-rate text, speech, music, video, and image APIs

MiniMax introduced a flat-rate Token Plan that covers text, speech, music, video, and image APIs under one subscription. It gives teams one predictable bill across modalities and can be used in third-party harnesses, not just MiniMax apps.

NEWS · 1mo ago
MiniMax M2.7 ranks #5 on PinchBench at $0.30 per million input tokens

Kilo said MiniMax M2.7 placed fifth on PinchBench, 1.2 points behind Opus 4.6 at much lower input cost, while community tests showed strong multi-loop agent behavior on graphics tasks. If you route coding-agent traffic by price, M2.7 looks worth a controlled bake-off.

RELEASE · 1mo ago
Imbue releases Offload to split Playwright runs across 200 Modal sandboxes

Imbue open-sourced Offload, a Rust CLI that spreads test suites across local or Modal sandboxes from one TOML config. It is useful when agent-heavy teams are bottlenecked on verification instead of generation, especially in browser or CI-heavy stacks.
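
A hedged sketch of what a single-file config for this kind of sharded test runner might look like; every key below is invented for illustration and is not Offload's documented schema.

```toml
# Hypothetical Offload-style config: all keys are invented for
# illustration; consult the project's README for the real schema.
[suite]
command = "npx playwright test"   # test runner to shard

[backend]
provider = "modal"                # or "local"
sandboxes = 200                   # parallel sandboxes to spread shards across
timeout_seconds = 600
```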

RELEASE · 1mo ago
Unsloth releases Studio: local training UI for 500+ models with 70% less VRAM

Unsloth Studio launched as an open-source web UI to run, fine-tune, compare, and export local models, with file-to-dataset workflows and sandboxed code execution. Try it if you want to move prototype training and evaluation off cloud notebooks and onto local or rented boxes.

RELEASE · 1mo ago
Hankweave adds runtime budgets for dollars, tokens, and wall-clock limits

Hankweave shipped budget controls that cap spend, tokens, and elapsed time globally or per step, including loop budgets and shared pools. Use them to prototype or productionize long agent runs without hand-managing model switches and failure states.
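
The budget-pool idea generalizes beyond any one framework. Below is a generic sketch with invented class and method names; Hankweave's real API may differ.

```python
# Generic runtime-budget sketch: cap dollars, tokens, and wall-clock time
# in one shared pool that multiple agent steps draw from. Names are
# invented for illustration, not Hankweave's actual API.
import time

class Budget:
    def __init__(self, max_usd=None, max_tokens=None, max_seconds=None):
        self.max_usd, self.max_tokens, self.max_seconds = max_usd, max_tokens, max_seconds
        self.usd = 0.0
        self.tokens = 0
        self.start = time.monotonic()

    def charge(self, usd=0.0, tokens=0):
        """Record spend; return False once any limit is exhausted."""
        self.usd += usd
        self.tokens += tokens
        if self.max_usd is not None and self.usd > self.max_usd:
            return False
        if self.max_tokens is not None and self.tokens > self.max_tokens:
            return False
        if self.max_seconds is not None and time.monotonic() - self.start > self.max_seconds:
            return False
        return True

pool = Budget(max_usd=5.0, max_tokens=200_000)
ok = pool.charge(usd=0.75, tokens=30_000)  # one agent step stays in budget
```

The agent loop checks the return value after each step and falls back to a cheaper model or stops cleanly once the pool is drained.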

WORKFLOW · 2mo ago
oMLX supports Claude Code locally with tiered KV cache and Anthropic Messages API

oMLX now supports local Claude Code setups on Apple Silicon with a tiered KV cache and an Anthropic Messages API-compatible endpoint; one setup reports roughly 10x faster serving than mlx_lm-style alternatives. For private on-device coding agents, point Claude Code at a local compatible endpoint and disable the attribution header to preserve cache reuse.
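
Pointing a client at a local server mostly means swapping the base URL while keeping the Messages API request shape. A sketch assuming a local server on port 8080; the port and model id are assumptions, while the `/v1/messages` path and header names follow the standard Messages API shape.

```python
# Sketch of targeting a local Anthropic Messages-API-compatible endpoint
# (such as an oMLX server) instead of the hosted API. Port and model id
# are assumptions about the local setup.
BASE_URL = "http://localhost:8080"  # local endpoint (assumed port)

def messages_request(prompt: str) -> tuple[str, dict, dict]:
    """Return the URL, headers, and JSON body for a local Messages call."""
    url = f"{BASE_URL}/v1/messages"
    headers = {
        "x-api-key": "local",            # placeholder; local servers often ignore it
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    body = {
        "model": "local-model",          # whatever model the local server exposes
        "max_tokens": 512,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, body

url, headers, body = messages_request("Refactor this function.")
```

Anthropic's SDKs also honor a base-URL override, so the same redirection works without hand-building requests.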

NEWS · 2mo ago
Tiiny claims pocket AI server runs local 120B models with an OpenAI-compatible API

Tiiny claims its pocket-sized local AI server can run open models up to 120B and expose an OpenAI-compatible local API without token fees. Privacy-sensitive teams should validate throughput and model quality before deploying always-on local agents.

NEWS · 2mo ago
Ollama updates cloud to NVIDIA B300 for Kimi K2.5 and GLM-5 on $0, $20, and $100 plans

Ollama says its cloud now runs Kimi K2.5 and GLM-5 on NVIDIA B300 hardware while keeping fixed $0, $20, and $100 plans. Try it if you want hosted open models with more predictable spend for always-on agent workloads.

NEWS · 2mo ago
Epoch AI reports top chip designers used about 90% of HBM and CoWoS supply in 2025

Epoch AI estimates that NVIDIA, Google, AMD, and Amazon consumed nearly all high-bandwidth memory and advanced packaging tied to frontier AI chips in 2025. Track this if you are planning compute, custom silicon, or open-weight infrastructure strategy.

NEWS · 2mo ago
Google adds Gemini API spend caps in AI Studio with project-level dollar limits

Google AI Studio now lets developers set experimental per-project spend caps for Gemini API usage. Use it as a native billing guardrail, but account for roughly 10-minute enforcement lag and possible batch-job overshoot.
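
Given the enforcement lag, a client-side pre-check is a cheap complement to the native cap. A generic sketch follows; this is not a Google API, and the class here is invented for illustration.

```python
# Client-side spend guardrail: stop issuing requests once estimated spend
# reaches a project limit, covering the window before the server-side cap
# kicks in. Invented helper, not part of any Google SDK.
class SpendCap:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent = 0.0

    def allow(self, est_cost_usd: float) -> bool:
        """Reserve budget for a request; False means the cap would be exceeded."""
        if self.spent + est_cost_usd > self.limit_usd:
            return False
        self.spent += est_cost_usd
        return True

cap = SpendCap(limit_usd=50.0)
allowed = [cap.allow(7.5) for _ in range(8)]  # 8 requests at ~$7.50 each
```

Checking before each call bounds overshoot to a single request, versus the roughly 10-minute window the native cap leaves open.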

RELEASE · 2mo ago
Hugging Face launches Storage Buckets for mutable checkpoints, logs, and agent traces

Hugging Face introduced Storage Buckets, a mutable S3-like repo type for checkpoints, processed data, logs, and traces that do not fit Git workflows. Use it to move overwrite-heavy or high-volume artifacts out of versioned repos without leaving the Hub.

AI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.