AI Primer
TOPIC 40 stories

Cost Optimization

Reducing inference spend and improving unit economics.

RELEASE 27th June
Junior adds memory and cuts one analytics task from 3 min to 1 min

Junior’s first memory system cut one analytics task from about 3 minutes to 1 minute in early tests, with tokens down two-thirds and tool calls down 60%. The feature moves persistent task learning into the agent loop, though the results are still internal.

RELEASE 23rd June
Kilo Code launches Auto Efficient routing with KiloBench model selection

Kilo Code added an Auto Efficient mode that routes each request to the cheapest model that clears its benchmark bar using public KiloBench results. The router stays session-aware and falls back to stronger paid models when confidence is low.
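The routing rule described above (cheapest model that clears a benchmark bar, with a fallback to a stronger model when confidence is low) can be sketched roughly as follows. This is not Kilo Code's implementation: the `Model` fields, the catalog entries, the scores, and the `min_confidence` threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_mtok: float  # hypothetical blended USD per million tokens
    bench_score: float    # hypothetical benchmark score on a 0-100 scale


def pick_model(models, min_score, confidence, min_confidence=0.5):
    """Return the cheapest model whose score clears min_score.

    Falls back to the highest-scoring model when the router's confidence
    is below min_confidence or when no model clears the bar.
    """
    strongest = max(models, key=lambda m: m.bench_score)
    if confidence < min_confidence:
        return strongest
    eligible = [m for m in models if m.bench_score >= min_score]
    if not eligible:
        return strongest
    return min(eligible, key=lambda m: m.cost_per_mtok)


# Hypothetical catalog; names, prices, and scores are made up.
CATALOG = [
    Model("small", 0.30, 62.0),
    Model("medium", 1.20, 74.0),
    Model("large", 5.00, 88.0),
]

choice = pick_model(CATALOG, min_score=70.0, confidence=0.9)
print(choice.name)
```

A real router would also need the session awareness the story mentions, which this sketch leaves out.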

NEWS 22nd June
GLM-5.2 adds Perplexity Agent API and Droid support on Baseten at >280 TPS

GLM-5.2 gained Perplexity Agent API and Droid support along with more hosting options, while Baseten reported over 280 TPS and sub-0.8s TTFT. Builders should watch the cost and benchmark data as it moves into production agent stacks.

NEWS 1w ago
Wafer claims GLM-5.2 hits 222 tok/s and 12.6s end-to-end

Wafer said its GLM-5.2 deployment leads Artificial Analysis on throughput and latency, and priced usage at $1.20 input and $4.10 output per million tokens. Compare serverless and dedicated endpoints if you need speed at scale.
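To see what the quoted per-million-token prices mean for a single request, a small cost calculation helps. The sketch below uses the $1.20 input and $4.10 output figures from the story; the token counts in the example are hypothetical.

```python
INPUT_USD_PER_MTOK = 1.20   # quoted input price per million tokens
OUTPUT_USD_PER_MTOK = 4.10  # quoted output price per million tokens


def request_cost(input_tokens, output_tokens):
    """Return the USD cost of one request at the quoted per-million prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical agent step: 20,000 prompt tokens and 2,000 completion tokens.
cost = request_cost(20_000, 2_000)
print(f"${cost:.4f} per request")
```

Multiplying the per-request figure by expected daily volume gives a first estimate to compare against a dedicated endpoint's fixed price.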

RELEASE 1w ago
Batchwork launches a unified batch API for 7 AI providers

Batchwork launched a wrapper that normalizes batch submission, polling, and result handling across seven AI providers. It turns provider-specific async batch formats into one interface for evals, migrations, and large offline jobs.

NEWS 3w ago
OpenRouter adds cache-hit pricing telemetry as Devin exposes adaptive routing

Vendors pushed routing and spend controls closer to the default app layer, including OpenRouter's cache-hit pricing telemetry and Devin's adaptive routing. The discussion frames model choice more as a budget-control problem than a pure quality setting.

NEWS 3w ago
Uber caps AI coding-tool spend at $1,500 per employee per tool each month

Uber set a $1,500 monthly limit for each AI coding tool an employee uses, covering products such as Cursor and Claude Code. The cap gives enterprises an early benchmark for coding-agent spend as token costs outgrow typical software-seat budgets.

RELEASE 3w ago
OpenRouter launches Pareto Code with min_coding_score and 1B routed tokens per day

OpenRouter launched Pareto Code, a free experimental coding router that filters by min_coding_score and says it is already handling about 1 billion tokens a day. The release adds a tunable routing path for coding workloads where cost and model quality need to be balanced.

RELEASE 3w ago
Factory introduces Router with up to 25% lower AI spend at 99% of Opus 4.7's Terminal-Bench 2 score

Factory put Router into private preview in its CLI and desktop app to route coding tasks across models, claiming 20-25% lower spend. The launch targets rising agent costs, though session continuity and routing behavior remain active points of debate.

RELEASE 4w ago
Firecrawl launches /monitor webhooks with up to 90% lower token use

Firecrawl launched /monitor, a URL watcher that only pings agents when tracked pages actually change and can send results by webhook. Use it for change-only ingestion to cut LLM token spend on monitored pages.
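The idea behind change-only ingestion can be illustrated with a content hash: only pages whose hash differs from the last seen value are passed to the model. The sketch below is a generic technique, not Firecrawl's API; the `ChangeGate` class and its method names are hypothetical.

```python
import hashlib


class ChangeGate:
    """Remember a digest per URL and report whether new content differs."""

    def __init__(self):
        self._seen = {}  # url -> hex digest of the last content seen

    def changed(self, url, content):
        """Return True the first time a URL is seen or when its content changes."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._seen.get(url) == digest:
            return False
        self._seen[url] = digest
        return True


gate = ChangeGate()
pages = [
    ("https://example.com/pricing", "Plan A: $10"),
    ("https://example.com/pricing", "Plan A: $10"),  # unchanged, skipped
    ("https://example.com/pricing", "Plan A: $12"),  # changed, forwarded
]
for url, body in pages:
    if gate.changed(url, body):
        print("send to LLM:", body)
```

Pages with timestamps or rotating ads would need normalization before hashing, otherwise every fetch looks like a change.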

NEWS 4w ago
Ramp reports business AI token spend at 13x January 2025 levels

Ramp data and operator reports said enterprise AI token spending is rising far faster than budget controls and procurement cycles. Teams should plan for routing, cheaper defaults, and spend caps to become core engineering infrastructure.

NEWS 1mo ago
DeepSeek cuts V4 Pro pricing 75% to $0.435 input and $0.87 output

DeepSeek made the temporary 75% V4 Pro discount permanent, cutting first-party pricing to $0.435 per million input tokens and $0.87 per million output tokens. Artificial Analysis now places it on the cost-performance frontier, but practitioners still question per-task efficiency on harder coding work.

NEWS 1mo ago
Turbopuffer reports $100M run-rate and a 95% Cursor code-search cost cut

Turbopuffer said it crossed a $100M run-rate while staying profitable on less than $1M raised, and said Cursor moved production search onto the stack with a 95% cost reduction. The milestone matters because AI products increasingly compete on retrieval quality and cost, not just model output.

RELEASE 1mo ago
OpenUI launches OpenUI Lang with 67% fewer tokens than JSON

OpenUI open-sourced a generative UI framework that streams OpenUI Lang instead of JSON, claiming 67% fewer tokens and 3x faster rendering across seven scenarios. Use the renderer only with registered components and typed contracts to keep execution risk down.

RELEASE 1mo ago
Together AI launches Gemma-4-31B-it-Pearl endpoint with 25%+ discounted pricing

Together AI launched Gemma-4-31B-it-Pearl as a serverless endpoint that uses Pearl's proof-of-useful-work emissions to offset inference cost. It matters because the pricing model ties serving economics to compute-side byproducts instead of token billing alone.

RELEASE 1mo ago
Nous Research releases TST with 2-3x pretraining speedup at matched FLOPs

Nous Research introduced Token Superposition Training, which bags tokens early in pretraining before returning to next-token prediction. The team says TST cuts wall-clock training 2-3x at matched FLOPs while leaving the deployed model unchanged.

NEWS 1mo ago
Local users report DeepSeek V4 Flash, Qwen 3.6, and Gemma 4 at 40-200 tok/s on Macs and 3090s

Developers posted new local-model measurements for DeepSeek V4 Flash, Qwen 3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.

RELEASE 1mo ago
OpenRouter launches Pareto Code with min_coding_score tiers and Nitro routing

OpenRouter released Pareto Code, which routes requests to the cheapest coding model above a chosen score threshold and can re-rank for speed with Nitro. Use the API to trade cost against latency with benchmark-based routing controls.

RELEASE 1mo ago
vLLM 0.20.1 fixes DeepSeek V4 TopK deadlocks and tool-call errors

The vLLM team shipped more than 10 DeepSeek V4 fixes as developers kept posting V4 Pro and Flash results from coding harnesses and local servers. Use the update if serving bugs, cache behavior, or tool-call reliability are blocking cheaper long-context agent runs.

NEWS 1mo ago
Developers report DeepSeek V4 Flash handles 32M-token coding runs for $0.25

Users reported moving long coding sessions from Claude to DeepSeek V4 Flash and seeing tens of millions of tokens cost only cents. Hacker News discussion also leaned toward Flash over Pro for day-to-day use, so teams should test whether the low published prices hold in their own workflows.

RELEASE 2mo ago
IBM releases Granite 4.1 30B/8B/3B open models under Apache 2.0

IBM released Granite 4.1 as three open instruct models, with third parties quickly surfacing token-efficiency results and deployment options. The update matters for teams evaluating smaller open models for agent workloads where output-token burn and openness both affect production cost.

NEWS 2mo ago
DeepSeek cuts input cache-hit price 90% to $0.003625 per 1M tokens

DeepSeek said cache-hit pricing across its API series is now one-tenth of launch levels, on top of the temporary V4-Pro discount through May 5. The cut lowers costs for cache-heavy long-context and agent workloads, so teams should recheck spend assumptions.
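How much a cheaper cache-hit rate lowers an input bill depends on the share of tokens served from cache. A rough sketch follows, using the $0.003625 per 1M cache-hit price from the headline; the cache-miss price and the hit ratios are hypothetical inputs, not figures from DeepSeek.

```python
CACHE_HIT_USD_PER_MTOK = 0.003625  # cache-hit input price from the headline


def input_cost(tokens, hit_ratio, miss_usd_per_mtok):
    """Return the USD input cost for `tokens`, given the cached fraction.

    miss_usd_per_mtok is the cache-miss input price per million tokens,
    supplied by the caller because the story does not state it.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    hit_tokens = tokens * hit_ratio
    miss_tokens = tokens - hit_tokens
    total = hit_tokens * CACHE_HIT_USD_PER_MTOK + miss_tokens * miss_usd_per_mtok
    return total / 1_000_000


# Hypothetical workload: 100M input tokens at an assumed $0.50 miss price.
for ratio in (0.0, 0.5, 0.9):
    print(ratio, round(input_cost(100_000_000, ratio, 0.50), 2))
```

Because the hit price is so small, the miss fraction dominates the bill, so raising the hit ratio matters more than any further cut to the hit price.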

NEWS 2mo ago
Claude Code raises Opus 4.7 subscriber limits after token burn increases

Anthropic raised Claude subscriber limits and shipped Claude Code 2.1.112 after Opus 4.7's adaptive thinking and tokenizer changes increased token use. Users still report fast quota depletion and inconsistent cache or effort behavior across web and CLI sessions.

RELEASE 2mo ago
Claude Code updates desktop app with side-by-side sessions and integrated terminal

Anthropic rebuilt Claude Code on desktop into a drag-and-drop multi-session workspace with file editing, HTML and PDF preview, and sidebar session management. The same rollout also shipped 2.1.108 features, including an optional 1-hour cache TTL, recap, and new built-ins that affect cost and session handoff.

NEWS 2mo ago
Claude Code users report a 5-minute cache TTL and 5x Pro Max quota burn in 1.5 hours

Anthropic acknowledged a March 6 cache optimization change, and Pro Max users report that the shorter TTL plus hidden session context now burns through Claude Code quota much faster. Watch for 500 errors and stalled streams, and apply the 2.1.105 patch if your UI hangs.

NEWS 2mo ago
OpenAI launches $100 ChatGPT Pro tier with 5x more Codex usage

OpenAI added a $100 ChatGPT Pro tier with 5x more Codex usage than Plus and kept the $200 tier as the highest-capacity option. The new tier resets Codex limits again and temporarily doubles Pro usage through May 31.

RELEASE 2mo ago
Anthropic adds beta advisor tool to Messages API for Opus calls

Anthropic added a beta advisor tool to the Messages API so Sonnet or Haiku can call Opus mid-run inside one request. Anthropic says Sonnet plus Opus scored 2.7 points higher on SWE-bench Multilingual while cutting per-task cost 11.9%.

RELEASE 2mo ago
Google releases Veo 3.1 Lite in Gemini API at $0.05 per second

Google released Veo 3.1 Lite in Gemini API and AI Studio with 720p and 1080p output, 4-8 second clips, and text-to-video plus image-to-video support. Watch the April 7 Veo 3.1 Fast pricing drop if you need lower video generation costs.

NEWS 3mo ago
TurboQuant updates 2.5-bit mixed precision with PyTorch and llama.cpp ports

New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and working PyTorch and llama.cpp implementations. The technique is moving from a research claim into deployable KV-cache compression with concrete porting details.

NEWS 3mo ago
MiniMax introduces Token Plan for flat-rate text, speech, music, video, and image APIs

MiniMax introduced a flat-rate Token Plan that covers text, speech, music, video, and image APIs under one subscription. It gives teams one predictable bill across modalities and can be used in third-party harnesses, not just MiniMax apps.

NEWS 3mo ago
MiniMax M2.7 ranks #5 on PinchBench at $0.30 per million input tokens

Kilo said MiniMax M2.7 placed fifth on PinchBench, 1.2 points behind Opus 4.6 at much lower input cost, while community tests showed strong multi-loop agent behavior on graphics tasks. If you route coding-agent traffic by price, M2.7 looks worth a controlled bake-off.

RELEASE 3mo ago
Imbue releases Offload to split Playwright runs across 200 Modal sandboxes

Imbue open-sourced Offload, a Rust CLI that spreads test suites across local or Modal sandboxes from one TOML config. It is useful when agent-heavy teams are bottlenecked on verification instead of generation, especially in browser or CI-heavy stacks.

RELEASE 3mo ago
Unsloth releases Studio: local training UI for 500+ models with 70% less VRAM

Unsloth Studio launched as an open-source web UI to run, fine-tune, compare, and export local models, with file-to-dataset workflows and sandboxed code execution. Try it if you want to move prototype training and evaluation off cloud notebooks and onto local or rented boxes.

RELEASE 3mo ago
Hankweave adds runtime budgets for dollars, tokens, and wall-clock limits

Hankweave shipped budget controls that cap spend, tokens, and elapsed time globally or per step, including loop budgets and shared pools. Use them to prototype or productionize long agent runs without hand-managing model switches and failure states.
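A budget that caps dollars, tokens, and elapsed time at once can be sketched in a few lines. This is not Hankweave's API: the `Budget` class, its fields, and the `BudgetExceeded` exception are hypothetical names used only to show the idea of a single global cap.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when any of the three limits is crossed."""


@dataclass
class Budget:
    max_usd: float
    max_tokens: int
    max_seconds: float
    spent_usd: float = 0.0
    spent_tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, usd, tokens):
        """Record one step's usage, then enforce all limits."""
        self.spent_usd += usd
        self.spent_tokens += tokens
        self.check()

    def check(self):
        """Raise BudgetExceeded if dollars, tokens, or wall-clock ran out."""
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded("dollar budget exceeded")
        if self.spent_tokens > self.max_tokens:
            raise BudgetExceeded("token budget exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock budget exceeded")


budget = Budget(max_usd=1.00, max_tokens=50_000, max_seconds=600)
try:
    for step in range(10):
        budget.charge(usd=0.30, tokens=8_000)  # hypothetical per-step usage
except BudgetExceeded as err:
    print("stopping agent loop:", err)
```

Per-step budgets and shared pools, which the story also mentions, could be layered on by giving each step its own `Budget` and charging a parent alongside it.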

WORKFLOW 3mo ago
oMLX supports Claude Code locally with tiered KV cache and Anthropic Messages API

oMLX now supports local Claude Code setups on Apple Silicon with tiered KV cache and an Anthropic Messages API-compatible endpoint, with one setup reporting roughly 10x faster performance than mlx_lm-style serving. If you want private on-device coding agents, point Claude Code at a local compatible endpoint and disable the attribution header to preserve cache reuse.

NEWS 3mo ago
Ollama updates cloud to NVIDIA B300 for Kimi K2.5 and GLM-5 on $0, $20, and $100 plans

Ollama says its cloud now runs Kimi K2.5 and GLM-5 on NVIDIA B300 hardware while keeping fixed $0, $20, and $100 plans. Try it if you want hosted open models with more predictable spend for always-on agent workloads.

NEWS 3mo ago
Tiiny claims pocket AI server runs local 120B models with an OpenAI-compatible API

Tiiny claims its pocket-sized local AI server can run open models up to 120B and expose an OpenAI-compatible local API without token fees. Privacy-sensitive teams should validate throughput and model quality before deploying always-on local agents.

NEWS 3mo ago
Epoch AI reports top chip designers used about 90% of HBM and CoWoS supply in 2025

Epoch AI estimates that NVIDIA, Google, AMD, and Amazon consumed nearly all high-bandwidth memory and advanced packaging tied to frontier AI chips in 2025. Track this if you are planning compute, custom silicon, or open-weight infrastructure strategy.

NEWS 3mo ago
Google adds Gemini API spend caps in AI Studio with project-level dollar limits

Google AI Studio now lets developers set experimental per-project spend caps for Gemini API usage. Use it as a native billing guardrail, but account for roughly 10-minute enforcement lag and possible batch-job overshoot.
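The enforcement lag implies that a cap can be overshot by roughly the spend rate multiplied by the lag. The sketch below estimates that overshoot and a more conservative cap, assuming a constant burn rate; the function names and the example figures are hypothetical, and the actual enforcement behavior is not specified beyond the roughly 10-minute lag.

```python
ENFORCEMENT_LAG_MINUTES = 10  # approximate lag reported in the story


def worst_case_overshoot(usd_per_minute, lag_minutes=ENFORCEMENT_LAG_MINUTES):
    """Estimate the spend that can accrue after the cap is hit, before enforcement."""
    return usd_per_minute * lag_minutes


def safe_cap(target_usd, usd_per_minute, lag_minutes=ENFORCEMENT_LAG_MINUTES):
    """Return a cap low enough that target_usd holds despite the lag."""
    return max(0.0, target_usd - worst_case_overshoot(usd_per_minute, lag_minutes))


# Hypothetical project spending $2 per minute with a $100 true limit.
print(worst_case_overshoot(2.0))
print(safe_cap(100.0, 2.0))
```

Batch jobs can spend in bursts rather than at a steady rate, so the peak rate is the safer input here.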

RELEASE 3mo ago
Hugging Face launches Storage Buckets for mutable checkpoints, logs, and agent traces

Hugging Face introduced Storage Buckets, a mutable S3-like repo type for checkpoints, processed data, logs, and traces that do not fit Git workflows. Use it to move overwrite-heavy or high-volume artifacts out of versioned repos without leaving the Hub.

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.