Skip to content
AI Primer
workflow

Codex supports open-weight models via Ollama, vLLM, and Responses-compatible endpoints

Codex workflows can now run against open-weight models served through compatible Responses API endpoints, with Ollama and vLLM publishing direct paths for GLM-5.2 and Kimi K2.7 Code. That matters because teams can keep the Codex interface while swapping to self-hosted or lower-cost inference backends.

3 min read
Codex supports open-weight models via Ollama, vLLM, and Responses-compatible endpoints
Codex supports open-weight models via Ollama, vLLM, and Responses-compatible endpoints

TL;DR

  • thsottiaux's reminder said the Codex app, CLI, and SDK can target open-weight models, not just OpenAI-hosted ones.
  • Ollama's launch post gave the most direct path: ollama launch codex and ollama launch codex-app, with GLM-5.2 and Kimi K2.7 Code as named examples.
  • vLLM's guide thread framed the same idea around self-hosting: serve a tool-calling model behind an OpenAI Responses API-compatible endpoint, then point Codex at it.
  • Cedric Chee's summary and reach_vb's reply both narrowed the actual requirement to API compatibility, not any specific backend.

You can jump from the Codex model note to vLLM's guide and its serving recipe, while Ollama's post reduces the whole setup to two launch commands. The interesting bit is that the packaging differs, but the contract is the same: if a model exposes a compatible Responses API and supports tool calling, Codex can sit on top.

Ollama

Ollama published the cleanest demo of the rollout. Its post names GLM-5.2 and Kimi K2.7 Code explicitly, then shows ollama launch codex for the CLI and ollama launch codex-app for the app.

That makes the story less about model support matrices and more about transport. Codex keeps its interface, while Ollama handles the local model and endpoint plumbing underneath.

vLLM

vLLM pushed the self-hosted version harder. Its thread says Codex can talk to models served on your own GPUs across NVIDIA, AMD, and other hardware, as long as the server speaks the same OpenAI Responses API that Codex expects.

The examples line up with Ollama's. vLLM's guide thread names GLM 5.2, Kimi K2.7 Code, and MiniMax M3 as drop-in options, then links out to both a setup guide and a serving recipe.

Responses API compatibility

The narrow requirement appears in two separate posts. thsottiaux's reminder says Codex works with any open source model, not only OpenAI models, and Cedric Chee's summary compresses that into one sentence: any open source model that exposes a compatible Responses API endpoint can work.

For engineers, that turns Codex into more of a front end and agent harness than a hard binding to one inference provider. The model choice moves behind the endpoint, provided the replacement can actually handle the tool-calling workflow Codex uses.

Earlier than this week's demos

One useful wrinkle is that this may be less of a brand-new capability than a newly visible path. In reach_vb's reply, OpenAI's Varun Bansal said Codex "always did" work with llama.cpp, Ollama, and MLX too.

That lines up with the rest of the evidence. This week's change looks like better-publicized recipes and named model examples, not a fresh lock removal.

Share on X