Tiiny claims pocket AI server runs local 120B models with an OpenAI-compatible API
Tiiny claims its pocket-sized local AI server can run open models of up to 120B parameters and expose an OpenAI-compatible local API with no token fees. Privacy-sensitive teams should validate throughput and model quality before deploying always-on local agents.

TL;DR
- Tiiny is being pitched as a "phone-sized" local inference box that can run open models "up to 120B" without cloud APIs, according to the launch thread.
- The company and an early tester frame it as a drop-in local backend for agents and chat apps, with the Kickstarter summary saying it exposes an OpenAI-compatible API and charges no token fees.
- In the thread, the cited use cases include powering an "agent like OpenClaw 24/7," replacing a chatbot, and handling "anything that requires an API."
- The available evidence is still promotional: the demo video summary describes local LLM, TTS, and text-to-image workflows, but it publishes no throughput, latency, memory-limit, or quality figures for specific 120B-class models.
What is Tiiny claiming?
Tiiny's core claim is a pocket-sized device that acts as a personal inference server for open-source AI models. In the launch thread, Paul Couvert says it can run models "up to 120B," stay "100% local and private," and serve workloads that normally sit behind a hosted API.
The linked Kickstarter page, as summarized in the project post, adds the implementation detail engineers will care about: Tiiny is presented as an OpenAI-compatible local API endpoint with "one-click deployment" and "no token fees." That positions it less like a standalone app and more like a small edge box that could slot into existing agent or chat stacks with minimal client-side changes.
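If the compatibility claim holds up, switching an existing client over should amount to changing a base URL. Here is a minimal sketch using the official `openai` Python package; the local address, port, and model name are hypothetical, since none of them are documented in the launch thread:

```python
from openai import OpenAI

# Hypothetical endpoint and credentials: Tiiny's actual address and
# port are not published, and many local servers ignore the API key.
client = OpenAI(
    base_url="http://tiiny.local:8080/v1",  # assumed local endpoint
    api_key="not-needed-locally",
)

response = client.chat.completions.create(
    # Placeholder name for whatever 120B-class model the box serves.
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize today's standup notes."}],
)
print(response.choices[0].message.content)
```

Most agent frameworks that already speak the OpenAI API would need the same one-line change, which is why the compatibility claim matters more than the form factor.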
What can it do, and what is still missing?
The practical demos center on replacing hosted subscriptions with local inference. According to the demo summary, the box can run a local chat interface, support coding workflows, generate landing pages, and drive browser agents for scraping, form filling, and social posting. The same summary says it can also handle text-to-speech and text-to-image models, widening the pitch beyond a single LLM endpoint.
What the evidence does not establish is the performance envelope. Neither the thread nor the demo summary specifies tokens per second, quantization, concurrent request handling, power draw, or which 120B models were actually tested. For engineering teams, that leaves Tiiny as an interesting edge-serving claim with API-compatibility appeal, but without the benchmark detail needed to compare it against a Mac Studio, a local GPU box, or a managed inference endpoint.
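Until Tiiny publishes numbers, teams can get a first-order answer themselves: time a non-streaming completion against any OpenAI-compatible endpoint and divide completion tokens by wall-clock time. A rough sketch, again assuming the hypothetical local endpoint and model name from above:

```python
import time
from openai import OpenAI

# Assumed endpoint; replace with whatever address the device exposes.
client = OpenAI(base_url="http://tiiny.local:8080/v1", api_key="unused")

def tokens_per_second(model: str, prompt: str, max_tokens: int = 256) -> float:
    """Crude single-request throughput: completion tokens / wall-clock seconds.

    Lumps prompt processing in with generation and ignores concurrency,
    so treat the result as a floor, not a benchmark.
    """
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    return response.usage.completion_tokens / elapsed

print(f"{tokens_per_second('gpt-oss-120b', 'Explain RAID 5 in one paragraph.'):.1f} tok/s")
```

Run it at realistic context lengths and with a few concurrent clients before comparing against other hardware; single short prompts flatter small boxes.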