AI Primer

H Company releases Holotron-12B: 8.9k tok/s on H100 and 80.5% WebVoyager

H Company launched Holotron-12B, an open multimodal model for computer-use agents built on a hybrid SSM-attention stack that targets KV-cache bottlenecks. Benchmark it if you need high-concurrency browser agents and want better throughput without giving up web-task accuracy.


TL;DR

  • H Company launched Holotron-12B, an open multimodal model built with NVIDIA for "computer-use agents," and says it is tuned for web, Android, and mobile interaction workloads rather than generic vision-language chat.
  • The company says Holotron-12B is post-trained from Nemotron-Nano-12B-v2-VL and uses a hybrid SSM-attention stack that targets the KV-cache bottleneck to support higher concurrency.
  • On H Company's reported benchmarks, the model reaches 8.9k tokens/s on a single H100, runs at "over 2x" the throughput of Holo2-8B, and improves WebVoyager from 35.1% to 80.5%.
  • H Company also said in the partner update that it has early access to NVIDIA's Nemotron 3 Omni and plans to use its MoE base for future low-latency enterprise agent deployments.

What shipped for agent builders?

Holotron-12B is available now as an open model on Hugging Face via the model card, with a deeper product writeup in H Company's technical post. H Company describes it as a "high-throughput, open-source, multimodal model" built specifically for the "age of computer-use agents," a narrower positioning than a general-purpose VLM launch.

The architectural hook: Holotron-12B is post-trained from NVIDIA's open Nemotron-Nano-12B-v2-VL and uses a hybrid SSM-attention design to reduce the KV-cache bottleneck. H Company says that gives it the "linear scaling and high-concurrency performance" needed for online reinforcement learning and production agent workloads across browser and mobile environments.
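Why a hybrid stack helps concurrency comes down to arithmetic: full-attention layers must cache keys and values for every past token, while SSM layers carry a fixed-size state regardless of sequence length. The back-of-envelope sketch below uses hypothetical layer counts and head dimensions (H Company has not published these figures here) to show how replacing most attention layers shrinks the per-batch KV cache:

```python
# Back-of-envelope KV-cache sizing. All configuration numbers below are
# illustrative assumptions, not published Holotron-12B specs.

def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, batch,
                   dtype_bytes=2):
    """Bytes of KV cache held by the attention layers only.

    Factor of 2 covers the two cached tensors (keys and values);
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical comparison: a pure-attention 12B-class model with 40 layers
# vs. a hybrid that keeps only 6 attention layers and uses SSM blocks
# (whose state does not grow with seq_len) for the rest.
full = kv_cache_bytes(n_attn_layers=40, n_kv_heads=8, head_dim=128,
                      seq_len=32_768, batch=16)
hybrid = kv_cache_bytes(n_attn_layers=6, n_kv_heads=8, head_dim=128,
                        seq_len=32_768, batch=16)

print(f"full attention: {full / 2**30:.1f} GiB")   # 80.0 GiB
print(f"hybrid:         {hybrid / 2**30:.1f} GiB") # 12.0 GiB
```

Under these assumed numbers the cache shrinks roughly in proportion to the attention-layer count, which is exactly the headroom that lets a server pack larger batches onto the same H100.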

How strong are the reported speed and accuracy gains?

H Company's headline numbers are unusually deployment-oriented: "8.9k tokens/s on a single H100," "over 2x faster than Holo2-8B," and a much smaller memory footprint that allows larger effective batch sizes on the same hardware. For teams serving browser or UI agents, that matters more than a generic model-quality claim because concurrency and memory pressure usually dominate cost envelopes.
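To see why throughput dominates the cost envelope, it helps to turn the headline tokens/s figure into a serving cost. The sketch below combines the reported 8.9k tok/s with an assumed GPU rental rate and utilization factor (both illustrative, not from H Company):

```python
# Rough serving-cost sketch. Only TOKENS_PER_SEC comes from the article;
# the GPU price and utilization are illustrative assumptions.

H100_PRICE_PER_HOUR = 3.00   # assumed on-demand rate, USD
TOKENS_PER_SEC = 8_900       # reported aggregate throughput on one H100
UTILIZATION = 0.7            # assumed fraction of wall-clock at full load

tokens_per_hour = TOKENS_PER_SEC * 3600 * UTILIZATION
cost_per_million_tokens = H100_PRICE_PER_HOUR / tokens_per_hour * 1_000_000
print(f"~${cost_per_million_tokens:.3f} per 1M tokens")
```

Under these assumptions the cost lands around $0.13 per million tokens, and a 2x throughput gain halves it, which is why deployment-oriented numbers matter more to agent-serving teams than leaderboard deltas.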

The accuracy claim is also concrete. H Company says WebVoyager performance rose from 35.1% to 80.5%, suggesting the throughput work did not come with an obvious tradeoff on web-task execution. In the same thread, the company said it is an early-access partner for NVIDIA's Nemotron 3 Omni and expects that MoE foundation to push the next round of "reasoning and low-latency precision" for enterprise-scale autonomous computer-use systems.
