AI Primer
release

H Company launches Holo3 with 78.9% on OSWorld-Verified

H Company introduced Holo3, a computer-use model family with a 122B API model and an Apache 2.0 35B release on Hugging Face. Check the benchmark and pricing claims before assuming the model is ready for field deployment.


TL;DR

  • H Company says Holo3-122B-A10B hit 78.9% on OSWorld-Verified, ahead of the GPT-5.4 and Opus 4.6 points shown in its launch chart (H Company launch tweet).
  • The release is split in two: a 122B model served through H's API, and a lighter Holo3-35B-A3B model that H says is open on Hugging Face under Apache 2.0 (Release breakdown).
  • The benchmark table circulating with the launch shows the small model staying unusually close to the big one: 77.8% vs. 78.9% on OSWorld-Verified and 64.8% vs. 64.9% on WebArena (Benchmark table).
  • H's own corporate evals are where the gap opens up: Holo3 leads on E-Commerce, Business Software, and Collaboration, but Multi-Apps is a weaker spot for both Holo3 variants (Benchmark table).

The official launch post has more useful detail than the tweet thread, including H's "agentic learning flywheel," a 486-task in-house benchmark, and an example multi-app workflow that crosses PDFs, budgets, and email. You can also inspect the open 35B model card, and H published the same writeup as a Hugging Face blog post.

Holo3-122B and Holo3-35B

H Company launched two models at once. Holo3-122B-A10B is the flagship API model at $0.40 per million input tokens and $3.00 per million output tokens, while Holo3-35B-A3B is positioned as the lighter release at $0.25 per million input tokens and $1.80 per million output tokens (Release breakdown).
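To make the price difference concrete, here is a minimal Python sketch that applies the quoted per-million-token rates to a workload. The prices come from the release breakdown above; the function name `estimate_cost` and the sample token counts are illustrative assumptions, not anything H Company publishes.

```python
# Per-million-token list prices in USD, as quoted in the release breakdown.
PRICES = {
    "Holo3-122B-A10B": {"input": 0.40, "output": 3.00},
    "Holo3-35B-A3B": {"input": 0.25, "output": 1.80},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a workload at the quoted list prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 200k output tokens.
    for name in PRICES:
        print(f"{name}: ${estimate_cost(name, 2_000_000, 200_000):.2f}")
```

Real agent runs send screenshots and long action histories, so actual token counts per task may look very different from this sample.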

The company says all Holo3 models are available through its inference API, and the 35B weights are openly available on Hugging Face under Apache 2.0 with a free API tier in H's own stack, according to the launch post. The model card tags it as an image-text-to-text vision-language model for computer use and GUI agents, built as a finetune of Qwen3.5-35B-A3B.

OSWorld-Verified and WebArena

The launch chart makes the headline claim simple: Holo3 is being sold as a computer-use model that reaches frontier scores without frontier pricing. H's plot places Holo3-122B-A10B at 78.9% on OSWorld-Verified and Holo3-35B-A3B at 77.8%, while GPT-5.4 and Opus 4.6 sit at roughly similar scores but much farther right on cost (H Company launch tweet).

The fuller table adds three details that matter more than the scatter plot:

  • OSWorld-Verified: 78.9% for Holo3-122B, 77.8% for Holo3-35B, 72.5% for Claude Sonnet 4.6, 63.3% for Kimi-K2.5.
  • WebArena: 64.9% for Holo3-122B, 64.8% for Holo3-35B, 65.6% for Claude Sonnet 4.6, 63.4% for Kimi-K2.5.
  • UI grounding stays tighter than the computer-use gap suggests: on OSWorld-G, Holo3-122B scores 79.4% and Qwen3.5-397B-A17B scores 78.4% (Benchmark table).
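The gaps in the first two bullets are easy to check by hand, but a small Python sketch makes the comparison repeatable. The scores are copied from the table as reported here; the helper names `gap` and `leader` are illustrative assumptions.

```python
# Scores (percent) as reported in the benchmark table quoted above.
SCORES = {
    "OSWorld-Verified": {
        "Holo3-122B": 78.9,
        "Holo3-35B": 77.8,
        "Claude Sonnet 4.6": 72.5,
        "Kimi-K2.5": 63.3,
    },
    "WebArena": {
        "Holo3-122B": 64.9,
        "Holo3-35B": 64.8,
        "Claude Sonnet 4.6": 65.6,
        "Kimi-K2.5": 63.4,
    },
}


def gap(benchmark: str, a: str, b: str) -> float:
    """Return the score of model a minus model b, in percentage points."""
    return round(SCORES[benchmark][a] - SCORES[benchmark][b], 1)


def leader(benchmark: str) -> str:
    """Return the name of the top-scoring model on a benchmark."""
    return max(SCORES[benchmark], key=SCORES[benchmark].get)


if __name__ == "__main__":
    for bench in SCORES:
        delta = gap(bench, "Holo3-122B", "Holo3-35B")
        print(f"{bench}: 122B leads 35B by {delta} points; top model: {leader(bench)}")
```

On these numbers the two Holo3 variants are 1.1 points apart on OSWorld-Verified and 0.1 points apart on WebArena, where Claude Sonnet 4.6 holds the top score in the table.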

That near-tie between the two Holo3 variants is the interesting part. H is effectively arguing that most of the gain comes from specialized agent training, not just scaling the base model.

H Corporate benchmark

H's launch post says it built a proprietary "Synthetic Environment Factory" and a 486-task H Corporate benchmark to test enterprise workflows inside synthetic business software (Benchmark table). The four categories are listed in the post and table: E-Commerce, Business Software, Collaboration, and Multi-Apps.

The results split cleanly by task shape:

  • E-Commerce: Holo3-122B scores 94.8%, Holo3-35B scores 94.1%.
  • Business Software: 85.2% for 122B, 86.3% for 35B.
  • Collaboration: 72.3% for 122B, 76.0% for 35B.
  • Multi-Apps: 59.5% for 122B, 50.0% for 35B, behind Kimi-K2.5 at 64.3% and Claude Sonnet 4.6 at 69.0% (Benchmark table).

The official writeup gives one concrete example of what H means by Multi-Apps: pulling equipment prices from a PDF, checking each employee's remaining budget, then sending approval or rejection emails automatically. That is a much better read on Holo3's current ceiling than the OSWorld headline score, because it shows exactly where the model still bends under longer cross-application workflows.
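To show why this task shape is demanding, here is a minimal Python sketch of only the decision logic behind such a workflow. It is not H Company's implementation and involves no GUI control: the `Request` type, the `decide` function, the sample data, and the rule that an approval debits the remaining budget are all assumptions made for illustration. In the real task the agent would also have to extract the prices from a PDF, read budgets from another application, and send the emails.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A hypothetical equipment request from one employee."""
    employee: str
    item: str


def decide(requests, prices, budgets):
    """Return (employee, item, decision) tuples for each request.

    Assumption: an approved purchase reduces that employee's remaining budget.
    """
    remaining = dict(budgets)  # copy so the caller's data is not mutated
    decisions = []
    for req in requests:
        price = prices.get(req.item)
        if price is not None and price <= remaining.get(req.employee, 0.0):
            remaining[req.employee] -= price
            decisions.append((req.employee, req.item, "approve"))
        else:
            # Unknown item or insufficient budget.
            decisions.append((req.employee, req.item, "reject"))
    return decisions


if __name__ == "__main__":
    # Hypothetical data standing in for the PDF price list and budget sheet.
    prices = {"monitor": 300.0, "laptop": 1500.0}
    budgets = {"alice": 500.0, "bob": 1000.0}
    requests = [
        Request("alice", "monitor"),
        Request("bob", "laptop"),
        Request("alice", "monitor"),
    ]
    for row in decide(requests, prices, budgets):
        print(row)
```

The logic itself is a few lines; the difficulty the benchmark measures lies in carrying state correctly across three separate applications.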
