IBM releases Granite 4.1 30B/8B/3B open models under Apache 2.0
IBM released Granite 4.1 as three open instruct models, and third parties quickly surfaced token-efficiency numbers and day-one deployment options. The update matters for teams evaluating smaller open models for agent workloads, where output-token burn and openness both affect production cost.

TL;DR
- IBM shipped Granite 4.1 as three dense open-weight instruct models (30B, 8B, and 3B) under Apache 2.0, with official docs positioning them as upgrades over Granite 4.0 in tool calling, instruction following, coding, and math, per ArtificialAnlys' launch summary and the Granite 4.1 docs.
- The cleanest number in the release came from ArtificialAnlys' efficiency chart, which put Granite 4.1 8B at about 4 million output tokens on its Intelligence Index run, versus 78 million for Qwen3.5 9B.
- The tradeoff is capability, because ArtificialAnlys' benchmark breakdown and the Granite 4.1 blog post both frame these as small enterprise models, not frontier open models trying to top every leaderboard.
- Availability moved fast: wandb's day-zero listing priced Granite 4.1 8B at $0.05 per 1M input tokens and $0.10 per 1M output tokens, while replicate's links post pushed both the language and speech variants live the same day.
You can read IBM's official launch post, skim the Granite 4.1 docs, and jump straight to the Hugging Face 8B model card. The odd little tell is that third parties spent more time on token burn than raw score, while IBM used the same drop to ship speech, vision, embeddings, and Guardian updates in one bundle.
Dense language models
IBM's docs describe Granite 4.1 as a dense family with 3B, 8B, and 30B sizes, each offered in base and instruct variants, plus optional FP8 quantization for cheaper deployment, per the Granite 4.1 docs. The 8B model card adds that IBM reworked post-training with supervised finetuning and reinforcement learning alignment, and lists 12 supported languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese, per the Hugging Face 8B model card.
The official positioning is narrow and clear:
- 3B: edge and resource-constrained deployments, per the docs
- 8B: general-purpose enterprise applications, per the docs
- 30B: higher-capacity model for more complex tasks, per the docs
Artificial Analysis scored the three instruct models at 15, 12, and 9 on its Intelligence Index for 30B, 8B, and 3B respectively, according to ArtificialAnlys' launch thread. That leaves Granite 4.1 below peers like Qwen3.5 and Gemma on raw composite score, but ahead of its own 4.0 predecessors in the 30B and 3B slots, with ArtificialAnlys' evaluation breakdown highlighting the biggest 30B gains in tool use and agentic tasks.
Output-token profile
The most distinctive Granite 4.1 claim is not benchmark leadership. It is output efficiency.
According to ArtificialAnlys' launch thread, Granite 4.1 8B used roughly 4 million output tokens for the Intelligence Index, compared with 13 million for Ministral 3 8B, 8 million for Gemma 4 E4B, and 78 million for Qwen3.5 9B. The same pattern showed up across the family, with the 30B model at 4.6 million tokens and the 3B model at 2.7 million.
That framing matters because Artificial Analysis is effectively describing score per token budget, not just score. IBM's own post leans into the same lane, calling out improvements in tool calling and instruction following for real-world systems rather than a chase for top-end reasoning scores, per the official launch post.
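The gap is easiest to see as a ratio. A minimal sketch using only the output-token counts cited above (the figures are the launch-thread numbers, not independently measured):

```python
# Output tokens (in millions) each small model reportedly burned to complete
# the Artificial Analysis Intelligence Index run, per ArtificialAnlys' launch thread.
index_output_tokens_m = {
    "Granite 4.1 8B": 4.0,
    "Gemma 4 E4B": 8.0,
    "Ministral 3 8B": 13.0,
    "Qwen3.5 9B": 78.0,
}

# Express each model's burn as a multiple of Granite 4.1 8B's.
baseline = index_output_tokens_m["Granite 4.1 8B"]
for model, tokens_m in sorted(index_output_tokens_m.items(), key=lambda kv: kv[1]):
    print(f"{model}: {tokens_m}M output tokens ({tokens_m / baseline:.1f}x the Granite 4.1 8B burn)")
```

On these numbers, Qwen3.5 9B emits roughly 19.5x the output tokens of Granite 4.1 8B for the same evaluation suite, which is the whole point of the efficiency chart: at equal per-token prices, the same Index run costs an order of magnitude more to produce.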
The per-benchmark chart makes the shape of the release easier to read:
- Granite 4.1 30B posted its clearest gains on τ²-Bench Telecom and GDPval-AA, according to ArtificialAnlys' benchmark breakdown
- The family remained middling on harder science and QA evals like GPQA Diamond and Humanity's Last Exam, according to ArtificialAnlys' benchmark breakdown
- All three models kept the long context and tool-use story central, which is also how the 8B model card describes intended use
Openness and access
On openness, Artificial Analysis gave all three Granite 4.1 models a 61 on its Openness Index, ahead of Qwen3.5, Gemma 4, and Mistral Small 4, according to ArtificialAnlys' openness follow-up. The cited reason was not just Apache 2.0 weights, but extra disclosure around pre-training data, post-training data, and methodology.
The practical rollout was immediate:
- W&B Inference listed Granite 4.1 8B on day one at $0.05 per 1M input tokens and $0.10 per 1M output tokens, with 131K context, according to wandb's launch post
- Replicate linked live pages for both Granite 4.1 language and Granite Speech 4.1 on launch day, according to replicate's links post
- Replicate's model page says the 8B model is aimed at summarization, extraction, question answering, RAG, code tasks, function calling, and multilingual dialogue
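Those W&B prices make the token-efficiency claim concrete. A back-of-envelope sketch combining the listed $0.05/$0.10 per 1M token prices with the launch-thread burn figure (illustrative only; it ignores input-token volume for the run and peer models' own, different prices):

```python
# W&B Inference day-one pricing for Granite 4.1 8B, per wandb's launch post.
PRICE_PER_M_INPUT = 0.05   # USD per 1M input tokens
PRICE_PER_M_OUTPUT = 0.10  # USD per 1M output tokens

def output_cost_usd(output_tokens_m: float) -> float:
    """Cost of the output tokens alone, in USD, for a burn given in millions."""
    return output_tokens_m * PRICE_PER_M_OUTPUT

# The ~4M output tokens the 8B model reportedly used on the Intelligence Index
# would cost about $0.40 in output tokens at this listing.
index_run_cost = output_cost_usd(4.0)
print(f"Output-token cost of the Index run at W&B prices: ${index_run_cost:.2f}")

# Same price applied to a hypothetical 78M-token burn, for scale.
print(f"Same burn profile as the 78M-token run: ${output_cost_usd(78.0):.2f}")
```

The second figure is a what-if at Granite's price, not Qwen3.5 9B's actual serving cost; it just shows why reviewers fixated on token burn rather than raw score.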
That makes Granite 4.1 a familiar IBM move: permissive weights, more documentation than most peers, and fast placement on hosted inference surfaces instead of a Hugging Face-only drop.
Speech and vision
Granite 4.1 was not only a text-model refresh. IBM's launch post says the same release bundled speech, vision, embeddings, and Guardian updates, with Granite Speech pitched on transcription accuracy and Granite Vision on chart and table extraction, per the official launch post.
Replicate's speech listing adds a concrete detail the main language-model chatter skipped: Granite Speech 4.1 2B supports multilingual ASR and bidirectional speech translation across English, French, German, Spanish, Portuguese, and Japanese, with Apache 2.0 licensing and variants for speaker attribution and higher-throughput non-autoregressive decoding, per the Granite Speech 4.1 2B page. Separately, vLLM's v0.20.0 thread and the matching release notes show Granite 4.1 Vision already landed as built-in support in vLLM v0.20.0, so the ecosystem pickup extended beyond the 8B text model that got most of the launch-day attention.
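With built-in vLLM support, local serving should reduce to the standard vLLM CLI. A sketch, with the caveat that the Hugging Face model id below is a guess at the naming convention and should be checked against the actual model card before use:

```shell
# Hypothetical model id -- confirm against the Hugging Face 8B model card.
MODEL="ibm-granite/granite-4.1-8b-instruct"

# Standard vLLM OpenAI-compatible server; --max-model-len caps context near
# the 131K figure from wandb's listing (trim to fit available GPU memory).
vllm serve "$MODEL" --max-model-len 131072

# Then hit it with any OpenAI-compatible client:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "'"$MODEL"'", "messages": [{"role": "user", "content": "Extract the action items from this transcript."}]}'
```

Nothing here is Granite-specific; that is the point of the vLLM pickup, since the model slots into the same serving path teams already use for Qwen or Gemma checkpoints.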