Sentence Transformers 5.5.0 adds train-sentence-transformers skill with one-shot 0.8856 NDCG@10
Sentence Transformers 5.5.0 adds an agent skill for fine-tuning embeddings, rerankers, and sparse encoders from Claude Code, Codex, Cursor, and Gemini CLI. The author reports a one-shot German embedding run rising from 0.6720 to 0.8856 NDCG@10 on a local PC.

TL;DR
- tomaarsen's release thread says Sentence Transformers 5.5.0 is built around a new train-sentence-transformers agent skill that lets coding agents fine-tune embedding, reranker, and sparse encoder models.
- According to tomaarsen's install example, the skill already targets Claude Code, Codex, Cursor, Gemini CLI, and other agent shells through hf skills add train-sentence-transformers.
- In tomaarsen's benchmark post, a one-shot German embedding run on a local PC moved from 0.6720 to 0.8856 NDCG@10 in about 30 minutes.
- The release adds two new training losses, covered in tomaarsen's EmbedDistillLoss post and tomaarsen's ADRMSELoss post: one for embedding distillation and one for LLM-distilled reranking.
- The rest of the release is less flashy but useful: tomaarsen's processing_kwargs note adds per-call processor overrides, while tomaarsen's fixes roundup patches padding, distributed training, and DeepSpeed-adjacent rough edges.
You can jump straight to the release notes, inspect the one-shot German model, and browse tomaarsen's training checklist for the parts the skill now covers, from hard-negative mining to Matryoshka training. tomaarsen's model-upload note also claims the agent completed the training and Hub upload without manual README or upload steps.
train-sentence-transformers
The headline feature is a Hugging Face skill package that turns agentic coding shells into Sentence Transformers training wrappers. The install surface is tiny, hf skills add train-sentence-transformers, and the prompt surface is equally blunt: describe the model you want, then let the agent assemble the run.
tomaarsen's training checklist says the skill ships guidance across all three supported model families. The covered pieces are:
- base model selection
- loss and evaluator choice
- hard-negative mining
- distillation
- LoRA
- Matryoshka training
- multilingual setups
- static embeddings
- template training scripts the agent can adapt
That is the interesting bit. This is not just a single canned finetune command; it is a packaged set of training heuristics and scripts aimed at the fiddly parts people usually look up from old notebooks or example repos.
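To make the shape of that concrete, here is a minimal sketch of the sort of script such a run boils down to. It uses the library's standard SentenceTransformerTrainer API; the dataset name, base checkpoint, and hyperparameters are placeholder assumptions, not anything taken from the skill itself.

```python
# Hypothetical sketch of the kind of script the skill assembles.
# Dataset, base model, and hyperparameters below are illustrative placeholders.
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base model selection: any encoder checkpoint works as a starting point.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# (anchor, positive) pairs; hard-negative mining would add a negatives column.
train_dataset = load_dataset("your-org/german-retrieval-pairs", split="train")  # placeholder dataset

# Loss choice: in-batch negatives are the usual default for pair data.
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="models/german-retrieval",
    num_train_epochs=1,
    per_device_train_batch_size=64,
    learning_rate=2e-5,
    bf16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
model.push_to_hub("your-org/german-retrieval-embedder")  # the step the agent reportedly also handles
```

Hard-negative mining, Matryoshka wrappers, LoRA, and evaluators each add a few more lines on top of this skeleton, which is exactly the boilerplate the checklist says the skill covers.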
German retrieval run
The only concrete result in the evidence pool is a German retrieval run that tomaarsen's benchmark post says Claude Code completed on the author's own PC in about 30 minutes. The reported metric moved from 0.6720 to 0.8856 NDCG@10.
The follow-up matters almost as much as the score. tomaarsen's model-upload note says the resulting model was uploaded in a fully one-shot flow, without manually editing the model README or handling the upload step, and that the run can still be interrupted for edits, extra experiments, or SLURM handoff.
New losses
The release also adds two new objective functions:
- EmbedDistillLoss for SentenceTransformer: tomaarsen's EmbedDistillLoss post says it matches student embeddings directly against precomputed teacher embeddings, instead of distilling teacher scores as MarginMSELoss does. It also supports an optional learnable projection when teacher and student embedding dimensions differ (see the sketch below).
- ADRMSELoss for CrossEncoder: tomaarsen's ADRMSELoss post describes it as a listwise learning-to-rank loss from the Rank-DistiLLM paper, aimed at reranker training from an LLM's document ordering.
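For the embedding distillation case, here is a hedged sketch of how EmbedDistillLoss could slot into a run. The constructor arguments and the use of the label column for precomputed teacher embeddings are assumptions modeled on the library's existing distillation losses, not checked against the 5.5.0 source.

```python
# Hedged sketch: the EmbedDistillLoss arguments and expected dataset columns
# are assumptions; only "student embeddings matched against precomputed
# teacher embeddings" comes from the release notes.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

teacher = SentenceTransformer("BAAI/bge-m3")                      # larger teacher (illustrative)
student = SentenceTransformer("intfloat/multilingual-e5-small")   # smaller student (illustrative)

dataset = load_dataset("your-org/distillation-corpus", split="train")  # placeholder: one "text" column

# Precompute teacher embeddings once, so the teacher never runs during training.
dataset = dataset.map(
    lambda batch: {"label": teacher.encode(batch["text"]).tolist()},
    batched=True,
)

# Assumed constructor; the optional projection for mismatched dimensions is
# mentioned in the release, but how it is configured here is a guess.
loss = losses.EmbedDistillLoss(student)

trainer = SentenceTransformerTrainer(model=student, train_dataset=dataset, loss=loss)
trainer.train()
```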
For people training retrieval stacks rather than just serving them, that makes 5.5.0 more than an agent-skill release.
processing_kwargs
A smaller API change in tomaarsen's processing_kwargs note gives encode() and predict() a per-call processing_kwargs override. That means max length, image resolution, or video FPS can change for one invocation without rebuilding the model object.
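A rough sketch of how that per-call override might read; the processing_kwargs name comes from the note, but the individual keys shown (max_length, truncation) are processor-dependent and purely illustrative.

```python
# Sketch based on the release note: encode()/predict() accept a per-call
# processing_kwargs override. The inner keys are tokenizer/processor-dependent
# and shown here only as illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")  # illustrative checkpoint

# Default call: uses whatever the model's processor was configured with.
default_emb = model.encode(["Ein kurzer Satz."])

# One-off override: cap tokenization length for this call only,
# without rebuilding or mutating the model object.
short_emb = model.encode(
    ["Ein sehr langer Satz, der abgeschnitten werden soll."],
    processing_kwargs={"max_length": 64, "truncation": True},
)
```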
The fixes list in tomaarsen's fixes roundup and tomaarsen's DeepSpeed and loading fixes is mundane in the good way:
- CLS pooling now picks the first real token with left-padding tokenizers, rather than silently taking a PAD token.
- AdaptiveLayerLoss and CrossEncoder losses now train under DistributedDataParallel and torch.compile.
- model.config now delegates to the underlying Transformers config, which tomaarsen says improves DeepSpeed ZeRO behavior.
- Loading from local paths and private trust_remote_code repos is more robust.
Those are the kinds of changes that disappear in launch summaries and show up later as fewer weird bug reports.