Reddit posts say v5.5 improved voice tone but still ignores gender-labeled sections, switches singers mid-part, and struggles with detailed instrument instructions. Creators are iterating on renders until the emotion fits, then generating lipsync video to work around the gaps.

Suno's own pitch centered on expression and control in the v5.5 launch post, while the Song Editor announcement promised section rewrites and replacements down to individual beats. The Reddit complaints cut straight at that promise: gender-labeled verses still drift, instrument cues still get ignored, and users are burning credits on rerolls. Meanwhile, a Hugging Face workflow pack for LTX 2.3 points to the adjacent workaround culture, where creators split audio generation and video generation into separate stages.
The clearest complaint was not that duet prompting fails completely, but that it fails inconsistently. The original post described the same two misses over and over: the wrong voice starts a labeled section, or the right voice starts and then switches away before the section ends.
The most concrete community formatting suggestion came from a reply inside that same thread, which told users to number verses in the style box and tag the lyrics with labels like [verse 1 female], [verse 2 male], and [verse 3 male & female]. Another experienced commenter in the thread said they now keep gender prompts almost entirely in the lyrics, using simple labels like [Verse - Male] and [Chorus - Female], because extra instructions in the style box seem to confuse the model.
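For illustration, here is roughly what that tagged layout looks like when assembled. This is a minimal sketch in Python: the two strings mirror Suno's style box and lyrics box as the thread describes them, but the build_prompt helper itself is hypothetical, not a Suno API, and the lyric lines are placeholders.

```python
# Hypothetical helper illustrating the numbered-verse convention from the
# thread. The field names ("style", "lyrics") stand in for Suno's two
# prompt boxes; nothing here calls a real API.

def build_prompt() -> dict:
    """Assemble a duet prompt using the tagged layout suggested in the thread."""
    # Style box: number the verses and keep the rest of the description terse.
    style = "duet, pop ballad, verse 1 female, verse 2 male, verse 3 male & female"

    # Lyrics box: repeat the gender labels as section tags so they sit
    # directly above the words they govern. The second commenter's
    # alternative keeps labels only here, e.g. [Verse - Male], [Chorus - Female].
    lyrics = "\n".join([
        "[verse 1 female]",
        "First verse lyrics here...",
        "",
        "[verse 2 male]",
        "Second verse lyrics here...",
        "",
        "[verse 3 male & female]",
        "Third verse lyrics here...",
    ])
    return {"style": style, "lyrics": lyrics}


if __name__ == "__main__":
    prompt = build_prompt()
    print(prompt["style"])
    print(prompt["lyrics"])
```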
Even that advice came with a shrug. One commenter in the duet thread said the tagged format worked better than other prompt layouts, but still got the duet wrong in about two of six generations.
That tradeoff showed up again in the second Reddit thread. The original poster said v5.5's sound quality and voices were improved, but called the model "incredibly limited" at following specific directions for instruments, styles, and composition.
Replies in the discussion split along the same line as the original post: praise for the improved sound on one side, frustration with ignored directions on the other.
That lines up awkwardly with Suno's official framing. The launch post for v5.5 describes the model as Suno's "best and most expressive" release yet, while the Song Editor post promises lyric replacement, section reworks, and beat-level editing control.
The third post was not about Suno directly, but it captured the production workaround that keeps surfacing around current AI music tools. The creator's note was blunt: get the audio right first, keep rerendering until the emotion fits, then handle the lipsync video afterward.
The linked LTX 2.3 workflow collection is built around ComfyUI-style video pipelines, which makes the split explicit. Audio generation and facial performance are separate problems, and creators are increasingly treating them that way when prompt-level control inside a single music model stays unreliable.
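To make that two-stage split concrete, here is a minimal control-flow sketch of the workflow the creator described: reroll audio until a take lands emotionally, then run the video stage exactly once. Every function below is a hypothetical placeholder, not a real Suno, LTX, or ComfyUI call; only the ordering and the reroll loop reflect the described workflow.

```python
# Sketch of the audio-first, video-second workflow. All three helpers are
# stubs to be replaced with real tooling; they exist only to show the
# control flow the post describes.

from pathlib import Path


def generate_audio_take(prompt: str, take: int) -> Path:
    """Placeholder for one audio render (e.g. a Suno generation)."""
    raise NotImplementedError("call your audio generator here")


def emotion_fits(audio: Path) -> bool:
    """Placeholder for the human listen-and-judge step."""
    raise NotImplementedError("creator reviews the take here")


def render_lipsync(audio: Path, workflow: Path) -> Path:
    """Placeholder for the video stage (e.g. a ComfyUI/LTX pipeline)."""
    raise NotImplementedError("queue the video pipeline here")


def produce(prompt: str, workflow: Path, max_takes: int = 10) -> Path:
    # Stage 1: reroll audio until a take lands, within a credit budget.
    for take in range(max_takes):
        audio = generate_audio_take(prompt, take)
        if emotion_fits(audio):
            break
    else:
        raise RuntimeError("no take accepted within the credit budget")

    # Stage 2: facial performance is a separate problem, solved only
    # after the audio is locked.
    return render_lipsync(audio, workflow)
```

The design point is the boundary: nothing in stage 2 can change the audio, so credits spent on rerolls are isolated from the (typically slower, costlier) video renders.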