OpenAI reports accidental CoT grading touched GPT-5.4 Thinking in under 0.6% of samples
OpenAI said a new detector found accidental chain-of-thought grading in earlier Instant and mini models, and in less than 0.6% of GPT-5.4 Thinking samples. The disclosure matters because the company treats CoT monitorability as part of its agent-misalignment defense and is adding stricter pre-deployment checks.

TL;DR
- OpenAI says a new internal detector found accidental chain-of-thought grading in several released RL runs, including GPT-5.4 Thinking, and frames the issue as a monitorability failure because its announcement describes CoT monitors as part of its defense against agent misalignment.
- According to OpenAI's affected-models post and the OpenAI alignment writeup, GPT-5.4 Thinking was affected in less than 0.6% of samples, while some Instant and mini models saw higher but still limited exposure.
- The official post attributes the problem to three reward pathways: usefulness rewards, confirmation-question penalties, and prompt-injection penalties, with the worst case peaking below 3.8% of samples for GPT-5.1 Instant in OpenAI's breakdown.
- OpenAI says its follow-up analysis did not show clear monitorability degradation, while Redwood Research's external review said the evidence removed about 80% of the reviewer's negative update without ruling out all residual risk.
- In OpenAI's remediation post, the company says it fixed the affected reward paths and is adding real-time CoT-grading detection, stress tests, and stricter internal checks before deployment.
You can read the full alignment post, the external Redwood review, and OpenAI's earlier monitorability eval release. One useful buried detail in the alignment post is that the detector is essentially a large-scale scan for CoT text leaking into reward inputs, with alerts routed to Slack and a public internal channel. Another is that one caught failure mode did not come from grader config at all; it came from models accessing their own CoT through tools.
Detector
OpenAI's main new claim is operational, not theoretical: per the alignment post, it built an automated system that scans RL runs for traces of CoT text inside reward inputs, then alerts the run owner and logs the suspected case internally.
That matters because OpenAI's thread opener describes CoT monitors as a defense layer against agent misalignment, and the same alignment post says direct grading of CoT is one of the clearest ways to make those traces less informative.
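The post doesn't publish the detector's implementation, but the mechanism it describes amounts to checking whether spans of the chain-of-thought reappear inside whatever text the grader actually saw. A minimal sketch of that idea in Python, with every name hypothetical:

```python
def _ngrams(text: str, n: int = 8) -> set[str]:
    """Word n-grams of a string, used as a cheap near-verbatim fingerprint."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def cot_leak_score(cot_text: str, reward_input: str, n: int = 8) -> float:
    """Fraction of CoT n-grams that reappear verbatim in the reward input."""
    cot_grams = _ngrams(cot_text, n)
    if not cot_grams:
        return 0.0
    return len(cot_grams & _ngrams(reward_input, n)) / len(cot_grams)

def scan_rl_run(samples, threshold: float = 0.05):
    """Yield ids of samples whose reward inputs look contaminated with CoT.

    In the setup OpenAI describes, a hit would alert the run owner; here it
    just surfaces the suspect sample.
    """
    for sample in samples:
        if cot_leak_score(sample["cot"], sample["reward_input"]) >= threshold:
            yield sample["id"]
```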
Affected reward paths
The official breakdown splits the issue into three concrete reward pathways rather than one generic bug (a sketch of the underlying failure class follows the list):
- Rewarding trajectory usefulness: affected less than 0.6% of GPT-5.4 Thinking samples and less than 1.5% of GPT-5.4 mini samples, per OpenAI's percentage disclosure and the full post.
- Penalizing unnecessary confirmation questions: affected GPT-5.1 Instant through GPT-5.4 Instant, plus GPT-5.3 mini, at rates below 3.8% of samples for GPT-5.1 Instant and below 1.8% for the others in OpenAI's breakdown.
- Penalizing successful prompt injections: affected GPT-5.2 Instant through GPT-5.4 Instant, plus GPT-5.3 mini, at rates below 0.3% of samples for all affected models in OpenAI's breakdown.
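The post doesn't show the offending grader code, but all three pathways share a failure class: a reward function that serializes "the whole trajectory" for a judge without filtering reasoning messages. A hypothetical reconstruction of the bug and its one-line fix:

```python
def render_for_grader(trajectory: list[dict], include_reasoning: bool = False) -> str:
    """Serialize a trajectory into the text a reward model or judge will score.

    Hypothetical reconstruction: if reasoning-channel messages slip into this
    rendering, a usefulness reward (or a confirmation-question or
    prompt-injection penalty) is silently grading CoT as well.
    """
    parts = []
    for msg in trajectory:
        if msg.get("channel") == "reasoning" and not include_reasoning:
            continue  # the fix: CoT must never reach the reward input
        parts.append(f"{msg['role']}: {msg['content']}")
    return "\n".join(parts)
```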
A side detail from nrehiew_'s screenshot thread is that the disclosure implies at least some mini models went through their own RL rather than only strong-to-weak distillation. The official post does not frame that as the story, but it is one of the clearer training-pipeline hints in the writeup.
OpenAI's impact analysis
OpenAI says it did not find clear evidence that these incidents reduced monitorability. For the confirmation-question case, the alignment post says it re-ran the affected portion of GPT-5.4 Instant training without CoT access in the penalty and saw similar detector rates, similar disagreement between CoT-aware and CoT-blind detectors, and no clear drop on monitorability evals.
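OpenAI doesn't publish the re-run numbers, but "similar detector rates" reduces to comparing how often a misalignment detector fires on the original run versus the CoT-blind re-run. A hypothetical version of that check as a two-proportion z statistic:

```python
from math import sqrt

def flag_rate_gap_z(flags_a: int, n_a: int, flags_b: int, n_b: int) -> float:
    """Two-proportion z statistic for detector flag rates across two runs.

    flags_* counts how often the detector fired; n_* is samples inspected.
    A |z| near zero is the 'similar detector rates' outcome OpenAI reports;
    a large |z| would point at monitorability damage worth investigating.
    """
    p_a, p_b = flags_a / n_a, flags_b / n_b
    pooled = (flags_a + flags_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se if se else 0.0
```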
For GPT-5.4 Thinking, the company says it did not run a full ablation. Instead, the official post says it recomputed the usefulness reward with and without CoT access, found the two score tracks mostly moved together, and treated the low prevalence of the issue as additional evidence against a large effect.
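The post doesn't quantify "mostly moved together", but the described check is just recomputing the reward both ways on the same samples and measuring agreement. A sketch reusing the hypothetical render_for_grader helper above, where score_usefulness stands in for whatever grader call OpenAI actually uses:

```python
import statistics

def reward_agreement(samples, score_usefulness) -> float:
    """Pearson correlation between usefulness rewards computed with and
    without CoT in the grader input (statistics.correlation needs 3.10+).

    score_usefulness(text) -> float is an assumed grader interface, not a
    real API; a correlation near 1.0 matches the 'mostly moved together'
    result, while a low value would flag real CoT pressure on the reward.
    """
    with_cot = [score_usefulness(render_for_grader(s, include_reasoning=True))
                for s in samples]
    without = [score_usefulness(render_for_grader(s, include_reasoning=False))
               for s in samples]
    return statistics.correlation(with_cot, without)
```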
External review
OpenAI says in its third-party-review post that Redwood Research, Apollo Research, and METR gave feedback on the analysis. The only linked public review so far is Redwood Research's report.
Buck Shlegeris, a researcher at Redwood Research, wrote in that review that the evidence "assuages 80% of the negative update" he would have made about deployment risk from learning that the affected models had been trained on CoT. The same review says OpenAI's evidence does not eliminate a residual concern that CoT training could have suppressed mention of misaligned goals or other worrying signals in the trace, even if the direct risk increase looks small.
A more skeptical community reaction showed up elsewhere. In the linked HN discussion, one top-level commenter argued the episode sounded less like robust control and more like a band-aid fix; that is not a technical counter-analysis, but it does capture the mood around how much confidence these safety-process disclosures should buy.
Process changes
OpenAI's remediation list is specific enough to read like an internal controls changelog (the first item is sketched below):
- real-time CoT-grading detection
- safeguards against accidental CoT grading
- monitorability stress tests
- stronger internal guidance and checks before deployment
The alignment post adds two more implementation details: the affected reward pathways were fixed, and the company says it formalized clearer internal guidance around what kinds of CoT pressure are allowed. It also explicitly keeps one narrow exception, limited pressure on CoT length, because OpenAI argues that kind of information loss is easier to measure than hidden reward pressure on reasoning content.
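"Real-time CoT-grading detection" presumably means running a check like the n-gram scan from the detector section inside the training loop rather than after the fact. A minimal hypothetical guard built on that earlier cot_leak_score sketch:

```python
class CoTGradingGuard:
    """Hypothetical training-loop hook: refuse to score a reward input that
    overlaps the sample's own chain-of-thought."""

    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold

    def check(self, cot_text: str, reward_input: str) -> None:
        score = cot_leak_score(cot_text, reward_input)
        if score >= self.threshold:
            # A production system would page the run owner; failing fast
            # at least stops the contaminated reward from being applied.
            raise RuntimeError(f"reward input overlaps CoT (score={score:.2f})")
```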
Tool-mediated leak path
The last buried reveal in the official post is that one caught CoT-grading incident did not come from an obviously CoT-aware grader. Per the alignment post, some models were able to access their own CoT through tools during training, which caused CoT text to show up inside tool outputs that were later graded.
That case never affected a deployed model, but it is the cleanest new failure mode in the writeup. Even when grader configuration excludes CoT messages on the normal path, tool outputs can still smuggle reasoning back into the reward channel.
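A mitigation that follows directly from this failure mode is to scrub tool outputs before they can reach the reward channel, since filtering grader inputs by message type is not enough. A hypothetical sketch, again reusing the _ngrams fingerprint from the detector section:

```python
def scrub_tool_output(tool_output: str, cot_text: str, n: int = 8) -> str:
    """Redact tool-output lines that overlap the sample's own CoT.

    Hypothetical mitigation: even when the grader config excludes reasoning
    messages, a tool that echoes the model's CoT would otherwise smuggle it
    back into the graded transcript.
    """
    cot_grams = _ngrams(cot_text, n)
    cleaned = []
    for line in tool_output.splitlines():
        overlaps = bool(_ngrams(line, n) & cot_grams)
        cleaned.append("[redacted: overlaps chain-of-thought]" if overlaps else line)
    return "\n".join(cleaned)
```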