OpenAI launches ChatGPT for Clinicians and HealthBench Professional in U.S. preview
OpenAI introduced a free ChatGPT tier for verified U.S. clinicians and released HealthBench Professional, an open benchmark built from real clinical chat tasks. The launch pairs a clinician-facing workflow product with a public evaluation set and published model results.

TL;DR
- OpenAI shipped two linked health products at once: thekaransinghal's launch thread introduced a free ChatGPT tier for verified U.S. clinicians, while the HealthBench Professional announcement introduced an open benchmark for real clinician chat tasks.
- According to thekaransinghal's rollout post, ChatGPT for Clinicians starts with U.S. physicians, nurse practitioners, physician assistants, and pharmacists. The feature list says it includes clinical search, reusable workflow skills, deep research, CME credit, and optional HIPAA support.
- OpenAI says in thekaransinghal's safety note that physicians tested about 7,000 conversations before launch and rated 99.6% of responses safe and accurate.
- HealthBench Professional is built from physician-authored conversations and rubrics, and the benchmark design post says about one-third of examples involve deliberate physician red teaming with the dataset enriched 3.5x for hard conversations.
- On OpenAI's own benchmark, thekaransinghal's results post says GPT-5.4 inside ChatGPT for Clinicians beat base GPT-5.4, other listed frontier models, and physician-written responses.
You can read OpenAI's launch post, skim the embedded feature list in Greg Brockman's post, and check Ethan Mollick's chart screenshot for how OpenAI sliced HealthBench Professional by difficulty and specialty. iScienceLuvr's reaction also captures the part many model launches skip: OpenAI paired the product launch with a public healthcare benchmark. Thom Wolf's earlier HealthBench mention suggests the benchmark was already useful enough for outside agent-training experiments.
ChatGPT for Clinicians
OpenAI is positioning the product around three jobs clinicians already bring to general-purpose models: care consults, writing and documentation, and medical research, according to thekaransinghal's rollout post. The first rollout is a U.S. preview for verified clinicians only.
The shipped feature set, per thekaransinghal's feature list, is unusually workflow-heavy for a free tier:
- Free access to advanced models
- Clinical search over trusted sources
- Reusable skills for repeatable workflows
- Deep research across medical literature
- CME credit
- Privacy controls with no model training, plus optional HIPAA support
That mix makes this look less like a vertical chatbot skin and more like a constrained work surface for search, paperwork, and literature review.
HealthBench Professional
OpenAI says the benchmark covers the same three categories as the product launch: care consults, writing and documentation, and medical research, according to the HealthBench Professional announcement. The stated goal in OpenAI's launch post is to measure realistic clinical chat workflows rather than narrow medical QA.
The benchmark design details, drawn mostly from thekaransinghal's design post, are the more interesting part:
- Physician-authored conversations
- Physician-written rubrics (a scoring sketch follows this list)
- Multi-stage adjudication and data filtering
- Multiple-physician verification, per thekaransinghal's benchmark criteria
- Deliberate physician red teaming in about one-third of examples
- A 3.5x enrichment toward the conversations hardest for OpenAI's models
- Physician-written reference responses for every example, written by specialty-matched physicians with unbounded time and web access, per thekaransinghal's reference-response note
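OpenAI has not published grading code for HealthBench Professional, but the original HealthBench scored a response by having a grader model judge each physician-written rubric criterion and summing point values. Here is a minimal sketch of that scoring shape, assuming the same mechanics carry over; the Criterion fields, the example rubric, and the `met` judgments are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str    # physician-written criterion text
    points: int  # positive for desired behavior, negative for harmful behavior

def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Original-HealthBench-style score: points earned on criteria the grader
    judged met, divided by the maximum achievable (sum of positive points),
    clipped to [0, 1]."""
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_points))

# Invented example rubric for a chest-pain consult:
rubric = [
    Criterion("recommends urgent in-person evaluation", 7),
    Criterion("asks about symptom onset and duration", 3),
    Criterion("suggests an unnecessary prescription", -5),  # penalty criterion
]
# In a real run, `met` would come from a grader model reading the response.
print(rubric_score(rubric, met=[True, True, True]))  # (7 + 3 - 5) / 10 = 0.5
```

The clip at zero means penalty criteria can erase credit but never push a score negative, which is the kind of detail that matters for the red-teamed third of the dataset.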
One immediate community reaction, from iScienceLuvr's post, was relief that OpenAI published a healthcare benchmark alongside the product at all. In follow-up posts on Medmarks costs and on benchmark expense, iScienceLuvr noted that serious medical benchmark suites are expensive to run, which helps explain why comparable public evaluations are still sparse.
HealthBench scores
OpenAI's headline result is that GPT-5.4 inside ChatGPT for Clinicians beat base GPT-5.4, other OpenAI and external models, and physician-written responses on HealthBench Professional, according to thekaransinghal's results post. Ethan Mollick's post adds the obvious caveat: the benchmark is open, but it was designed by OpenAI.
Mollick's screenshot is useful because it shows the slices OpenAI chose to expose publicly: good-faith difficult cases, good-faith typical cases, red-teaming difficult cases, and specialty breakdowns across areas including nephrology, psychiatry, neurology, dermatology, cardiology, anesthesia, orthopedics, heme-onc, pediatrics, and OB-GYN.
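Slice scores like the ones in that screenshot are typically just per-tag means over example-level rubric scores. A minimal sketch, assuming each graded example carries intent, difficulty, and specialty tags; the tag values and scores below are invented.

```python
from collections import defaultdict

# Invented example-level results: (rubric_score, tags). The tag axes mirror
# the slices in Mollick's screenshot: intent x difficulty, plus specialty.
results = [
    (0.82, {"intent": "good-faith", "difficulty": "typical",   "specialty": "cardiology"}),
    (0.61, {"intent": "good-faith", "difficulty": "difficult", "specialty": "nephrology"}),
    (0.43, {"intent": "red-teaming", "difficulty": "difficult", "specialty": "psychiatry"}),
    (0.74, {"intent": "good-faith", "difficulty": "typical",   "specialty": "cardiology"}),
]

def slice_means(results, key):
    """Mean score per value of one tag axis, e.g. per specialty."""
    buckets = defaultdict(list)
    for score, tags in results:
        buckets[tags[key]].append(score)
    return {value: sum(scores) / len(scores) for value, scores in buckets.items()}

print(slice_means(results, "specialty"))   # {'cardiology': 0.78, 'nephrology': 0.61, ...}
print(slice_means(results, "difficulty"))  # typical vs. difficult means
```

The arithmetic is trivial; the editorial decision is which tag axes get published, which is what Mollick's caveat is really about.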
There is already one sign the benchmark is escaping the launch post. In his thread about Hugging Face agents, Thom Wolf said an external agent-training setup generated 1,100 synthetic HealthBench examples, upsampled them 50x, and beat Codex by 60% on the benchmark. That makes HealthBench look less like a marketing chart and more like fresh eval terrain for post-training systems.
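Wolf's thread gives numbers but not code, so the following is only a guess at what "upsampled them 50x" means in practice: repeating a small synthetic set until it carries outsized weight in a post-training mix. The corpus sizes and names below are invented.

```python
import random

random.seed(0)

# Invented stand-ins: a large base post-training corpus plus the 1,100
# synthetic HealthBench-style examples from Wolf's description.
base_corpus = [f"base_{i}" for i in range(100_000)]
synthetic = [f"healthbench_synth_{i}" for i in range(1_100)]

UPSAMPLE = 50
mixed = base_corpus + synthetic * UPSAMPLE  # 1,100 * 50 = 55,000 synthetic rows
random.shuffle(mixed)

share = len(synthetic) * UPSAMPLE / len(mixed)
print(f"synthetic share of the mix: {share:.1%}")  # ~35.5%, vs. ~1.1% without upsampling
```

Whether the Hugging Face setup duplicated rows or reweighted a sampler, the effect is the same: a 1,100-example set punches far above its size during training.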
Health review loop
The last detail worth bookmarking is the process claim behind the launch. thekaransinghal's health training note says an OpenAI health response is reviewed by a physician every few minutes, and that health is included in every major stage of model training.
That line does two things the product page alone does not. It ties the clinician-facing surface to an ongoing physician review pipeline, and it suggests OpenAI is treating health as a standing eval and training track rather than a one-off vertical launch.