AI Primer
breaking

OpenAI launches ChatGPT for Clinicians and HealthBench Professional in U.S. preview

OpenAI introduced a free ChatGPT tier for verified U.S. clinicians and released HealthBench Professional, an open benchmark built from real clinical chat tasks. The launch pairs a clinician-facing workflow product with a public evaluation set and published model results.


TL;DR

  • OpenAI shipped two linked health products at once: thekaransinghal's launch thread introduced a free ChatGPT tier for verified U.S. clinicians, while the HealthBench Professional announcement introduced an open benchmark for real clinician chat tasks.
  • According to thekaransinghal's rollout post, ChatGPT for Clinicians starts with U.S. physicians, nurse practitioners, physician assistants, and pharmacists, and the feature list says it includes clinical search, reusable workflow skills, deep research, CME credit, and optional HIPAA support.
  • OpenAI says in thekaransinghal's safety note that physicians tested about 7,000 conversations before launch and rated 99.6% of responses safe and accurate.
  • HealthBench Professional is built from physician-authored conversations and rubrics, and the benchmark design post says about one-third of examples involve deliberate physician red teaming with the dataset enriched 3.5x for hard conversations.
  • On OpenAI's own benchmark, thekaransinghal's results post says GPT-5.4 inside ChatGPT for Clinicians beat base GPT-5.4, other listed frontier models, and physician-written responses.

You can read OpenAI's launch post, skim the embedded feature list in Greg Brockman's post, and check Ethan Mollick's chart screenshot for how OpenAI sliced HealthBench Professional by difficulty and specialty. iScienceLuvr's reaction also captures the part many model launches skip: OpenAI paired the product launch with a public healthcare benchmark, and Thom Wolf's earlier HealthBench mention suggests the benchmark was already useful enough for outside agent-training experiments.

ChatGPT for Clinicians

OpenAI is positioning the product around three jobs clinicians already bring to general-purpose models: care consults, writing and documentation, and medical research, according to thekaransinghal's rollout post. The first rollout is a U.S. preview for verified clinicians only.

The shipped feature set is unusually workflow-heavy for a free tier: clinical search, reusable workflow skills, deep research, CME credit, and optional HIPAA support.

That mix makes this look less like a vertical chatbot skin and more like a constrained work surface for search, paperwork, and literature review.

HealthBench Professional

OpenAI says the benchmark covers the same three categories as the product launch: care consults, writing and documentation, and medical research, according to the HealthBench Professional announcement. The stated goal in OpenAI's launch post is to measure realistic clinical chat workflows rather than narrow medical QA.

The benchmark design details are the more interesting part: the examples are built from physician-authored conversations and grading rubrics, roughly one-third involve deliberate physician red teaming, and the dataset is enriched 3.5x for hard conversations.
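OpenAI has not published the exact scoring formula for HealthBench Professional, but rubric-graded benchmarks of this kind typically score a response as points earned over the maximum achievable positive points, with negative-weight criteria penalizing unsafe content. A minimal sketch, with entirely illustrative criterion names and weights:

```python
# Hedged sketch of rubric-based grading as used by rubric benchmarks
# generally; the criteria and point values below are hypothetical,
# not taken from OpenAI's actual dataset.

def rubric_score(criteria, met):
    """Score one response: earned points / max achievable positive points.

    criteria: dict of criterion name -> point value (negative values
              penalize unsafe or incorrect content).
    met: set of criterion names a grader judged the response to satisfy.
    """
    max_points = sum(p for p in criteria.values() if p > 0)
    earned = sum(p for name, p in criteria.items() if name in met)
    # Clamp at zero so heavy penalties cannot produce a negative score.
    return max(0.0, earned / max_points) if max_points else 0.0

example = {
    "cites_guideline_dosing": 5,
    "flags_drug_interaction": 5,
    "recommends_escalation_when_unsure": 3,
    "asserts_unsupported_diagnosis": -4,  # penalty criterion
}

score = rubric_score(example, met={"cites_guideline_dosing",
                                   "flags_drug_interaction"})
```

Under this scheme, a response meeting the first two criteria would earn 10 of 13 possible points; one that only triggered the penalty criterion would floor at zero.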

One immediate community reaction, from iScienceLuvr's post, was relief that OpenAI published a healthcare benchmark alongside the product at all. In follow-ups, iScienceLuvr on Medmarks costs and iScienceLuvr on benchmark expense noted that serious medical benchmark suites are expensive to run, which helps explain why comparable public evaluations are still sparse.

HealthBench scores

OpenAI's headline result is that GPT-5.4 inside ChatGPT for Clinicians beat base GPT-5.4, other OpenAI and external models, and physician-written responses on HealthBench Professional, according to thekaransinghal's results post. Ethan Mollick's post adds the obvious caveat: the benchmark is open, but it was designed by OpenAI.

Mollick's screenshot is useful because it shows the slices OpenAI chose to expose publicly: good-faith difficult cases, good-faith typical cases, red-teaming difficult cases, and specialty breakdowns across areas including nephrology, psychiatry, neurology, dermatology, cardiology, anesthesia, orthopedics, heme-onc, pediatrics, and OB-GYN.

There is already one sign the benchmark is escaping the launch post. In Thom Wolf's thread about Hugging Face agents, he said an external agent-training setup generated 1,100 synthetic HealthBench examples, upsampled them 50x, and beat Codex by 60% on the benchmark. That makes HealthBench look less like a marketing chart and more like fresh eval terrain for post-training systems.
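Neither the 3.5x enrichment nor the 50x upsampling is specified in detail, but both amount to the same basic move: repeating hard examples so they occupy a larger share of the pool. A minimal sketch, with illustrative names and counts:

```python
import random

# Hedged sketch of upsampling hard examples in a training or eval mix,
# in the spirit of the 50x upsampling described above. All names and
# counts are illustrative, not from the actual setup.

def build_mix(base, hard, upsample=50):
    """Return a pool where each hard example appears `upsample` times."""
    return base + hard * upsample

base = [f"base_{i}" for i in range(1000)]  # ordinary examples
hard = [f"hard_{i}" for i in range(20)]    # deliberately difficult examples

mix = build_mix(base, hard)
random.shuffle(mix)                 # shuffle before sampling from the pool
assert len(mix) == 1000 + 20 * 50   # hard cases are now half the pool
```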

Health review loop

The last detail worth bookmarking is the process claim behind the launch. thekaransinghal's health training note says that, on average, a physician reviews an OpenAI health response every few minutes, and that health is included in every major stage of model training.

That line does two things the product page alone does not. It ties the clinician-facing surface to an ongoing physician review pipeline, and it suggests OpenAI is treating health as a standing eval and training track rather than a one-off vertical launch.
