OpenAI reports internal reasoning model disproves Erdős's 1946 unit-distance conjecture
OpenAI said an internal general-purpose reasoning model disproved Erdős's 1946 unit-distance conjecture without a math-specific scaffold or Lean. If the linked proof and expert commentary hold up, it shifts frontier-model discussion toward original research, not just benchmark performance.

TL;DR
- OpenAI says an internal general-purpose reasoning model disproved Erdős's 1946 conjecture on the planar unit-distance problem. OpenAI's announcement calls the result the first autonomous solution of a prominent open problem central to a math subfield.
- According to polynoamial's thread and polynoamial's Lean reply, the system was not a math-specific scaffold, was not targeted at this problem, and did not use Lean.
- SherylHsu02's post says leading mathematicians including Tim Gowers would accept the work into Annals of Mathematics "without any hesitation," while the companion remarks PDF adds outside commentary on the proof.
- The interesting jump is the time horizon: SherylHsu02's thread and alexwei_'s post both frame this as a move from IMO-gold performance to original research in about ten months.
- The open question now is productization, not just bragging rights. polynoamial's follow-up says OpenAI has not pushed the model to the limit on open problems because the priority is to get it out quickly for public use.
OpenAI put up an official writeup, and there is a separate remarks PDF from outside experts. The weirdest reveal is how much of the story is about what the model was not: not math-specialized, not scaffolded, not Lean-backed. willdepue's reaction fixated on the published chain-of-thought summary being plain English, while emollick's question noted that OpenAI still has not publicly identified the exact model family behind either this result or the earlier IMO-gold run.
The result
OpenAI's core claim is concrete: the model found a new family of point constructions for the planar unit distance problem that beats the square-grid intuition mathematicians had leaned on for decades.
The company's post says the conjecture had stood since 1946 and that the model disproved the belief that near-square-grid constructions were best. OpenAI also published the full announcement, which is the canonical source for the claim.
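To make the square-grid intuition concrete, here is a minimal Python sketch that counts point pairs at one fixed distance in a k-by-k integer grid. It illustrates only the classical grid idea, not the model's construction, which the sources here do not detail. The function name and the sample values (k = 10, squared distances 1 and 25) are illustrative assumptions, not taken from OpenAI's writeup.

```python
from itertools import combinations, product


def count_pairs_at_squared_distance(k, d2):
    """Count unordered point pairs in a k-by-k integer grid whose squared distance is d2."""
    points = list(product(range(k), repeat=2))
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == d2
    )


# Illustrative values only: a 10-by-10 grid (100 points).
unit_spacing = count_pairs_at_squared_distance(10, 1)   # nearest neighbours only
rescaled = count_pairs_at_squared_distance(10, 25)      # 25 = 5^2 + 0^2 = 3^2 + 4^2

print(unit_spacing, rescaled)  # 180 268
```

Dividing every coordinate by 5 turns the second count into a unit-distance count on the same 100 points. That is why a well-chosen rescaling of the grid beats plain unit spacing: 25 can be written as a sum of two squares in more than one way, so more grid directions contribute pairs.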
A useful detail from kimmonismus's summary and alexwei_'s thread is the cross-domain route: the proof reportedly pulls tools from algebraic number theory, specifically class field towers, into a discrete-geometry question where many people did not expect them to matter.
General-purpose model
The most important implementation detail is negative space. OpenAI employees kept repeating what the system was not.
According to polynoamial, this was a general-purpose internal LLM, not a system targeted at mathematics or this problem. The same thread's follow-up adds that it was not a scaffold, and polynoamial's later reply says it did not use Lean.
That lines up with SherylHsu02's post, which says the model was not trained with the goal of doing math research, and with her next post, which describes it as a general daily-driver model for debugging experiments and writing technical reports.
The unresolved product question sits right in the middle of the hype. polynoamial's follow-up says OpenAI has not pushed this model to the limit on open problems because the focus is getting it out quickly, but emollick's post notes the company still has not publicly named the internal model used for the earlier IMO-gold milestone.
External mathematicians
The story gets teeth because OpenAI attached outside commentary, not just an internal victory lap.
The companion remarks PDF is where the stronger academic framing lives. Among the posts, SherylHsu02 says Tim Gowers would accept the result into Annals of Mathematics without hesitation, and scaling01 quotes Gowers describing it as the first really clear example of AI solving a well-known open math problem.
The best corrective comes from deredleritt3r's quote thread: the model did not invent a new branch of mathematics. The claimed breakthrough is expert execution at extreme breadth, deep command of known results, and a useful choice of tools and parameters that human researchers had not combined this way.
That distinction matters because it moves the claim away from magical "AI discovered new physics" framing and toward something engineers can map onto existing systems work: broader retrieval, longer coherent search, and better cross-domain composition.
Time horizon
The most startling number in the evidence is not from the proof. It is the gap between milestones.
OpenAI staffers and close observers kept returning to the same comparison:
- IMO gold-level performance, roughly mid-2025, per polynoamial's thread
- original math research, May 2026, per SherylHsu02's thread
- a jump from 1.5-hour olympiad proof horizons to research efforts that alexwei_ describes as hundreds of hours
emollick's timeline post compresses the public perception shift even harder, from strawberry-counting memes in 2024 to Olympiad gold in 2025 to a famous combinatorial-geometry result in 2026.
Talk about the hidden model's identity is still just speculation. daniel_mac8's follow-up says "word on the street" was GPT-5.6 rather than GPT-6, while scaling01's thread offered another guess. No primary source in the evidence names the model.
Resource estimates
One late wrinkle in the discussion was cost and energy. It is thinly sourced and explicitly conditional, but it is new information.
Working from public estimates for LLM throughput and infrastructure intensity, willdepue's napkin math put the run at roughly 5 to 32 hours and perhaps $120 to $1,000 in tokens. emollick's estimate put the energy range at 0.6 to 6.3 kWh and direct cooling water at about 3 to 31 liters, citing an arXiv paper and an LBNL report.
Those numbers are back-of-the-envelope, not OpenAI disclosures. The useful point is narrower: the public debate moved almost immediately from "can a model do original research" to "how much test-time compute did it need," which is a very different frontier conversation from benchmark leaderboards.
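As a rough sanity check on the quoted ranges, the short Python sketch below derives the implied rates by dividing one range by another. It assumes the low ends pair with each other and the high ends pair with each other, which the posts do not state. The `Range` helper and variable names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A low/high pair taken from a quoted estimate."""
    low: float
    high: float

    def ratio_to(self, other):
        # Assumption: low pairs with low and high with high.
        return Range(self.low / other.low, self.high / other.high)


# Figures as quoted in the posts above (not OpenAI disclosures).
hours = Range(5, 32)
cost_usd = Range(120, 1000)
energy_kwh = Range(0.6, 6.3)
water_l = Range(3, 31)

usd_per_hour = cost_usd.ratio_to(hours)
litres_per_kwh = water_l.ratio_to(energy_kwh)

print(f"implied token cost: ${usd_per_hour.low:.2f} to ${usd_per_hour.high:.2f} per hour")
print(f"implied water use: {litres_per_kwh.low:.2f} to {litres_per_kwh.high:.2f} L per kWh")
```

Under that pairing assumption, the water figures work out to roughly 5 liters per kWh at both ends, which suggests the water range was scaled directly from the energy range rather than estimated independently.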